DNABIT Compress – Genome compression algorithm

Data compression is concerned with how information is organized in data. Efficient storage means removing redundancy from the data stored in the DNA molecule. Data compression algorithms remove this redundancy and are also used to understand biologically important molecules. We present a compression algorithm for DNA sequences, “DNABIT Compress”, based on a novel scheme that assigns binary bits to small segments of DNA bases in order to compress both repetitive and non-repetitive DNA sequences. The proposed algorithm achieves the best compression ratio for DNA sequences of larger genomes, and its significantly better compression results indicate that “DNABIT Compress” outperforms the other compression algorithms. While achieving the best compression ratios for DNA sequences (genomes), DNABIT Compress also significantly improves on the running time of all previous DNA compression programs. Assigning binary bits (a unique BIT CODE) to exact-repeat and reverse-repeat fragments of a DNA sequence is a concept introduced in this algorithm for the first time in DNA compression. The proposed algorithm achieves a compression ratio as low as 1.58 bits/base, whereas the best existing methods could not achieve a ratio below 1.72 bits/base.


5 BIT Technique:
If the DNA sequence contains more than 3 and up to 8 repeats (4, 5, 6, 7 or 8), that is, four to eight identical bases next to one another, the 5 Bit technique is applied. The encoded string is represented as a 5 Bit CODE (Table 4-8).
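
The 5 Bit technique can be illustrated with a minimal sketch. It assumes a layout of 3 count bits followed by 2 base bits, a count field that stores the run length minus one (which matches the worked examples 6 -> 101 and 8 -> 111), and the base codes listed in the decoding procedure (a=00, g=01, c=10, t=11). The worked example in the text uses 10 for "g", so the mapping here is an assumption, and the names `BASE_CODE` and `encode_run_5bit` are hypothetical.

```python
# Assumed base codes, taken from the table in the decoding procedure.
BASE_CODE = {"a": "00", "g": "01", "c": "10", "t": "11"}


def encode_run_5bit(base: str, count: int) -> str:
    """Encode a run of 4 to 8 identical bases as 3 count bits + 2 base bits."""
    if not 4 <= count <= 8:
        raise ValueError("the 5 Bit technique covers runs of 4 to 8 bases")
    # Assumption: the count field stores count - 1 (6 -> 101, 8 -> 111).
    count_bits = format(count - 1, "03b")
    return count_bits + BASE_CODE[base.lower()]


print(encode_run_5bit("t", 6))  # 10111
```

Under these assumptions a run of six "t" bases costs 5 bits instead of the 12 bits needed at 2 bits per base.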

7 BIT Technique: 7 BIT CODE for 2 EXACT BASE REPEATS:
If a fragment of 2 characters repeats more than once and up to 8 times in the given string, it is represented as a 7 bit sequence. In this 7 bit code, the first 3 bits represent the number of repeats of the fragment. The other 4 bits represent the code for its two characters (Table 9).
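
A minimal sketch of the 7 bit code follows. The split into 3 count bits and 4 character bits comes from the description above. The count-minus-one convention and the base codes (a=00, g=01, c=10, t=11) are assumptions, and `encode_pair_repeat_7bit` is a hypothetical name.

```python
# Assumed base codes, taken from the table in the decoding procedure.
BASE_CODE = {"a": "00", "g": "01", "c": "10", "t": "11"}


def encode_pair_repeat_7bit(pair: str, repeats: int) -> str:
    """Encode a 2-base fragment repeated 2 to 8 times as 3 + 4 bits."""
    if len(pair) != 2:
        raise ValueError("the fragment must contain exactly 2 bases")
    if not 2 <= repeats <= 8:
        raise ValueError("the 7 bit code covers 2 to 8 repeats")
    # Assumption: the count field stores repeats - 1, so 8 fits in 3 bits.
    count_bits = format(repeats - 1, "03b")
    return count_bits + "".join(BASE_CODE[b] for b in pair.lower())


print(encode_pair_repeat_7bit("ac", 3))  # 0100010
```

Here "acacac" (6 bases, 12 bits at 2 bits per base) is stored in 7 bits.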

9 BIT Technique:
The 9 BIT CODE covers two techniques: exact repeats and reverse repeats. If a fragment of 4 consecutive bases is repeated, the encoded string is taken to be a 9 bit CODE. The first significant bit is either "0" or "1": "0" indicates that the repeat is an exact repeat, and "1" indicates that the repeat is a reverse repeat. The layout of the 9 bits is the significant bit followed by the codes of the four bases (for example a, g, c, t). The other 8 bits represent the CODE for each base, 2 bits per base.
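
The 9 bit layout can be sketched as follows. The flag values ("0" for exact, "1" for reverse) come from the description above. The base codes (a=00, g=01, c=10, t=11) are assumed from the decoding procedure, and the function simply encodes whatever 4-base fragment it is given, since the text does not say whether a reverse repeat is stored in its original or reversed order. `encode_fragment_9bit` is a hypothetical name.

```python
# Assumed base codes, taken from the table in the decoding procedure.
BASE_CODE = {"a": "00", "g": "01", "c": "10", "t": "11"}


def encode_fragment_9bit(fragment: str, reverse: bool) -> str:
    """Encode a repeated 4-base fragment as 1 flag bit + 8 base bits."""
    if len(fragment) != 4:
        raise ValueError("the fragment must contain exactly 4 bases")
    flag = "1" if reverse else "0"  # "0" = exact repeat, "1" = reverse repeat
    return flag + "".join(BASE_CODE[b] for b in fragment.lower())


print(encode_fragment_9bit("agct", reverse=False))  # 000011011
print(encode_fragment_9bit("agct", reverse=True))   # 100011011
```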

Software
In the above sequence, "t" is repeated 6 times, and the repeat count 6 is coded as "101". The first 3 bits are the code for 6. The other 2 bits are the code for the character "t". Similarly, in 11110, "g" is repeated 8 times: the code for 8 is "111", and the next "10" is the code for "g". (Both examples are consistent with storing the repeat count minus one: 5 is 101 and 7 is 111 in binary.)

Procedure Encode:
Begin
1: Divide the given DNA sequence into fragments, where each fragment consists of 2 characters or 4 characters.
2: Generate all possible combinations of the DNA bases (A, C, G, T).
3: Apply the Even Bit Technique if adjacent bases do not match each other. (The DNA sequence is assigned two bits for every individual base of non-repeat regions.)
4: If there are two or three identical bases next to one another, the 3 Bit technique is applied.
5: If there are more than 3 and up to 8 repeats (4, 5, 6, 7 or 8), that is, four to eight identical bases next to one another, the 5 Bit technique is applied. The encoded string is represented as a 5 Bit CODE.
6: If a fragment of 2 characters repeats more than once and up to 8 times in the given string, it is represented as a 7 bit code. In this 7 bit code, the first 3 bits represent the number of repeats of the fragment. (The other 4 bits represent the code for its characters.)
7: If a fragment of 4 consecutive bases is repeated, the encoded string is taken to be a 9 bit CODE. The first significant bit is either "0" or "1". "0" indicates that the repeat is an exact repeat. "1" indicates that the repeat is a reverse repeat.
8: Transfer the binary bits to the output string (OUTSTRING).
End

The decoding algorithm follows the same procedure as encoding, in reverse.
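
Steps 3 to 5 of the encoding procedure can be sketched as a pass over runs of identical bases that records how many bits each run would cost. This is only an illustration under stated assumptions: it ignores the 7 bit and 9 bit fragment techniques (steps 6 and 7), it splits runs longer than 8 into chunks of at most 8 (the text does not say how such runs are handled), and `run_costs` is a hypothetical name.

```python
from itertools import groupby


def run_costs(sequence: str) -> list[tuple[str, int, int]]:
    """Return (base, run_length, bits) for each run, following steps 3 to 5."""
    costs = []
    for base, group in groupby(sequence.lower()):
        length = len(list(group))
        # Assumption: runs longer than 8 are split into chunks of at most 8.
        while length > 0:
            chunk = min(length, 8)
            if chunk == 1:
                bits = 2  # step 3: two bits for a non-repeated base
            elif chunk <= 3:
                bits = 3  # step 4: 3 Bit technique for 2 or 3 equal bases
            else:
                bits = 5  # step 5: 5 Bit technique for 4 to 8 equal bases
            costs.append((base, chunk, bits))
            length -= chunk
    return costs


runs = run_costs("acttttttgg")
print(runs)                      # [('a', 1, 2), ('c', 1, 2), ('t', 6, 5), ('g', 2, 3)]
print(sum(r[2] for r in runs))   # 12
```

For this 10-base toy sequence the run-based codes need 12 bits, compared with 20 bits at a flat 2 bits per base.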

DNABIT COMPRESS DECODING ALGORITHM:
Input: Input String
Output: Decoded String (DECSTRING)

Procedure Decode:
Begin
1: Generate all possible combinations for {A, C, G, T}.
2: Allocate a unique binary bit number (0 and 1) to each combination.
3: Divide the given binary code into segments.
4: According to the binary code (either 3 BIT CODE, 5 BIT CODE, 7 BIT CODE, or 9 BIT CODE), assign the appropriate base {a, c, g, t}.
5: Repeat step 4 until the end of the input sequence is reached.
6: If there are any individual bases (non-repeat regions), the corresponding binary code gets transformed. (Assigned values for bases are: a="00", g="01", c="10", t="11".)
End
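
Step 6 of the decoding procedure, which maps 2-bit codes of non-repeat regions back to bases, can be sketched as below. The code table is the one given in step 6. The function name `decode_2bit` is hypothetical, and the sketch assumes the input consists only of 2-bit base codes, with no 3, 5, 7 or 9 bit segments mixed in.

```python
# Base codes as listed in step 6 of the decoding procedure.
CODE_BASE = {"00": "a", "01": "g", "10": "c", "11": "t"}


def decode_2bit(bits: str) -> str:
    """Decode a string of 2-bit codes into bases (non-repeat regions only)."""
    if len(bits) % 2 != 0:
        raise ValueError("expected an even number of bits")
    return "".join(CODE_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(decode_2bit("00011011"))  # agct
```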

Methodology of DNABIT COMPRESS:
The DNA sequence is divided into fragments of four and two bases. Each fragment (ACGT) is replaced with binary code (0 or 1). The total number of bits required to encode the DNA sequence is then obtained, and the total number of bits per base (Ř) is calculated from it.

Thus our proposed algorithm, DNABIT Compress, has the following advantages:
i) A compression ratio of 1.5359 bits per base, compared to 1.76 bits per base for the other DNA compression algorithms.
ii) Because the method does not use the dynamic programming technique used by other methods, e.g., BioCompress, GenCompress etc., it is simple and takes less execution time.
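
The formula for the bits-per-base figure is not shown in the text, so the sketch below assumes the usual definition: the total number of encoded bits divided by the number of bases in the original sequence. The function name `bits_per_base` and the sample numbers are illustrative only.

```python
def bits_per_base(total_bits: int, n_bases: int) -> float:
    """Assumed definition: encoded bits divided by number of bases."""
    if n_bases <= 0:
        raise ValueError("the sequence must contain at least one base")
    return total_bits / n_bases


# Toy example: 10 bases stored in 12 bits.
print(bits_per_base(12, 10))  # 1.2
# Baseline: a flat 2-bit code for the same 10 bases.
print(bits_per_base(20, 10))  # 2.0
```

A value below 2.0 means the encoding beats the plain 2 bits per base needed for a four-letter alphabet.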