CoGI: Towards Compressing Genomes as an Image

Xiaojing Xie, Shuigeng Zhou and Jihong Guan

  1. Introduction
    1. For details of algorithms, please download the original paper.
    2. This tool currently runs only on the Linux operating system. To see how to run the tool, run it with '-h' first.

  2. Compile CoGI
    1. Download
      The documentation, source code, test datasets and other datasets used in the original paper can be downloaded from:
      http://admis.fudan.edu.cn/projects/cogi.htm
      source code: cogi.zip
    2. Compiling
      1. System requirements:
        A C/C++ compiler and the STL library. In the Makefile, we use g++ as the default compiler command.
      2. Run "make" in the terminal. All the executable files are placed in the "bin" directory.
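
Before running "make", it can help to confirm that the required tools are installed. The following is a minimal sketch only: check_build_tools is a hypothetical helper, not part of CoGI, and it simply looks up each named command on the PATH.

```shell
# Hypothetical helper (not part of CoGI): report any tool missing from PATH.
# Returns 0 if every named tool is found, 1 otherwise.
check_build_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

From the directory that contains the Makefile, one possible use is: check_build_tools g++ make && make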

  3. How to Use CoGI
    1. Compression
      1. compress both the reference and non-references:
        compress -n <sequence count> <reference> <seq path1> ...
      2. compress only the non-references:
        compress -ur -n <sequence count> <reference> <seq path1> ...
      3. compress using hash:
        compress -ur --hash -n <sequence count> <reference> <seq path1> ...
      4. compress, selecting a reference using MSE:
        compress --mse -n <sequence count> <reference> <seq path1> ...
      5. compress, selecting a reference using BCE:
        compress --bce -n <sequence count> <reference> <seq path1> ...
      Output:
      a) file named 'compressed';
      b) files with extension '.patch', corresponding to the compressed files.
    2. Uncompression
      1. uncompress sequences given a reference file:
        uncompress -r <reference file> -l <length of one line>
      2. uncompress both the sequences and the reference (default):
        uncompress
      Output:
      each sequence in one file, named 1.fastq, 2.fastq,...
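
The -l option takes the length of one line, which in the examples of this README matches the line width of the original sequence file (60 or 79). A minimal sketch for looking that value up is given here. It assumes that header lines, if present, start with '>', which this README does not state; line_length is a hypothetical helper, not part of CoGI.

```shell
# Hypothetical helper (not part of CoGI): print the length of the first
# non-empty sequence line of a file.
# Assumption: header lines, if any, start with '>'.
line_length() {
  awk '!/^>/ && length($0) > 0 { print length($0); exit }' "$1"
}
```

One possible use, with placeholder file names: bin/uncompress -r reference.fastq -l "$(line_length original.fastq)"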

  4. Example
    Note that the sequences are all converted to lowercase before we run the program. Here we provide the test datasets:
    korean0131chr1.fastq | korean0224chr1.fastq | tair8chr1.fastq | tair9chr1.fastq
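
For other datasets, the lowercase conversion mentioned above can be done with standard tools. This is a sketch under assumptions, not the authors' preprocessing step: to_lowercase is a hypothetical helper, and it assumes that header lines start with '>' and should be left unchanged.

```shell
# Hypothetical helper (not part of CoGI): lowercase the sequence lines of a
# file and write the result to standard output.
# Assumption: lines starting with '>' are headers and are copied as-is.
to_lowercase() {
  awk '/^>/ { print; next } { print tolower($0) }' "$1"
}
```

One possible use, with placeholder file names: to_lowercase raw.fastq > lower.fastq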
    1. compress and uncompress Chromosome 1 of KOREF_20090224 using KOREF_20090131 as reference:
      $ bin/compress -ur -n 2 korean0131chr1.fastq korean0224chr1.fastq
      output: compressed, 1.patch
      (-l 60: the length of each line is set to 60)
      $ bin/uncompress -r korean0131chr1.fastq -l 60
      output: 1.fastq
      $ diff 1.fastq korean0224chr1.fastq
    2. compress and uncompress Chromosome 1 of TAIR9 using TAIR8 as reference:
      (--hash: using hash for alignment)
      $ bin/compress -ur --hash -n 2 tair8chr1.fastq tair9chr1.fastq
      output: compressed, 1.patch
      (-l 79: the length of each line is set to 79)
      $ bin/uncompress -l 79 -r tair8chr1.fastq
      output: 1.fastq
      $ diff 1.fastq tair9chr1.fastq

  5. Data set

If you have problems, please contact xiexiaojing@fudan.edu.cn or sgzhou@fudan.edu.cn