Musket - a multistage k-mer spectrum based corrector

Introduction

Musket is a next-generation sequencing read error correction algorithm targeting Illumina sequencing. It employs the k-mer spectrum approach and introduces three correction techniques in a multistage workflow. Our performance evaluation, in terms of correction quality and de novo genome assembly measures, shows that Musket is consistently one of the top-performing substitution-error-based correctors. In addition, Musket is multi-threaded using a master-slave model and demonstrates superior parallel scalability compared to all other evaluated correctors, as well as a highly competitive overall execution time.

Downloads

Latest source code (release 1.1)
More details about the changes in this version are available in the changelog.
Other tools
- memusage: a program for peak virtual and resident memory calculation on Linux.

Citation

Yongchao Liu, Jan Schroeder and Bertil Schmidt: "Musket: a multistage k-mer spectrum based error corrector for Illumina sequence data". Bioinformatics, 2013, 29(3): 308-315.

Other related papers

Yongchao Liu, Bertil Schmidt, Douglas L. Maskell: "DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI". BMC Bioinformatics, 2011, 12:85.
Adrianto Wirawan, Robert S Harris, Yongchao Liu, Bertil Schmidt and Jan Schroeder: "HECTOR: A parallel multistage homopolymer spectrum based error corrector for 454 sequencing data". BMC Bioinformatics, 2014, 15:131.

Parameters

Basic:

-k <int uint> (specify two parameters: k-mer size and estimated total number of k-mers for this k-mer size)
The first value is the k-mer size; its default is 21. The second value specifies the estimated number of distinct k-mers in the input datasets (e.g. 67108864, 134217728, 268435456 or 536870912); its default is 536870912. The second value does not affect the correctness of the program, but it balances memory consumption between the Bloom filters and the hash tables and thus affects the overall memory footprint. Feel free to adjust it to your data.
-o <str> (the single output file name)
When this option is used, all corrected reads are written to this file, regardless of how many input files are given. This option is mutually exclusive with the option "-omulti".
-omulti <str> (prefix of output file names, one output file per input file)
When this option is used, each input file has its own output file. For each input file, all corrected reads are written to a file whose name starts with the prefix #str (the value of this option). The first input file has the output file name #str.0, the second #str.1, and so on. For example, the command "musket -omulti myout infile_1.fa infile_2.fa infile2_1.fa infile2_2.fa" will create 4 output files with names "myout.0", "myout.1", "myout.2" and "myout.3". File "myout.0" corresponds to file "infile_1.fa" and "myout.1" corresponds to "infile_2.fa", and so on.
-p <int> (number of threads [>=2], default 2)
-zlib <int> (zlib-compressed output, default=0)
-maxtrim <int> (maximal number of bases that can be trimmed, default 0)
-inorder (keep sequences output in the same order as in the input)
-lowercase (write corrected bases in lowercase, default 0)
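
One rough way to choose the second value of "-k" is to bound the number of distinct k-mers from the read lengths. The following is a minimal sketch in Python, not part of Musket: the helper name kmer_upper_bound is hypothetical, and the bound relies only on the facts that a read of length L contains at most L - k + 1 k-mers and that there are at most 4^k different k-mers over A, C, G, T.

```python
def kmer_upper_bound(read_lengths, k):
    """Upper bound on the number of distinct k-mers in a set of reads.

    Hypothetical helper for illustration only. A read of length L
    contributes at most L - k + 1 k-mers, and the alphabet A, C, G, T
    allows at most 4**k different k-mers in total.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    occurrences = sum(max(0, length - k + 1) for length in read_lengths)
    return min(occurrences, 4 ** k)


# Made-up example: one million reads of length 100 with k = 21.
bound = kmer_upper_bound([100] * 1_000_000, 21)
print(bound)  # 80000000
```

Because the second value does not affect correctness, a rough figure like this is enough; the true number of distinct k-mers can be considerably lower when reads overlap.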

Advanced:

-maxbuff <int> (capacity of message buffer for each worker, default 1024)
-multik <bool> (enable the use of multiple k-mer sizes, default 0)
-maxerr <int> (maximal number of mutations in any region of length #k, default 4)
-maxiter <int> (maximal number of correcting iterations per k-mer size, default 2)
-minmulti <int> (minimum multiplicity for correct k-mers [only applicable when not using multiple k-mer sizes], default 0)

Installation and Usage

Compilation

Type "make" in the root directory of the software to compile it.

Paired-end correction

Musket works with FASTA and FASTQ file formats, including gzip-compressed files.

We have simplified paired-end read correction since release 1.0.5: only the two options "-omulti" and "-inorder" are required.

The option "-omulti" enables each input file to have its own output file name with prefix specified by this option.
The option "-inorder" ensures that all reads in each input file are output in the same order as in the input.
For example, the command
- musket -omulti myout -inorder infile_1.fa infile_2.fa infile2_1.fa infile2_2.fa
will create 4 output files with names "myout.0", "myout.1", "myout.2" and "myout.3". File "myout.0" corresponds to file "infile_1.fa" and "myout.1" corresponds to "infile_2.fa", and so on.
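
The naming rule above can be sketched in a few lines of Python, for instance to find the output file that belongs to each mate file after a run. This is only an illustration of the rule as described here; the helper name omulti_output_names is hypothetical and not part of Musket.

```python
def omulti_output_names(prefix, input_files):
    """Pair each input file with the output name <prefix>.<index>.

    Hypothetical helper: it mirrors the "-omulti" naming rule described
    above, where the first input maps to <prefix>.0, the second to
    <prefix>.1, and so on.
    """
    return [(infile, f"{prefix}.{i}") for i, infile in enumerate(input_files)]


inputs = ["infile_1.fa", "infile_2.fa", "infile2_1.fa", "infile2_2.fa"]
for infile, outfile in omulti_output_names("myout", inputs):
    print(f"{infile} -> {outfile}")
```

Since the index follows the order of the input files on the command line, listing the two mates of each pair next to each other keeps their outputs at consecutive indices.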

For old versions (<=1.0.4), see the separate manual for paired-end correction.

Maximum sequence length

In the Makefile, the macro "MAX_SEQ_LENGTH" sets the maximum sequence length supported. The default value is currently 200. To support longer sequences, change the value of this macro and re-compile the code.

Maximum k-mer size

In the Makefile, the macro "MAX_KMER_SIZE" sets the maximum k-mer size supported. The default value is currently 28. To support larger k-mer sizes, change the value of this macro and re-compile the code.
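
Before running Musket it can be useful to check whether the data and the chosen k fit the two compile-time limits. The following Python sketch is an illustration only: the function names are hypothetical, the FASTA reader is deliberately minimal (no FASTQ or gzip handling), and it assumes that both limits are inclusive, which the text above does not state explicitly.

```python
DEFAULT_MAX_SEQ_LENGTH = 200  # default value stated above
DEFAULT_MAX_KMER_SIZE = 28    # default value stated above


def fasta_read_lengths(lines):
    """Yield the length of every sequence in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def macros_to_raise(read_lengths, k,
                    max_seq_length=DEFAULT_MAX_SEQ_LENGTH,
                    max_kmer_size=DEFAULT_MAX_KMER_SIZE):
    """Return the names of the Makefile macros that are too small.

    Assumption: a read of exactly max_seq_length bases and a k of
    exactly max_kmer_size are still supported.
    """
    too_small = []
    if max(read_lengths, default=0) > max_seq_length:
        too_small.append("MAX_SEQ_LENGTH")
    if k > max_kmer_size:
        too_small.append("MAX_KMER_SIZE")
    return too_small


records = [">read1", "ACGTACGT", "ACGT", ">read2", "ACG"]
lengths = list(fasta_read_lengths(records))
print(lengths)                       # [12, 3]
print(macros_to_raise(lengths, 31))  # ['MAX_KMER_SIZE']
```

For a real file, the lines of an opened FASTA file can be passed to fasta_read_lengths in place of the in-memory list.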

Typical Commands

musket -k 21 536870912 -p 12 reads.fa reads2.fa -o output.fa
musket -p 12 reads.fa reads2.fa -o output.fa
musket reads.fa reads2.fa -o output.fa -inorder
musket reads_1.fa reads_2.fa reads1_1.fa reads1_2.fa -omulti output -inorder
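
When Musket is called from a pipeline, commands like the ones above can be assembled programmatically. The following Python sketch is an assumption-laden illustration, not an official wrapper: the function build_musket_command is hypothetical, it uses only options documented in the Parameters section, and it checks the two constraints stated there ("-o" and "-omulti" are mutually exclusive, and "-p" takes at least 2 threads).

```python
import shlex


def build_musket_command(inputs, output=None, omulti=None, k=None,
                         threads=None, inorder=False):
    """Build an argument list for musket (hypothetical helper).

    k is an optional pair (k-mer size, estimated number of k-mers),
    matching the two values that "-k" expects.
    """
    if output is not None and omulti is not None:
        raise ValueError('"-o" and "-omulti" are mutually exclusive')
    if threads is not None and threads < 2:
        raise ValueError('"-p" expects at least 2 threads')

    cmd = ["musket"]
    if k is not None:
        kmer_size, num_kmers = k
        cmd += ["-k", str(kmer_size), str(num_kmers)]
    if threads is not None:
        cmd += ["-p", str(threads)]
    if output is not None:
        cmd += ["-o", output]
    if omulti is not None:
        cmd += ["-omulti", omulti]
    if inorder:
        cmd.append("-inorder")
    cmd += list(inputs)
    return cmd


cmd = build_musket_command(["reads.fa", "reads2.fa"], output="output.fa",
                           k=(21, 536870912), threads=12)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting problems with unusual file names.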

Change Log

February 22, 2013 (release 1.0.6)
1. Added a new option "-lowercase" that lets users choose whether corrected bases are written in lowercase. If specified, all corrected bases are output in lowercase; otherwise, in uppercase.
December 23, 2012 (release 1.0.5)
1. Added a new option "-omulti" to write the corrected reads to multiple output files with the prefix specified by this option. When this option is used, each input file has its own output file, instead of all reads going to a single output file.
2. Removed the option "-paired", since it is equivalent to option "-inorder".
November 9, 2012 (release 1.0.4)
1. Fixed a bug in the Bloom filter implementation. This bug caused the program to consume more memory than expected.

Contact

If you have any questions or suggestions for improvement, please feel free to contact Liu, Yongchao.
