User cookbook

In order to process BAM files and generate file index, the software samtools and tabix need to be installed in the system.

The package sequenza-utils includes several programs and it should support the generation of seqz files using commonly available input files, such as fasta, BAM and vcf files.

To write your own program using the sequenza-utils library, lease refer to the API library interface

Generate GC reference file

The GC content source required to generate seqz files must to be in the wiggle track format (WIG). In order to generate the wig file from any fasta file use the gc_wiggle program.

sequenza-utils gc_wiggle --fasta genome.fa.gz -w 50 -o genome_gc50.wig.gz

From BAM files

Normal and tumor BAM files

sequenza-utils bam2seqz --normal normal_sample.bam --tumor tumor_sample.bam \
    --fasta genome.fa.gz -gc genome_gc50.wig.gz --output sample.seqz.gz

Normal and tumor pileup files

sequenza-utils bam2seqz --normal normal_sample.pielup.gz \
    --tumor tumor_sample.pielup.gz --fasta genome.fa.gz \
    -gc genome_gc50.wig.gz --output sample.seqz.gz --pileup

Without normal, workaround

sequenza-utils bam2seqz --normal tumor_sample.bam --tumor tumor_sample.bam \
    --normal2 non_matching_normal_sample.bam --fasta genome.fa.gz \
    -gc genome_gc50.wig.gz --output sample.seqz.gz

Binning seqz, reduce memory

sequenza-utils seqz_binning --seqz sample.seqz.gz --window 50 \
    -o sample_bin50.seqz.gz

Seqz from VCF files

VCF files with DP and AD tags

sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --output samples.seqz.gz

Mutect/Caveman/Strelka2 preset

sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --preset mutect --output samples.seqz.gz
sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --preset caveman --output samples.seqz.gz
sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --preset strelka2_som --output samples.seqz.gz

Merge seqz files

Non overlapping calls (eg different chromosomes)

gzcat sample_chr1.seqz.gz sample_chr1.seqz.gz | \
    gawk '{if (NR!=1 && $1 != "chromosome") {print $0}}' | bgzip > \
    sample.seqz.gz
tabix -f -s 1 -b 2 -e 2 -S 1 sample.seqz.gz

Overlapping sample_calls

sequenza-utils seqz_merge --seqz1 sample_somatic.seqz.gz \
    --seqz2 sample_snps.seqz.gz --output samples.seqz.gz