User cookbook¶

In order to process BAM files and generate file index, the software samtools and tabix need to be installed in the system.

The package sequenza-utils includes several programs and it should support the generation of seqz files using commonly available input files, such as fasta, BAM and vcf files.

To write your own program using the sequenza-utils library, lease refer to the API library interface

Generate GC reference file¶

The GC content source required to generate seqz files must to be in the wiggle track format (WIG). In order to generate the wig file from any fasta file use the gc_wiggle program.

sequenza-utils gc_wiggle --fasta genome.fa.gz -w 50 -o genome_gc50.wig.gz

From BAM files¶

Normal and tumor BAM files¶

sequenza-utils bam2seqz --normal normal_sample.bam --tumor tumor_sample.bam \
    --fasta genome.fa.gz -gc genome_gc50.wig.gz --output sample.seqz.gz

Normal and tumor pileup files¶

sequenza-utils bam2seqz --normal normal_sample.pielup.gz \
    --tumor tumor_sample.pielup.gz --fasta genome.fa.gz \
    -gc genome_gc50.wig.gz --output sample.seqz.gz --pileup

Without normal, workaround¶

sequenza-utils bam2seqz --normal tumor_sample.bam --tumor tumor_sample.bam \
    --normal2 non_matching_normal_sample.bam --fasta genome.fa.gz \
    -gc genome_gc50.wig.gz --output sample.seqz.gz

Binning seqz, reduce memory¶

sequenza-utils seqz_binning --seqz sample.seqz.gz --window 50 \
    -o sample_bin50.seqz.gz

Seqz from VCF files¶

VCF files with DP and AD tags¶

sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --output samples.seqz.gz

Mutect/Caveman/Strelka2 preset¶

sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --preset mutect --output samples.seqz.gz

sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --preset caveman --output samples.seqz.gz

sequenza-utils snp2seqz --vcf sample_calls.vcf.gz -gc genome_gc50.wig.gz \
    --preset strelka2_som --output samples.seqz.gz

Merge seqz files¶

Non overlapping calls (eg different chromosomes)¶

gzcat sample_chr1.seqz.gz sample_chr1.seqz.gz | \
    gawk '{if (NR!=1 && $1 != "chromosome") {print $0}}' | bgzip > \
    sample.seqz.gz
tabix -f -s 1 -b 2 -e 2 -S 1 sample.seqz.gz

Overlapping sample_calls¶

sequenza-utils seqz_merge --seqz1 sample_somatic.seqz.gz \
    --seqz2 sample_snps.seqz.gz --output samples.seqz.gz