Raw DNA-Seq Data Processing

Use this workflow sample to process raw DNA-seq next-generation sequencing (NGS) data from the Illumina platform. The processing includes:

  • Filtration:
    • Filtering of the NGS short reads by the CASAVA 1.8 header;
    • Trimming of the short reads by quality;
  • Mapping:
    • Mapping of the short reads to the specified reference sequence (the BWA-MEM tool is used in the sample);
  • Post-filtration:
    • Filtering of the aligned short reads with SAMtools to remove reads with low mapping quality and unpaired or unaligned reads;
    • Removal of duplicated short reads.
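
The two filtration steps above can be sketched in Python. This is only an illustration, not the code UGENE runs. It assumes the CASAVA 1.8 header layout `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <read>:<is filtered>:<control number>:<index>`, where `Y` in the third-from-last field marks a read that failed the Illumina filter. It also assumes that quality trimming scans from the 3' end until a base reaches the threshold. The function names, default values and the Phred+33 offset are hypothetical choices made for the example.

```python
import re

# Assumed CASAVA 1.8 header: "@<id fields> <read>:<is_filtered>:<control>:<index>"
CASAVA_18 = re.compile(r"^@\S+ [0-9]+:([YN]):[0-9]+:\S*$")


def passes_casava_filter(header):
    """Return True for a CASAVA 1.8 header whose 'is filtered' field is N."""
    match = CASAVA_18.match(header.strip())
    return match is not None and match.group(1) == "N"


def trim_by_quality(seq, qual, min_quality=30, min_length=0,
                    both_ends=False, offset=33):
    """Trim low-quality bases; return (seq, qual), or None if the read is dropped.

    All defaults are placeholders for this sketch, not UGENE defaults.
    """
    scores = [ord(char) - offset for char in qual]
    end = len(scores)
    # Scan from the 3' end until a base reaches the quality threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    start = 0
    if both_ends:
        # Optionally apply the same scan from the 5' end.
        while start < end and scores[start] < min_quality:
            start += 1
    length = end - start
    if length == 0 or length < min_length:
        return None  # nothing left, or shorter than the allowed minimum
    return seq[start:end], qual[start:end]
```

A read is kept only if `passes_casava_filter` returns True for its header and `trim_by_quality` does not return None.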

The resulting assembly of filtered short reads is provided in SAM format. Intermediate data files are also available in the output.
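
For orientation, the mapping and post-filtration stages correspond roughly to the command-line steps built by the sketch below. This is an assumption about an equivalent manual pipeline, not a description of what UGENE executes internally. The function name, intermediate file names and default values are hypothetical. The sketch only builds argument lists and runs nothing. It assumes the reference has already been indexed with `bwa index`. Recent SAMtools releases mark `rmdup` as obsolete in favor of `markdup`.

```python
def build_pipeline(reference, reads, mapq=20, skip_flags=0x4,
                   threads=1, paired=True):
    """Return a list of argv lists for a hypothetical manual pipeline."""
    if paired and len(reads) != 2:
        raise ValueError("paired-end mode expects exactly two FASTQ files")
    rmdup = ["samtools", "rmdup"]
    if not paired:
        rmdup.append("-s")  # -s: remove duplicates for single-end reads
    rmdup += ["sorted.bam", "dedup.bam"]
    return [
        # Map reads to the indexed reference with BWA-MEM.
        ["bwa", "mem", "-t", str(threads), "-o", "aligned.sam",
         reference, *reads],
        # Keep alignments with MAPQ >= mapq and none of the skip_flags bits set.
        ["samtools", "view", "-b", "-q", str(mapq), "-F", str(skip_flags),
         "-o", "filtered.bam", "aligned.sam"],
        # Duplicate removal needs coordinate-sorted input.
        ["samtools", "sort", "-o", "sorted.bam", "filtered.bam"],
        rmdup,
        # Convert the result back to SAM, keeping the header.
        ["samtools", "view", "-h", "-o", "result.sam", "dedup.bam"],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order, provided that `bwa` and `samtools` are installed.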

How to Use This Sample

If you haven’t used the workflow samples in UGENE before, refer to the “How to Use Sample Workflows” section of the documentation.

Workflow Sample Location

The workflow sample “Raw DNA-Seq processing” can be found in the “NGS” section of the Workflow Designer samples.

Workflow Image

Two versions of the workflow are available. The workflow for single-end reads looks as follows:

[Image: workflow for single-end reads]

The workflow for paired-end reads looks as follows:

[Image: workflow for paired-end reads]

Workflow Wizard

The workflows have similar wizards. The wizard for paired-end reads has 5 pages.

  1. Input data: On this page, you must input FASTQ file(s).

  2. Pre-processing: On this page, you can modify filtration parameters.

    The following parameters are available for filtering reads and read pairs:

    Parameter | Description
    Base quality | Quality threshold for trimming.
    Reads length | Reads shorter than this length are discarded by the filter.
    Trim both ends | Determines whether to trim both ends of a read. Usually set to True for Sanger sequencing and False for NGS.
    3’ adapters | A FASTA file with one or more adapter sequences that were ligated to the 3’ end. The adapter itself and anything that follows it are trimmed.
    5’ adapters | A FASTA file with one or more adapter sequences that were ligated to the 5’ end. An anchored adapter must appear entirely at the 5’ end of the read.
    5’ and 3’ adapters | A FASTA file with one or more adapter sequences that were ligated to the 5’ end or the 3’ end.
  3. Mapping: On this page, you must input the reference and optionally modify advanced parameters.

    The following parameters are available:

    Parameter | Description
    Reference genome | Path to the indexed reference genome.
    Number of threads | Number of threads (-t).
    Min seed length | Minimum seed length (-k).
    Band width | Band width for banded alignment (-w).
    Dropoff | Off-diagonal X-dropoff (-d).
    Internal seed length | Look for internal seeds inside a seed longer than {-k} * FLOAT (-r).
    Skip seed threshold | Skip seeds with more than INT occurrences (-c).
    Drop chain threshold | Drop chains shorter than FLOAT fraction of the longest overlapping chain (-D).
    Rounds of mate rescues | Perform at most INT rounds of mate rescues for each read (-m).
    Skip mate rescue | Skip mate rescue (-S).
    Skip pairing | Skip pairing; mate rescue is performed unless -S is also in use (-P).
    Matching score | Score for a sequence match (-A).
    Mismatch penalty | Penalty for a mismatch (-B).
    Gap open penalty | Gap open penalty (-O).
    Gap extension penalty | Gap extension penalty; a gap of size k costs {-O} + {-E}*k (-E).
    Penalty for clipping | Penalty for clipping (-L).
    Penalty unpaired | Penalty for an unpaired read pair (-U).
    Score threshold | Minimum score to output (-T).
  4. Post-processing: On this page, you can modify post-processing parameters.

    The following parameters are available:

    Parameter | Description
    MAPQ threshold | Minimum MAPQ quality score.
    Skip flag | Skip alignments that have any of the selected flags set. Use the combo box to configure the bit flag. Leave all items unselected to disable filtering by this parameter.
    Region | Regions to filter. For BAM output only. Use chr2 to output the whole chr2, chr2:1000 to output the region of chr2 starting from position 1000, and chr2:1000-2000 to output the region of chr2 between 1000 and 2000, including the endpoints. Use space separators for multiple regions (e.g., chr1 chr2 chr3:1000-2000).
    For single-end reads | Remove duplicates for single-end reads.
  5. Output data: On this page, you must specify the output parameters.
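
To make the MAPQ threshold and Skip flag parameters of the Post-processing page more concrete, the sketch below shows how such a filter decides whether to keep an alignment. The bit values follow the SAM format specification. The dictionary keys and function names are hypothetical labels chosen for this example and are not the item names shown in the UGENE combo box.

```python
# SAM FLAG bits as defined in the SAM specification; key names are illustrative.
SAM_FLAGS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "unmapped": 0x4,
    "mate_unmapped": 0x8,
    "reverse": 0x10,
    "mate_reverse": 0x20,
    "first_in_pair": 0x40,
    "second_in_pair": 0x80,
    "secondary": 0x100,
    "qc_fail": 0x200,
    "duplicate": 0x400,
    "supplementary": 0x800,
}


def skip_mask(names):
    """Combine the selected flag names into one integer bit mask."""
    mask = 0
    for name in names:
        mask |= SAM_FLAGS[name]
    return mask


def keep_alignment(flag, mapq, min_mapq, mask):
    """Keep an alignment if MAPQ is high enough and no masked bit is set."""
    return mapq >= min_mapq and (flag & mask) == 0
```

For example, `skip_mask(["unmapped", "mate_unmapped"])` gives 12, so an alignment with FLAG 77 (paired, unmapped, mate unmapped, first in pair) is skipped, while one with FLAG 99 and a sufficient MAPQ is kept. An empty selection gives a mask of 0, which disables filtering by flag.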