FASTA format

Latest revision as of 12:36, 17 March 2025

FASTA format is a text-based format for representing nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences. The name "FASTA" derives from the FASTA software package, first developed in the 1980s by David J. Lipman and William R. Pearson for sequence alignment and searching. Today, FASTA format is widely used in bioinformatics for sequence alignment and sequence database searches, and it is supported by many bioinformatics tools and databases.

Format

A record in FASTA format begins with a single-line description, followed by one or more lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol at the beginning. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). A sequence ends at the end of the file or when another line starting with ">" appears, which marks the start of the next sequence.

Example

>seq1 Two different sequences
GATCAGTAGC
>seq2 Another sequence
TTAGGATCTG

In this example, there are two sequences. The first sequence has an identifier of "seq1" and a description of "Two different sequences". The sequence "GATCAGTAGC" follows the description. The second sequence is identified by "seq2" with a description of "Another sequence" and has the sequence "TTAGGATCTG".
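The rules above can be sketched as a small parser. This is a minimal illustration in Python, not a reference implementation: the names parse_fasta and _make_record are hypothetical, and the sketch assumes that blank lines can be ignored and that any text before the first ">" line is not part of a record.

```python
from typing import Iterator, List, Tuple


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    """Split a header into identifier and description, and join the sequence lines."""
    parts = header.split(None, 1)  # split on the first run of whitespace
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks)


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record in the text."""
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; collect them all.
            chunks.append(line)
    # The end of the input closes the last record.
    if header is not None:
        yield _make_record(header, chunks)


example = """>seq1 Two different sequences
GATCAGTAGC
>seq2 Another sequence
TTAGGATCTG
"""

records = list(parse_fasta(example))
for identifier, description, sequence in records:
    print(identifier, description, sequence, sep=" | ")
```

Applied to the example above, the sketch yields one tuple per sequence, with "seq1" and "seq2" as the identifiers.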

Usage

FASTA format is used for a variety of purposes in bioinformatics, including:

Sequence alignment: Tools such as BLAST (Basic Local Alignment Search Tool) and Clustal accept input sequences in FASTA format, and some can also write their results in it.
Sequence database searches: Databases such as GenBank, EMBL, and Swiss-Prot allow users to download sequences in FASTA format.
Molecular biology software: Many software tools for sequence analysis, gene prediction, and other tasks accept sequences in FASTA format.

Advantages and Limitations

The simplicity of FASTA format is a major advantage, making it easy to create, edit, and parse with basic text-processing tools. However, this simplicity also means that FASTA format lacks the ability to represent complex annotations and features of sequences, such as gene locations, exons, and introns. For more complex annotations, formats such as GenBank format or GFF (General Feature Format) are more appropriate.
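As an illustration of how little is needed to create FASTA data, the following Python sketch builds FASTA text from in-memory records using only string operations. The function name format_fasta and the (identifier, description, sequence) tuple layout are assumptions made for this example, and each sequence is written on a single line.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str, str]]) -> str:
    """Return FASTA text for (identifier, description, sequence) tuples."""
    lines = []
    for identifier, description, sequence in records:
        header = ">" + identifier
        if description:
            # The description is optional and follows the identifier.
            header += " " + description
        lines.append(header)
        lines.append(sequence)  # assumption: one line per sequence
    return "".join(line + "\n" for line in lines)


text = format_fasta([
    ("seq1", "Two different sequences", "GATCAGTAGC"),
    ("seq2", "Another sequence", "TTAGGATCTG"),
])
print(text, end="")
```

Nothing in this output records where a gene, exon, or intron lies within a sequence, which is the limitation described above.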
