Fasta: Difference between revisions

Revision as of 11:20, 15 February 2025

Overview

FASTA is a text-based format for representing nucleotide or peptide sequences, in which each nucleotide or amino acid is written as a single-letter code. Each sequence can be preceded by a line giving its name and comments.

History

The FASTA format was introduced in the 1980s as part of the FASTA software package, which David J. Lipman and William R. Pearson developed for sequence alignment. The format has since become a standard in bioinformatics for exchanging sequence data.

Format Description

Each record in a FASTA file begins with a single-line description, followed by one or more lines of sequence data. The description line starts with a greater-than (>) symbol, followed by a sequence identifier and an optional description. Sequence lines are conventionally kept to 80 characters or fewer. A file may contain several such records, one after another.

Example

>sequence_1 Homo sapiens
ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
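
The layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not a reference implementation: the names parse_fasta and _make_record are hypothetical, and it assumes that the identifier is the first whitespace-delimited token of the description line and that blank lines can be ignored.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each record.

    Hypothetical helper: assumes the identifier is the first
    whitespace-delimited token after the '>' symbol.
    """
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before any '>' description line")
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header: str, chunks: List[str]) -> Tuple[str, str, str]:
    # Split the description line into identifier and optional description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # Sequence lines are joined back into one continuous string.
    return identifier, description, "".join(chunks)


example = """>sequence_1 Homo sapiens
ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
"""
records = list(parse_fasta(example.splitlines()))
for identifier, description, sequence in records:
    print(identifier, description, len(sequence))
```

Because the function is a generator, it can also be given an open file object and will read records one at a time rather than loading the whole file into memory.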

Applications

FASTA format is widely used in bioinformatics for storing and sharing DNA, RNA, and protein sequences. It is compatible with many bioinformatics tools and databases, such as BLAST, GenBank, and UniProt.

Advantages

The simplicity and flexibility of the FASTA format make it easy to parse and manipulate. It is human-readable and can be easily edited with any text editor.
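
As an illustration of how little is needed to produce the format, the sketch below writes one record and wraps the sequence at a fixed width. The function name format_fasta and its parameters are hypothetical, and the default width of 80 simply follows the convention mentioned under Format Description.

```python
def format_fasta(identifier: str, sequence: str,
                 description: str = "", width: int = 80) -> str:
    """Return one FASTA record as text, wrapping the sequence at `width`.

    Hypothetical helper for illustration only.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    header = ">" + identifier
    if description:
        header += " " + description
    # Slice the sequence into consecutive chunks of at most `width` characters.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *body]) + "\n"


record = format_fasta("sequence_1", "ATGC" * 50, "Homo sapiens")
print(record, end="")
```

Here a 200-character sequence is written as a description line followed by wrapped sequence lines, none longer than the chosen width.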

Limitations

The FASTA format does not support rich metadata or annotations beyond the single description line. For more complex data, formats such as the GenBank format or GFF may be more appropriate.

Related Pages

@@ Line 1: / Line 1: @@
-'''FASTA''' is a [[bioinformatics]] software package used for [[sequence alignment]] of [[nucleotide]] or [[protein]] sequences. It was developed by [[David J. Lipman]] and [[William R. Pearson]] in 1985. The FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
+{{DISPLAYTITLE:Fasta}}
-==History==
-FASTA was one of the first widely used algorithms for [[sequence alignment]] and [[database searching]]. It introduced the concept of using [[heuristics]] to speed up the process of finding similar sequences in large databases. The original FASTA algorithm was designed to search protein databases, but it was later adapted to search nucleotide databases as well.
+== Overview ==
-==Algorithm==
+[[File:Chrysolina_fastuosa_(copula).ogv|thumb|right|Chrysolina fastuosa in copula]]
-The FASTA algorithm works by first identifying regions of similarity between sequences using a [[hash table]] of short words (k-tuples). These regions are then extended to form longer alignments. The algorithm uses a scoring system to evaluate the quality of the alignments, taking into account factors such as [[gap penalties]] and [[substitution matrices]].
+'''Fasta''' is a text-based format for representing nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
-==FASTA Format==
-The FASTA format is simple and consists of a header line followed by lines of sequence data. The header line starts with a ">" character and is followed by a sequence identifier and optional description. The sequence data is represented in a single-letter code, with each line typically containing up to 80 characters.
+== History ==
-Example of a FASTA format:
+The FASTA format was first introduced in the 1980s as part of the FASTA software package, which was developed for sequence alignment. The format has since become a standard in bioinformatics for sequence data exchange.
- <nowiki>
- >sequence1
+== Format Description ==
- AGCTGATCGATCGTACGATCG
+A FASTA file begins with a single-line description, followed by lines of sequence data. The description line starts with a greater-than (''>'') symbol, followed by a sequence identifier and optional description. The sequence data follows, with each line typically not exceeding 80 characters.
- >sequence2
- CGTAGCTAGCTAGCTAGCTAG
+=== Example ===
- </nowiki>
+<pre>
-==Applications==
+>sequence_1 Homo sapiens
-FASTA is widely used in [[bioinformatics]] for tasks such as:
+ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
-* [[Sequence alignment]]
+</pre>
-* [[Database searching]]
-* [[Phylogenetic analysis]]
+== Applications ==
-* [[Gene prediction]]
+FASTA format is widely used in [[bioinformatics]] for storing and sharing [[DNA]], [[RNA]], and [[protein]] sequences. It is compatible with many bioinformatics tools and databases, such as [[BLAST]], [[GenBank]], and [[UniProt]].
-==Related Software==
-FASTA has inspired the development of several other sequence alignment tools, including:
+== Advantages ==
-* [[BLAST]]
+The simplicity and flexibility of the FASTA format make it easy to parse and manipulate. It is human-readable and can be easily edited with any text editor.
-* [[Clustal]]
-* [[MAFFT]]
+== Limitations ==
-==See Also==
+FASTA format does not support rich metadata or annotations beyond the simple description line. For more complex data, formats like [[GenBank format]] or [[GFF]] may be more appropriate.
+== Related Pages ==
+* [[FASTA software]]
 * [[Sequence alignment]]
 * [[Bioinformatics]]
-* [[BLAST]]
+* [[GenBank]]
-* [[Clustal]]
+* [[UniProt]]
-* [[MAFFT]]
-* [[Substitution matrix]]
-* [[Gap penalty]]
-==References==
-{{Reflist}}
-==External Links==
-{{Commons category|FASTA}}
 [[Category:Bioinformatics]]
-[[Category:Sequence alignment algorithms]]
+[[Category:File formats]]
-[[Category:Computational biology]]
-{{bioinformatics-stub}}