Stockholm format
Latest revision as of 02:51, 18 March 2025
Stockholm format is a multiple sequence alignment format used by many bioinformatics and computational biology tools, including HMMER and the Pfam and Rfam databases. It stores the aligned sequences together with annotations of the alignment, such as conserved positions, secondary structure, and database references.
Overview
The Stockholm format is a plain-text, line-oriented format that is readable by people and straightforward for programs to parse. Besides the alignment itself, it can carry annotations at four levels: the whole file, each sequence, each residue of a sequence, and each column of the alignment. This makes it useful in genomics, proteomics, and molecular biology, where alignments are interpreted alongside functional and structural information.
Format Specification
A Stockholm file consists of a series of lines. Sequence lines begin with the sequence name, while annotation (markup) lines begin with #= followed by a two-letter code that indicates the type of annotation. The key components of a Stockholm file include:
- Sequence data: Each sequence line holds the sequence name, whitespace, and the aligned sequence, with gaps written as "." or "-". A sequence usually occupies a single line, but a long alignment may be interleaved, that is, split into blocks in which the same name reappears with the next stretch of columns.
- #=GF (Generic per-File annotations): These lines contain free-text information applicable to the entire alignment, such as its accession, description, or database references.
- #=GS (Generic per-Sequence annotations): These lines provide free-text information specific to a single sequence, such as source organism or accession number.
- #=GR (Generic per-Residue annotations): These lines contain one markup character per column for a named sequence, such as its secondary structure or posterior probabilities.
- #=GC (Generic per-Column annotations): These lines contain one markup character per alignment column, such as the consensus secondary structure (SS_cons) or a consensus sequence marking conserved positions.
Each alignment begins with the header line # STOCKHOLM 1.0 and ends with a line containing only //.
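The line types described above can be told apart by their prefixes alone. The following is a minimal sketch in Python, not a reference implementation: the function name parse_stockholm, the returned dictionary layout, and the small EXAMPLE alignment are all hypothetical and chosen for illustration. It assumes one alignment per input and well-formed lines, and it concatenates lines that repeat a name, which also covers files that split an alignment into several blocks.

```python
from collections import defaultdict

# Hypothetical toy alignment, invented for illustration only.
EXAMPLE = """\
# STOCKHOLM 1.0
#=GF ID   EXAMPLE
#=GS seq1 AC A00001
seq1          ACDEF.GHIK
#=GR seq1 SS  HHHHH.EEEE
seq2          ACDEFLGH-K
#=GC SS_cons  HHHHH.EEEE
//
"""


def parse_stockholm(text):
    """Parse the first alignment in `text` into sequences and annotations."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM 1."):
        raise ValueError("missing '# STOCKHOLM 1.0' header")

    seqs = defaultdict(str)                     # name -> aligned sequence
    gf = defaultdict(list)                      # feature -> free-text values
    gs = defaultdict(lambda: defaultdict(list)) # name -> feature -> values
    gr = defaultdict(lambda: defaultdict(str))  # name -> feature -> markup
    gc = defaultdict(str)                       # feature -> per-column markup

    for line in lines[1:]:
        line = line.rstrip()
        if not line:
            continue                            # blank line between blocks
        if line == "//":
            break                               # end of this alignment
        if line.startswith("#=GF "):
            _, feature, value = line.split(None, 2)
            gf[feature].append(value)
        elif line.startswith("#=GS "):
            _, name, feature, value = line.split(None, 3)
            gs[name][feature].append(value)
        elif line.startswith("#=GR "):
            _, name, feature, markup = line.split(None, 3)
            gr[name][feature] += markup
        elif line.startswith("#=GC "):
            _, feature, markup = line.split(None, 2)
            gc[feature] += markup
        elif line.startswith("#"):
            continue                            # any other comment line
        else:
            name, aligned = line.split(None, 1)
            seqs[name] += aligned               # join blocks of one sequence

    return {
        "sequences": dict(seqs),
        "gf": dict(gf),
        "gs": {name: dict(feats) for name, feats in gs.items()},
        "gr": {name: dict(feats) for name, feats in gr.items()},
        "gc": dict(gc),
    }


alignment = parse_stockholm(EXAMPLE)
print(alignment["sequences"])
print(alignment["gc"])
```

A production parser would also need to handle several alignments in one file and report malformed lines, which this sketch leaves out.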
Usage
Stockholm format is widely used in bioinformatics software and databases, including HMMER for homology searches and for building profile hidden Markov models (HMMs), and the Pfam and Rfam databases for protein and RNA families, respectively. Because annotations travel with the alignment, it is often preferred for detailed analysis of sequence features and evolutionary relationships.
Advantages
- Rich Annotations: Annotations can be attached to the file, to individual sequences, to individual residues, and to alignment columns, so structural and functional information travels with the alignment.
- Flexibility: It can represent both nucleotide and amino acid alignments, making it suitable for a wide range of applications in molecular biology.
- Compatibility: Many bioinformatics tools and databases support the Stockholm format, facilitating easy data exchange and integration.
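Because the same layout serves nucleotide and protein alignments, producing a file for exchange is mostly a matter of padding names so the columns line up. The sketch below illustrates this with a small RNA alignment; the function name write_stockholm, its parameters, and the example sequences are hypothetical, and only sequence lines and #=GC lines are written.

```python
def write_stockholm(sequences, gc=None):
    """Render aligned sequences (name -> string) as Stockholm text.

    `gc` optionally maps per-column feature names to markup strings.
    """
    gc = gc or {}
    lengths = {len(s) for s in list(sequences.values()) + list(gc.values())}
    if len(lengths) != 1:
        raise ValueError("all sequences and markup must have equal length")

    # Pad every label to a common width so the alignment columns line up.
    labels = list(sequences) + ["#=GC " + feature for feature in gc]
    width = max(len(label) for label in labels) + 2

    out = ["# STOCKHOLM 1.0"]
    for name, aligned in sequences.items():
        out.append(f"{name:<{width}}{aligned}")
    for feature, markup in gc.items():
        label = "#=GC " + feature
        out.append(f"{label:<{width}}{markup}")
    out.append("//")
    return "\n".join(out) + "\n"


# Hypothetical RNA hairpin alignment with a consensus structure line.
text = write_stockholm(
    {"rna1": "GGCA-UGCC", "rna2": "GGCAAUGCC"},
    gc={"SS_cons": "<<<...>>>"},
)
print(text)
```

The length check reflects the requirement that per-column markup covers exactly the columns of the alignment.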
Limitations
- Complexity: The richness and flexibility of the format can also make it more complex to parse and generate compared to simpler formats like FASTA.
- File Size: Annotations can significantly increase the size of the files, which might be a concern when dealing with large datasets.
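Both limitations can be made concrete by converting an alignment to FASTA, which keeps only names and sequences. The sketch below is a simplified illustration under stated assumptions: the function name stockholm_to_fasta, its keep_gaps parameter, and the SAMPLE alignment are hypothetical, and every line beginning with # is simply discarded.

```python
# Hypothetical toy alignment, invented for illustration only.
SAMPLE = """\
# STOCKHOLM 1.0
#=GF ID   EXAMPLE
#=GS seq1 AC A00001
seq1          ACDEF.GHIK
#=GR seq1 SS  HHHHH.EEEE
seq2          ACDEFLGH-K
#=GC SS_cons  HHHHH.EEEE
//
"""


def stockholm_to_fasta(text, keep_gaps=False):
    """Return FASTA text for the first alignment, dropping all annotations."""
    seqs = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            break                       # end of the first alignment
        if not line or line.startswith("#"):
            continue                    # header, comments, and #= markup
        name, aligned = line.split(None, 1)
        seqs[name] = seqs.get(name, "") + aligned

    records = []
    for name, aligned in seqs.items():
        if not keep_gaps:
            # Remove both gap characters to recover the unaligned sequence.
            aligned = aligned.replace(".", "").replace("-", "")
        records.append(f">{name}\n{aligned}\n")
    return "".join(records)


fasta = stockholm_to_fasta(SAMPLE)
print(fasta)
print(len(SAMPLE), "characters as Stockholm,", len(fasta), "as FASTA")
```

The conversion is lossy: the accession, the per-residue structure, and the consensus line cannot be recovered from the FASTA output.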
Conclusion
The Stockholm format offers a single plain-text container for sequence alignments and their annotations. Its adoption by tools such as HMMER and by databases such as Pfam and Rfam makes it a common exchange format for annotated alignments in molecular biology research.