BIOINFORMATICS<-->STRUCTURE
Jerusalem, Israel, November 17-21, 1996

Abstract


SWISS-PROT and its computer-annotated supplement TREMBL: New developments in the linking of biological databases and computer-generation of annotation

Rolf Apweiler, Alain Gateau & Vivien Junker
EMBL Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
apweiler@ebi.ac.uk

SWISS-PROT is a curated protein sequence database with a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Two major changes to SWISS-PROT have been introduced. First, SWISS-PROT entries are now linked not only to nucleotide sequence database entries as a whole, but to the CDS level of the feature table of EMBL nucleotide sequence database entries. Second, a computer-annotated supplement to SWISS-PROT has been introduced: TREMBL.

Ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database) as a supplement to SWISS-PROT. TREMBL consists of entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT.
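
As a rough illustration of what translating a CDS involves, the following sketch converts a nucleotide coding sequence into a protein sequence using the standard genetic code. It is a toy under stated assumptions, not the TREMBL production procedure: the function name `translate_cds` is hypothetical, only the standard code is applied, translation stops at the first stop codon, and any codon containing an ambiguous base is rendered as X.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence (hypothetical helper, standard code only)."""
    seq = cds.upper().replace("U", "T")
    protein = []
    # Walk the sequence codon by codon; an incomplete trailing codon is ignored.
    for pos in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)
```

A real pipeline would also need to honour alternative genetic codes, reading-frame offsets and partial CDS features, none of which are handled here.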

TREMBL currently (May 1996) contains 98,447 sequence entries, comprising 25,879,831 amino acids, and is split into two sections: SP-TREMBL (SWISS-PROT TREMBL), which contains entries that will be added to SWISS-PROT after complete annotation, and REM-TREMBL (REMaining TREMBL), which contains entries not intended for inclusion in SWISS-PROT.

Most of the 81,861 sequence entries currently (May 1996) in SP-TREMBL are additional sequence reports of entries already in SWISS-PROT and will lead to updates of these SWISS-PROT entries. However, some 20,000 to 40,000 entries now in SP-TREMBL will eventually be included as new sequence entries in SWISS-PROT. Identical sequences in SP-TREMBL from the same species have been merged to reduce redundancy. We are currently working on a further reduction of redundancy by establishing rules to merge sub-fragments into full-length sequences. We are also working on the identification of sequence differences due to polymorphisms, strain variations and sequencing errors, with the goal of establishing rules to merge conflicting reports about one and the same sequence into a single entry.

For SP-TREMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced whereby valuable annotation is added automatically. All relevant information in EMBL nucleotide sequence entries has been extracted and added to the SP-TREMBL entry to enhance its annotation content. A range of sequence analysis tools and the PROSITE pattern database are also used to detect any consensus sequences or motifs present. Based on this, information about the potential function of the protein, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location and other annotation is added automatically to the entry whenever appropriate. We also make use of the ENZYME database: information such as catalytic activity, cofactors and relevant keywords can be taken from ENZYME and added automatically to SP-TREMBL entries. Furthermore, we make use of specialized databases to parse information such as the correct gene nomenclature into TREMBL entries. We are currently investigating methods for scanning Medline abstracts for relevant information that can be added automatically.
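
To make the idea of motif detection with PROSITE patterns concrete, the sketch below converts a pattern written in PROSITE syntax into a regular expression and reports where it matches a protein sequence. This is only an illustration and not the software used for SP-TREMBL: the helper names `prosite_to_regex` and `find_motifs` are hypothetical, and only the common syntax elements are handled (residue sets in square brackets, excluded residues in curly braces, `x` wildcards, repeat counts in parentheses, and the `<` and `>` terminal anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern to a Python regex (hypothetical helper)."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C".
        element = element.replace("{", "[^").replace("}", "]")
        # (n) or (n,m) are repeat counts.
        element = element.replace("(", "{").replace(")", "}")
        # x is the wildcard; < and > anchor to the N- and C-terminus.
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motifs(sequence: str, pattern: str) -> list[int]:
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [match.start() + 1 for match in regex.finditer(sequence)]
```

For instance, the N-glycosylation site pattern `N-{P}-[ST]-{P}.` becomes the regular expression `N[^P][ST][^P]`. A match only flags a candidate site; whether annotation should be added on that basis is a separate decision.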

The production of TREMBL has emphasised the importance of linking not only to the EMBL entry as a whole but also to individual features within it. This point is highlighted by the numerous genome projects that are currently submitting sequences to the EMBL/GenBank/DDBJ Nucleotide Sequence Database. As these projects continue, longer contiguous sequences will be submitted. These longer contigs will contain many more CDS features, resulting in many more SWISS-PROT/SP-TREMBL entries. In this context, the need for linking at the CDS feature level is evident. This linking has now been achieved by using the PID, the Protein IDentification number found in the /db_xref qualifier attached to every CDS in the EMBL nucleotide sequence database. The DR lines of SWISS-PROT and TREMBL entries that point to an EMBL database entry now cite the EMBL AC number as primary identifier and the PID as secondary identifier. In all cases where a PID is already integrated into SWISS-PROT, a /db_xref qualifier citing the corresponding SWISS-PROT entry is added to the EMBL nucleotide sequence database CDS labelled with this PID. In the remaining cases, a /db_xref qualifier points to the corresponding TREMBL entry.

For example, the SWISS-PROT entry with accession number P10662 contains the DR line:

DR EMBL; M15160; G171969; -.

The corresponding CDS is represented in EMBL as:

FT CDS 80..1045
FT /db_xref="PID:g171969"
FT /db_xref="SWISS-PROT:P10662"
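
The reciprocal linking shown in this example can be checked mechanically. The following sketch parses a DR line and the /db_xref qualifiers of a CDS feature and tests whether they point at each other. All function names are hypothetical, the parsing is limited to the line layouts shown above, and the PID is compared case-insensitively because the example writes it as G171969 in the DR line and g171969 in the feature table; that comparison rule is an assumption, not a documented convention.

```python
import re

# Matches qualifiers such as /db_xref="PID:g171969".
XREF_RE = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def parse_dr_embl(line: str):
    """Return (EMBL accession, PID) from a DR line citing EMBL, else None."""
    fields = [field.strip() for field in line.strip().rstrip(".").split(";")]
    if len(fields) < 3 or fields[0].split() != ["DR", "EMBL"]:
        return None
    return fields[1], fields[2]


def parse_db_xrefs(ft_lines):
    """Collect the db_xref qualifiers of one feature as {database: identifier}."""
    xrefs = {}
    for line in ft_lines:
        match = XREF_RE.search(line)
        if match:
            xrefs[match.group(1)] = match.group(2)
    return xrefs


def links_match(dr_line: str, ft_lines, swissprot_ac: str) -> bool:
    """True if the DR line and the CDS feature reference each other."""
    parsed = parse_dr_embl(dr_line)
    if parsed is None:
        return False
    _, pid = parsed
    xrefs = parse_db_xrefs(ft_lines)
    same_pid = xrefs.get("PID", "").lower() == pid.lower()
    return same_pid and xrefs.get("SWISS-PROT") == swissprot_ac
```

Applied to the example, `parse_dr_embl` yields the accession M15160 and the PID G171969, and `links_match` confirms that the CDS at 80..1045 carries both the same PID and the SWISS-PROT accession P10662.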

This approach enables us to point precisely from a given SWISS-PROT or TREMBL entry to one of potentially many CDSs in the corresponding EMBL entry, and vice versa. This change will allow the development of software tools that automatically retrieve the part of a nucleotide sequence entry that codes for a specific protein. It will render obsolete the current situation where, for example, one needs to retrieve the complete sequence of a yeast chromosome when one wants only the nucleotide sequence coding for a specific protein encoded on that chromosome.
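
A minimal sketch of such a retrieval tool is given below. It assumes that the CDS feature has already been located through its PID and that its location is a simple range of the form `start..end`, with 1-based inclusive coordinates as in the 80..1045 example. The function name `extract_simple_cds` is hypothetical, and compound locations (for example joined or complementary-strand ranges) are deliberately rejected rather than interpreted.

```python
import re


def extract_simple_cds(sequence: str, location: str) -> str:
    """Return the subsequence for a simple 'start..end' location (1-based, inclusive)."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"location {location!r} lies outside the sequence")
    # Convert 1-based inclusive coordinates to a Python slice.
    return sequence[start - 1:end]
```

For the location 80..1045 this returns 966 nucleotides, that is 322 codons including the stop codon, without the caller having to handle the rest of the entry.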

Moreover, the concepts outlined here share a common goal: to link features from one dataset to all other relevant datasets. This is a goal that we are set on achieving, not only with SWISS-PROT but also with its supplement TREMBL. Alongside the development of tools for the automatic addition of relevant information, we have achieved a much deeper integration with the EMBL Nucleotide Sequence Database, which serves to enhance the close collaboration.

