BIOINFORMATICS<-->STRUCTURE
Jerusalem, Israel, November 17-21, 1996

Abstract


An introduction to mmCIF: conversion between PDB format and mmCIF

Frances C. Bernstein (1), Herbert J. Bernstein (2) and Philip E. Bourne (3)

(1) Protein Data Bank, Biology Dept., Brookhaven National Lab., Upton, NY, USA,
(2) Bernstein + Sons, 5 Brewster Lane, Bellport, NY, USA,
(3) San Diego Supercomputer Center, PO Box 85608, San Diego, CA, USA

bernstei@fcb.pdb.bnl.gov


The relationship between the mmCIF format for macromolecular structures and the Protein Data Bank coordinate entry format is discussed, giving examples of how to translate between them.

The Protein Data Bank format has been used for over 20 years to archive macromolecular data, is produced by many refinement programs, and is used as an input format by many applications. The pending adoption of the mmCIF dictionary by the IUCr makes translation between mmCIF format and PDB format a pressing issue. That adoption responds to the need to represent a larger amount of data explicitly, in a form that can be parsed by computer, a need that becomes more acute as the number of structures continues to grow exponentially.

The two formats are different both in presentation and in content. The PDB format consists mainly of fixed format fields in an ordered set of records. The mmCIF format uses a tag-value style of presentation in which the name of each data item is displayed along with the associated value. Thus mmCIF format has very little sensitivity to the ordering of the information. The content of PDB entries is organized around the presentation of sets of atomic coordinates associated with chains and HET groups and has little normalization. The content of mmCIF data sets is organized around "entities" (discrete chemical components) and is extensively normalized. Normalization is a concept from the design of databases in which data is organized into the rows and columns of tables with a single data item in each table position, with unique keys to identify each row, and minimal repetition of the same information, so that it is easier to update, check and retrieve data reliably. With care, all the information of interest about a macromolecule can be presented in either format clearly and efficiently, but challenging problems arise in moving between the two formats.
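The difference in presentation can be illustrated with a minimal Python sketch, which is not part of pdb2cif or cif2pdb. It slices one fixed-column PDB ATOM record by column position and re-emits the same values as an mmCIF-style loop of tag-value data. The function names, the example record, and the choice of _atom_site tags are assumptions made for illustration; in particular, mapping the PDB chain identifier and residue number to auth_asym_id and auth_seq_id is a simplification, and quoting of values is not handled.

```python
# Illustrative example record (hypothetical data), laid out in PDB fixed columns.
EXAMPLE_LINE = (
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
)


def parse_atom_record(line):
    """Slice one fixed-column PDB ATOM/HETATM record into named fields."""
    line = line.ljust(80)  # pad so that short records can still be sliced
    return {
        "group_PDB": line[0:6].strip(),         # columns 1-6
        "id": int(line[6:11]),                  # columns 7-11
        "label_atom_id": line[12:16].strip(),   # columns 13-16
        "label_comp_id": line[17:20].strip(),   # columns 18-20
        "auth_asym_id": line[21].strip() or ".",  # column 22, "." if blank
        "auth_seq_id": int(line[22:26]),        # columns 23-26
        "Cartn_x": float(line[30:38]),          # columns 31-38
        "Cartn_y": float(line[38:46]),          # columns 39-46
        "Cartn_z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),        # columns 55-60
        "B_iso_or_equiv": float(line[60:66]),   # columns 61-66
    }


def to_mmcif_loop(atoms):
    """Write parsed atoms as a loop_ in which every column is named by a tag."""
    tags = list(atoms[0].keys())
    lines = ["loop_"]
    lines += ["_atom_site." + tag for tag in tags]
    for atom in atoms:
        lines.append(" ".join(str(atom[tag]) for tag in tags))
    return "\n".join(lines)


print(to_mmcif_loop([parse_atom_record(EXAMPLE_LINE)]))
```

In the PDB record the meaning of each value is fixed by its column position, whereas in the loop each value is identified by the tag that heads its column, so the tags could be listed in a different order without loss of information.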

The program pdb2cif can translate a PDB entry into a dataset that is substantially compliant with the mmCIF dictionary, though careful checking of the results is suggested. The program cif2pdb can translate an mmCIF dataset into a "pseudo-PDB" entry that can be used, for example, as input to graphic display programs. Both programs will be extended as the mapping between the two formats becomes better understood.
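The reverse direction can be sketched in the same spirit. The Python fragment below is a hedged illustration only and does not reproduce the logic of cif2pdb: it reads a single, simple _atom_site loop (values without quotes or embedded whitespace) and writes each row as a fixed-column ATOM record. The function names, the sample data, and the simplified atom-name alignment rule are assumptions.

```python
# Hypothetical sample: one atom in a simple mmCIF loop.
SAMPLE_CIF = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67
"""


def read_simple_loop(text):
    """Parse one loop_ whose values contain no quotes or whitespace."""
    tags, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "loop_" or line.startswith("#"):
            continue
        if line.startswith("_"):
            tags.append(line.split(".", 1)[1])  # keep the item name only
        else:
            rows.append(dict(zip(tags, line.split())))
    return rows


def format_atom_record(atom):
    """Place tagged values into the fixed columns of a PDB ATOM record."""
    name = atom["label_atom_id"]
    # Simplified alignment (assumption): names shorter than four
    # characters start in column 14.
    name_field = name if len(name) == 4 else " " + name.ljust(3)
    chain = atom["auth_asym_id"]
    if chain in (".", "?"):  # mmCIF placeholders become a blank chain ID
        chain = " "
    return "{:<6}{:>5} {}{}{:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}".format(
        atom["group_PDB"], int(atom["id"]), name_field, " ",
        atom["label_comp_id"], chain, int(atom["auth_seq_id"]),
        float(atom["Cartn_x"]), float(atom["Cartn_y"]), float(atom["Cartn_z"]),
        float(atom["occupancy"]), float(atom["B_iso_or_equiv"]),
    )


for row in read_simple_loop(SAMPLE_CIF):
    print(format_atom_record(row))
```

Because the values are looked up by tag rather than by position, the sketch gives the same record if the tags in the loop are listed in a different order, provided the data columns are reordered to match.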

For mmCIF: http://ndbserver.rutgers.edu
For PDB: http://www.pdb.bnl.gov

Work supported in part by US NSF, PHS, NIH, NCRR, NIGMS, NLM and DOE under contract DE-AC02-76CH00016 (for FCB) and US NSF grant no. BIR 9310154 (for PEB).

