Access to Biological Collections Data (ABCD) Primer
Please send comments, questions and additions on the Primer to Neil Thomson or Walter Berendsohn.
Title : Access to Biological Collections Data (ABCD) Primer
Date : 2006-07-27
Editors : Neil Thomson (NHM, London) <n.thomson [at] nhm.ac.uk>
Markus Döring (BGBM, Berlin) <m.doering [at] bgbm.org>
Renato De Giovanni (CRIA, Campinas) <renato [at] cria.org.br>
Javier de la Torre (MNCN, Madrid) <jatorre [at] gmail.com>
Walter Berendsohn (BGBM Berlin-Dahlem) <w.berendsohn [at] bgbm.org>
Wouter Addink (ETI, Amsterdam) <wouter [at] eti.uva.nl>
William Ulate (INBio, Santo Domingo de Heredia) <wulate [at] inbio.ac.cr>
Copyright : (C) TDWG 2006
Abstract : This primer is intended to provide an easily readable background to ABCD and
should take anyone with no knowledge of the standard to the very point where
they would be able to understand the principles and the more detailed
technical specification. Examples are given which are complemented by
references to the normative texts.
ABCD - Access to Biological Collections Data - Schema is a common data specification for biological collection units, including living and preserved specimens, along with field observations that did not produce voucher specimens. It is intended to support the exchange and integration of detailed primary collection and observation data.
All of the world's biological collections contain a number of data items, including specimen-specific elements (e.g. taxon, altitude, sex) and collection-specific elements (e.g. holding institution). The set of elements used varies from collection to collection. ABCD provides a standard set of element names and definitions for scientists and curators to use. No collection is expected to use more than a fraction of the elements defined in the standard, nor would that be possible.
A design goal was to be both comprehensive and general: to include a broad array of concepts that might be available in a collection database, but to mandate only the bare minimum of elements required to make the specification functional. ABCD deliberately does not cover taxonomic data, such as synonymy, beyond the use of names in identifications. Likewise, taxon-related information such as distribution range or indicator values is not included. The elements and concepts used are as compatible as possible with other standards in the field of biological collection data, such as HISPID, Darwin Core, and others. ABCD version 2 is a TDWG standard, ratified at the annual TDWG meeting in September 2005, and GBIF promotes it for global use.
The technical data specification is cast as an XML schema.
2. Top-Level Structure
The ABCD schema is highly structured in order to manage the large quantity of data that a record may contain.
The top level of the schema is arranged as follows:
<Units> # Observations and Specimens
A minimum ABCD record could look like this:
<?xml version='1.0' encoding='UTF-8'?>
From this it can be seen that an XML document based on ABCD may contain records from several datasets, each of which is treated separately. Each dataset has a Globally Unique Identifier (GUID), along with details of who may be contacted about the content of the dataset and about technical matters.
There are then two major groups, one holding metadata about the entire dataset and the other holding the actual data records.
The Metadata section holds information about an entire dataset and has the following structure:
- Icon URI
- Scope (Geo-ecological and Taxonomic)
- Revision data (Creator, Contributors, Creation and Modification dates)
- Intellectual Property Rights (IPR) statements
The second major section, called UNITS, holds all the records selected and exported from the original dataset, each one of which is a UNIT. This is by far the largest component of ABCD and has the following high-level structure:
Here we can distinguish several areas. Most of these do not show up in the actual XML hierarchy, because ABCD 2.06 avoids using container elements that serve only to group items together:
- Unit-level metadata
- Record basis and Kind Of Unit
- Collection domain-specific data
- Unit relationships (Associations and Assemblages)
- Named collections and surveys
- Gathering event and site characteristics
- Measurements and Facts
- Unit extension area
ABCD may at first appear complex to a new user, but once its design principles are understood it becomes clear how well the data held in biological collections fit the structure defined. Where a given piece of information goes will often depend on earlier curatorial decisions about how that information is managed. Usually, though, a dataset corresponds to a collection, and each UNIT within the dataset corresponds to the record of a particular specimen or observation in that collection.
ABCD contains almost no internal referencing and almost no recursive structures. An ABCD document can therefore be treated as a single-rooted tree, which makes processing easier and faster and avoids the inconvenience of resolving the ID-based references found in many relational structures.
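Because each unit record is self-contained, a consumer can in principle handle units one at a time without first resolving references elsewhere in the document. The following is a minimal sketch of that idea in Python, not part of the standard. The sample data are invented, the DataSets and DataSet wrapper names are assumed here, and the XML namespace that real ABCD documents declare is left out for brevity.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample data. Real ABCD documents declare an XML namespace and
# carry far more content; both are left out here for brevity.
SAMPLE = b"""<DataSets>
  <DataSet>
    <Units>
      <Unit><UnitID>1001</UnitID></Unit>
      <Unit><UnitID>1002</UnitID></Unit>
    </Units>
  </DataSet>
</DataSets>"""


def iter_units(stream):
    """Yield each Unit element as soon as its closing tag has been parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "Unit":
            yield elem
            # The caller has finished with this unit once control returns
            # here, so its subtree can be emptied to keep memory use low.
            elem.clear()


unit_ids = [unit.findtext("UnitID") for unit in iter_units(io.BytesIO(SAMPLE))]
print(unit_ids)
```

Each unit is read before the next one is parsed, so the whole document never has to be held in memory as a fully populated tree.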
ABCD was designed to be comprehensive, aiming to define the semantics of all elements to provide a unified approach for the natural history collection community, to accept detailed information (where available) and to develop a proto-ontology, as a first step towards a collection ontology.
The variable atomisation used in ABCD allows data to be provided in different degrees of detail and standardisation, so that data from a wide variety of sources can be accepted and integrated.
The extensible slots included in ABCD should not be used for individual adaptations of the schema. They are intended instead to give the community a fast way to add elements missing from the current version, before these are formally integrated into a subsequent version. The extensible slots also allow third-party schemas (or parts thereof) to be included, to avoid duplicating developments made in other communities (e.g. geographical data).
Throughout ABCD there are also flexible containers that allow freely defined, repeatable data fields suited to the discipline or to the characteristics of the data (e.g. higher taxa, measurements, morphological features). These take the form of element-element or element-attribute pairs.
Besides several specific complex types (such as PersonName or the Monomial elements), ABCD makes frequent use of string elements of particular types: the string extended with a language attribute (StringL), the string extended with a preferred attribute (StringP), and the combination of both (StringLP), each in different lengths (50, 255 and unbounded). The language attribute indicates the language of the text the element contains; the preferred attribute indicates that the element's value is to be preferred over the other values available.
In addition, some elements accept textual data even when an atomised form is impractical or impossible to provide. For this purpose there is provision for free-text data alongside the atomised version. The scientific name element is an example.
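As an illustration of how a consumer might use the language and preferred attributes, here is a hedged Python sketch that picks one value from a repeated string element. The element name Label, the attribute spellings language and preferred, the accepted truth values, and the sample values are all assumptions made for this example; consult the schema for the actual definitions.

```python
import xml.etree.ElementTree as ET

# Invented sample. "Label" is a hypothetical element name, and the
# attribute spellings "language" and "preferred" are assumed.
SAMPLE = """<Labels>
  <Label language="en">Herbarium</Label>
  <Label language="en" preferred="true">General Herbarium</Label>
  <Label language="de">Herbar</Label>
</Labels>"""


def pick_value(elements, language=None):
    """Pick one text value from repeated StringLP-style elements.

    Values in the requested language are considered first; if there are
    none, all values are considered. Among the candidates, a value flagged
    as preferred wins; otherwise the first candidate is returned.
    """
    elements = list(elements)
    candidates = [e for e in elements
                  if language is None or e.get("language") == language]
    if not candidates:
        candidates = elements
    for elem in candidates:
        if elem.get("preferred") in ("true", "1"):
            return elem.text
    return candidates[0].text if candidates else None


labels = ET.fromstring(SAMPLE).findall("Label")
print(pick_value(labels, "en"))
print(pick_value(labels, "de"))
```

Falling back to all values when the requested language is absent is a design choice of this sketch, not a rule laid down by ABCD.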
SourceInstitutionID, SourceID and UnitID are the three elements that together make up the unique Unit record identifier. They correspond, respectively, to the identifier of the institution holding the original data source, the name or code of the data source (unique within the institution), and a unique identifier for the unit record within the data source. These are currently the only mandatory elements at the Unit level. For an example, see the minimum ABCD record shown above.
All the usual gathering information should be recorded within the gathering element in the Unit element. This includes, but is not limited to, agents (collectors), dates, method, locality, site coordinates and altitude. Additional information such as permits, project, depth, height, image references, aspect and notes can also be provided here.
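The three-part identifier can be read from a unit and combined into a single key, for example to detect duplicates when merging data from several providers. The sketch below is one possible approach in Python; the function name unit_key and the sample values are invented, and the XML namespace used by real ABCD documents is omitted.

```python
import xml.etree.ElementTree as ET

# The three elements that together identify a unit record.
ID_FIELDS = ("SourceInstitutionID", "SourceID", "UnitID")

# Invented sample values, shown without the ABCD namespace.
SAMPLE_UNIT = """<Unit>
  <SourceInstitutionID>INST</SourceInstitutionID>
  <SourceID>COLL</SourceID>
  <UnitID>1001</UnitID>
</Unit>"""


def unit_key(unit):
    """Return (SourceInstitutionID, SourceID, UnitID) for a Unit element.

    Raises ValueError if any of the three mandatory elements is missing
    or empty.
    """
    values = []
    for name in ID_FIELDS:
        text = (unit.findtext(name) or "").strip()
        if not text:
            raise ValueError(f"Unit is missing mandatory element {name}")
        values.append(text)
    return tuple(values)


key = unit_key(ET.fromstring(SAMPLE_UNIT))
print(key)
```

A tuple is used rather than a joined string so that no separator character can clash with characters occurring in the identifiers themselves.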
Identification-related information is recorded in the Identifications section of the Unit element. Both the current identification(s) and the identification history can be recorded here. Note that the result of an identification event goes into the Identifications/Identification/Result/TaxonIdentified element, which can hold the higher taxa and the scientific name (or an informal name, when the latter is not available). The scientific name can be given either as the full scientific name string or as an atomised name, with subtypes according to the corresponding Bacterial, Botanical, Zoological or Viral Code.
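The path to the identification result can be followed programmatically. The Python sketch below collects the full scientific name string of every identification of a unit. It is an illustration only: the sample omits the ABCD namespace, and the ScientificName wrapper element between TaxonIdentified and FullScientificNameString is an assumption, which is why the code searches all descendants of TaxonIdentified rather than a fixed child path.

```python
import xml.etree.ElementTree as ET

# Sample without the ABCD namespace. The ScientificName wrapper element is
# an assumption. Note that "&" must be written as "&amp;" in XML.
SAMPLE_UNIT = """<Unit>
  <Identifications>
    <Identification>
      <Result>
        <TaxonIdentified>
          <ScientificName>
            <FullScientificNameString>Acantholimon lycaonicum Boiss. &amp; Heldr.</FullScientificNameString>
          </ScientificName>
        </TaxonIdentified>
      </Result>
    </Identification>
  </Identifications>
</Unit>"""

TAXON_PATH = "Identifications/Identification/Result/TaxonIdentified"


def scientific_names(unit):
    """Return the full scientific name string of each identification."""
    names = []
    for taxon in unit.findall(TAXON_PATH):
        # Search all descendants so the exact nesting below
        # TaxonIdentified does not have to be known.
        name = taxon.findtext(".//FullScientificNameString")
        if name and name.strip():
            names.append(name.strip())
    return names


names = scientific_names(ET.fromstring(SAMPLE_UNIT))
print(names)
```

A unit with an identification history would yield several names; deciding which of them is the current identification is left out of this sketch.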
Biological Collections Access Service for Europe http://www.biocase.org
Biological Collections Access Service http://www.biocase.org
Biological Collection Information Service in Europe http://www.bgbm.org/BioCise/
International Council for Science: Committee on Data for Science and Technology http://www.codata.org/
A simple set of data element definitions designed to support the sharing and integration of primary biodiversity data http://darwincore.calacademy.org/
Distributed Generic Information Retrieval http://www.digir.net
European Natural History Specimen Information Network http://www.bgbm.org/BioDivInf/projects/ENHSIN/
Global Biodiversity Information Facility
Geography Markup Language http://www.opengeospatial.org/standards/gml
Global Unique Identifier http://en.wikipedia.org/wiki/GUID
Herbarium Information Standards and Protocols for Interchange of Data
Markup Language http://www.w3.org/MarkUp/
The International Transfer Format for Botanic Garden Plant Records http://ww.bgbm.org/TDWG/acc/itf2-d32.doc
Integrated Taxonomic Information System http://www.itis.usda.gov/
ITIS Canada http://www.cbif.gc.ca/pls/itisca/
Scripting language (originally called LiveScript) developed by Netscape Communications for use with the Navigator browser http://www.mozilla.org/js/
Life Science Identifier http://lsid.sourceforge.net/
Open Geospatial Consortium, Inc. http://www.opengeospatial.org/
Red Mundial de Información sobre Biodiversidad http://www.conabio.gob.mx/remib/doctos/remib_esp.html
Structure of Descriptive Data. A TDWG XML-based interoperability standard for descriptive data
Synthesis of Systematic Resources http://www.synthesys.info/
Research project developing standards and software tools for access to the world's natural history collection and observation databases http://speciesanalyst.net
Distributed Information System integrating primary data from scientific biological collections http://splink.cria.org.br
Universal Description, Discovery and Integration http://www.uddi.org
Universal Resource Identifiers http://www.w3.org/Addressing/#background
Universal Transformation Format http://www.unicode.org/
Web Feature Service http://www.opengeospatial.org/standards/wfs
Web Map Service http://www.opengeospatial.org/standards/wms
Document Type Definition http://www.w3.org/TR/html4/sgml/dtd.html
A formal definition of the mandatory and optional structure and content of XML formatted documents within its domain.
A typical example for a botanical specimen collected in Turkey:
<?xml version='1.0' encoding='UTF-8'?>
<FullScientificNameString>Acantholimon lycaonicum Boiss. &amp; Heldr.</FullScientificNameString>
<AuthorTeam>Boiss. &amp; Heldr.</AuthorTeam>