What are LSID vocabularies?
LSID Vocabularies are the building blocks of the TDWG Ontology. They provide the semantics needed to describe the objects exchanged in our community. Their initial use was in encoding the information returned by the getMetadata() call of the LSID resolution mechanism - hence their name - but they can be used in any XML or Semantic Web based technology to express concepts associated with biodiversity.
When a client application resolves an LSID it gains access to the metadata associated with that particular LSID by its owner. For this metadata to be useful the client application must process it. It must read the data and display it to the user or use it in a larger calculation involving other data.
Most useful client applications are likely to combine data of different kinds from multiple sources, i.e. not just consume specimen or observation data but combine it with geographic, phylogenetic, molecular and other data.
If the metadata could be in any arbitrary XML based format then the task of writing a client application would be exceedingly difficult.
To facilitate consumption of metadata associated with LSIDs it is important that metadata from different sources about different things is in the same basic format and can be meaningfully combined.
LSID vocabularies are descriptions of the metadata returned for particular classes of object within the TDWG domain. They enable data about different things from different places to be combined in meaningful ways. They form part of a larger TDWG ontology effort that describes how these classes of data are related.
Technically each LSID Vocabulary is an OWL ontology containing one or more classes and a number of properties whose domain is one of those classes. There is also a set of global properties that can be used with any of the LSID Vocabulary classes.
The metadata returned when the LSID is resolved is an instance of one of these OWL classes containing some or all of the class and global properties. The metadata is RDF encoded in XML. Those unfamiliar with Semantic Web technologies can think of it as an XML document that follows a specific format. Although the vocabularies are defined in terms of Semantic Web technologies (RDF and OWL) the data providers can think of the metadata as pure XML. It is possible for non Semantic Web applications, even simple scripts, to inter-operate with Semantic Web based applications by generating XML that is also valid RDF.
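The point that metadata can be treated as pure XML can be illustrated with a minimal sketch that uses only Python's standard XML parser and no RDF library. The LSID, the sample values, the TaxonName namespace URI and the property name nameComplete are assumptions made for this illustration. They are not taken from a real provider's response.

```python
import xml.etree.ElementTree as ET

# The RDF and Dublin Core namespace URIs are the standard ones. The TaxonName
# namespace URI and the nameComplete property are assumptions for this sketch.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"

# Hypothetical metadata for a made-up LSID.
METADATA = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}" xmlns:tn="{TN}">
  <tn:TaxonName rdf:about="urn:lsid:example.org:names:12345">
    <dc:title>Bellis perennis L.</dc:title>
    <tn:nameComplete>Bellis perennis</tn:nameComplete>
    <tn:authorship>L.</tn:authorship>
  </tn:TaxonName>
</rdf:RDF>"""


def read_properties(xml_text):
    """Return (lsid, class_tag, {property_tag: text}) for the first object."""
    root = ET.fromstring(xml_text)
    node = root[0]  # the first child of rdf:RDF is the described object
    lsid = node.get(f"{{{RDF}}}about")
    props = {child.tag: (child.text or "").strip() for child in node}
    return lsid, node.tag, props


lsid, cls, props = read_properties(METADATA)
print(lsid)
print(cls)
print(props[f"{{{TN}}}authorship"])
```

A script like this only works while providers serialize their RDF in this simple, predictable shape. A client that must accept arbitrary RDF/XML would need a real RDF parser.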
An Important Note
The LSID vocabularies only define new concepts (classes, properties and instances) that are used in the biodiversity informatics domain. It is assumed that they will be used in conjunction with other well known vocabularies such as those provided by the Dublin Core Metadata Initiative. The vocabularies in themselves will probably not provide enough information to define an entire data exchange standard for a thematic network of the kind that currently exists. Extra information (on the use of Dublin Core properties for example) will need to be provided in a human readable document or some other form. See TDWGOntologyGovernance for more information.
The TDWG LSID Vocabularies
This is a list of the current TDWG vocabularies. Some are under development and others are established and in use. TDWGOntologyGovernance describes how this process is managed. Each vocabulary has a wiki page for linking resources and discussion.
| Name || Page || Status || Comment |
| TaxonName || TaxonNameLsidVoc || Available || First vocabulary to be developed. Used by IPNI, Index Fungorum and shortly ZooBank. Based on TCS. In preparation to become the basis for a TDWG Standard. |
| TaxonRank || TaxonRankLsidVoc || Available || Support vocabulary that provides an enumeration of taxonomic ranks. Used by TaxonName vocabulary and may be used by others. Derived from TCS. Will be standardised along with the TaxonName vocabulary. |
| TaxonConcept || TaxonConceptLsidVoc || Available || Central to all taxonomy. Based on TCS. Currently used as an embedded object by the TaxonOccurrence vocabulary. Will be the basis for the Catalogue of Life/Species 2000 LSID service in late 2007. |
| TaxonOccurrence || TaxonOccurrenceLsidVoc || Available || The minimum required to exchange observation and specimen data. Replaces the earlier SpecimenLsidVoc and OccurrenceRecordLsidVoc. Based heavily on Darwin Core. Used by the GBIF web services. |
| Person || PersonLsidVoc || Developmental || Will be based on BiogML along with requirements gathered from the IPNI Authors database and ZooBank. Used embedded in other vocabularies. |
| Team || TeamLsidVoc || Developmental || A collection of individuals in order with roles. Used embedded in other vocabularies. |
| PublicationCitation || PublicationCitationLsidVoc || Developmental || Required by multiple other vocabularies. Will be based on TaxMLit and Levels 1 and 2. |
| Institution || InstitutionLsidVoc || Developmental || Based on NCD |
| Collection || CollectionLsidVoc || Developmental || Based on NCD |
What are the main aims of vocabularies?
There are two main aims:
- Enable the typing of metadata objects associated with LSIDs.
- Enable sufficiently complex mark up of metadata for the returned object to be generally useful.
LSID Vocabularies describe functional packets of data that contain the literal data values it is believed a typical client application will need to do 'something useful' with the metadata object. They may also contain links to other data objects.
Some simple use cases
- Result of Resolution: Client (application or human) finds data tagged with an LSID and resolves the LSID to discover what it refers to. The resulting metadata is displayed to the human user or used in a larger calculation.
- Data Cleaning: Client has data (possibly corrupt/dirty) about an object (e.g. PublicationCitation or TaxonName) and would like to tag their data internally with LSIDs from a source that is authoritative for this kind of data (e.g. library service or nomenclator). Client submits object encoded according to LSID vocabulary. Authority returns possible matches for data. Client chooses a match (automatically or manually) and tags their data with the appropriate LSID.
- Data Indexing: Client caches metadata about LSIDs for better search performance. Only preferred axioms for any LSID are stored. Results of search are linked to providers via LSID. Linking of data from multiple heterogeneous data sources allows for innovative query expansion - possibly using inference.
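The final step of the Data Cleaning use case, choosing a match automatically, might look something like the sketch below. It assumes the authority has already returned a list of candidate (LSID, name) pairs. The LSIDs, the names and the 0.8 similarity threshold are made up for illustration, and plain string similarity stands in for whatever matching rules a real client would apply.

```python
from difflib import SequenceMatcher


def best_match(dirty, candidates, threshold=0.8):
    """Return the (lsid, label) candidate most similar to `dirty`, or None.

    `candidates` is a list of (lsid, label) pairs, as a hypothetical
    authority might return them. None is returned when nothing is similar
    enough, so the record can be left for manual review.
    """
    def score(label):
        return SequenceMatcher(None, dirty.lower(), label.lower()).ratio()

    if not candidates:
        return None
    lsid, label = max(candidates, key=lambda c: score(c[1]))
    return (lsid, label) if score(label) >= threshold else None


# Hypothetical candidates returned by a nomenclator for a misspelt name.
candidates = [
    ("urn:lsid:example.org:names:1", "Bellis perennis"),
    ("urn:lsid:example.org:names:2", "Bellis annua"),
]
match = best_match("Belis perenis", candidates)
no_match = best_match("Quercus robur", candidates)
print(match)
print(no_match)
```

Returning None below the threshold keeps the automatic path conservative. Doubtful records fall through to the manual choice the use case also allows.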
What guarantees do LSID Vocabularies bring?
The metadata returned by an LSID should contain an rdf:type property that stipulates which LSID vocabulary class this metadata represents an instance of. If this is lacking then the LSID does not refer to an instance of a TDWG recognized class.
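A client could check for the type declaration along the following lines. This is a sketch under stated assumptions. It inspects the RDF/XML as plain XML rather than using a full RDF parser, and it assumes TDWG classes share the namespace prefix shown. The LSIDs are made up. It handles both ways RDF/XML can state a type: an explicit rdf:type child, and a typed node element such as tn:TaxonName, which is shorthand for the same statement.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed common prefix for the class URIs a client treats as TDWG classes.
TDWG_PREFIX = "http://rs.tdwg.org/ontology/voc/"


def declared_types(node):
    """Collect the class URIs that an RDF/XML node element declares."""
    types = []
    tag = node.tag
    # A typed node element (anything other than rdf:Description) implies
    # rdf:type; its class URI is the namespace URI plus the local name.
    if tag.startswith("{") and tag != f"{{{RDF}}}Description":
        namespace, local = tag[1:].split("}", 1)
        types.append(namespace + local)
    # Explicit <rdf:type rdf:resource="..."/> children.
    for child in node.findall(f"{{{RDF}}}type"):
        resource = child.get(f"{{{RDF}}}resource")
        if resource:
            types.append(resource)
    return types


def tdwg_types(xml_text):
    """Return the TDWG class URIs declared by the first described object."""
    node = ET.fromstring(xml_text)[0]
    return [t for t in declared_types(node) if t.startswith(TDWG_PREFIX)]


EXPLICIT = f"""<rdf:RDF xmlns:rdf="{RDF}">
  <rdf:Description rdf:about="urn:lsid:example.org:names:1">
    <rdf:type rdf:resource="{TDWG_PREFIX}TaxonName#TaxonName"/>
  </rdf:Description>
</rdf:RDF>"""

TYPED_NODE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:tn="{TDWG_PREFIX}TaxonName#">
  <tn:TaxonName rdf:about="urn:lsid:example.org:names:2"/>
</rdf:RDF>"""

UNTYPED = f"""<rdf:RDF xmlns:rdf="{RDF}">
  <rdf:Description rdf:about="urn:lsid:example.org:names:3"/>
</rdf:RDF>"""

for name, doc in [("explicit", EXPLICIT), ("typed node", TYPED_NODE), ("untyped", UNTYPED)]:
    print(name, tdwg_types(doc))
```

An empty result means the client cannot treat the object as an instance of a TDWG class, whatever else the metadata contains.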
Beyond this the LSID vocabularies do not guarantee anything about the data received. They enable a data supplier to mark up data, that is, to express what they are trying to pass. A consumer of the data has no way to validate that the information is 'correct' because the notion of correctness or validity is likely to change between consumers. Data must be fit for a purpose, not for all purposes.
It should be possible, however, to use ontologies you have created yourself or that are supplied by a third party to check that the data is fit for a particular purpose. Such ontologies that become generally useful should become TDWG standards.
LSID vocabularies do come with copious notes and recommendations on which properties should be supplied under what circumstances and what the properties should contain. Reputable data providers will follow these guidelines.
How do LSID Vocabularies relate to current TDWG standards?
All properties in LSID vocabularies should contain notes on how they relate to existing XML Schema based standards such as ABCD, DarwinCore, Taxon Concept Schema and others, thus facilitating manual mapping. In the future there may be automatic mapping techniques. TAPIR output models may be one way to achieve this.
There is a mapping between DarwinCoreDraftStandard and the LSIDVocs on the LsidVocsDarwinCore page.
How do LSID Vocabularies relate to external vocabularies like Dublin Core?
The use of external vocabularies is positively encouraged and documented.
How do I use LSID Vocabularies?
This is dealt with on a separate page - LsidVocsUsage. If you are interested in exploring the vocabularies to get a feel for what they are from a technical point of view, have a look at LsidVocsExploring.
"There isn't a class/property to express what I want to say."
If you believe that there is a general need for a new class or property to be added then the best thing to do is define what you want in your own namespace and demonstrate it working with data in your own LSID authority. If the community as a whole finds what you have done useful then it should be easy to move it into the TDWG namespace and encourage others to use it.
If you need help in doing this then someone on the TAG or LSID mailing lists will be pleased to help you out.
What are global properties?
Some properties can occur in any class and, most importantly, have the same meaning no matter which class they occur in. There are two types of global properties:
+ External Properties
These come from other vocabularies. The properties provided by DublinCore are an example of this.
+ Common Properties
These are defined within the Common vocabulary for use with any class.
The usage of both types of properties (such as recommending the use of dc:title) should be documented externally to the LsidVocs.
This section gives additional justification and explanation for some aspects of the vocabularies. You may wish to read it only out of interest or if you are designing your own classes for use with LSID metadata.
What is the Literal and Link design pattern?
There is a fundamental problem in modeling classes of objects to be returned by LSIDs that is related to, but differs from, the normalization of a relational database model. When the metadata for an LSID is requested the consumer needs to receive enough data to act on. If the data model is fully normalized and the LSID only returns its direct properties then the data returned may only consist of links to other resources that have to be traversed before anything 'useful' by way of data literals is discovered. This can be likened to making simple SQL selects on single tables in a complex relational database one at a time: never being able to carry out a join operation on the server, always having to run another query with the foreign keys as values.
One solution to this problem is for the data provider to return more than just the values in the immediate properties of the object - to return some embedded objects as well. For this solution to work two further problems have to be solved:
- There must be some way of designating which objects the server must embed. The client would have to pass this information along with its request or the returned graph would have to be defined somewhere beforehand.
- It is unclear what the server should do if it doesn't 'own' the embedded objects. If the objects referred to by properties of the requested LSID are owned by another LSID authority should the server attempt to retrieve them in order to build a complete object graph for the client or should it leave them as just links? It is actually desirable that the server does not own the linked data objects - see below.
The LSID Vocabularies take another approach. Each class is designed to be a package of useful literals about the object as direct properties. Clearly which literals are useful is a design time decision but there seems to be enough similarity amongst potential applications to make this call and properties can always be added at a later date. In addition to the literal values the classes also have properties that are links to other objects. Sometimes these objects are fuller versions of values that also appear in literals.
For example, a TaxonName object has the property TaxonName::authorship, which contains the author string for the scientific name. In most cases the client application will just want this string. Typically it is needed separately from the rest of the name so that it can be rendered in a different font in the user interface. Some data providers have more than just a string for the authors, though, and some clients may be interested in having this data. To support this there is the TaxonName::authorTeam property, which has the range of Actor.
This is what is termed the Literal and Link design pattern. Wherever it is felt that the majority of communication needs will be served by a literal representation of a property value, such a property is included even if there is another similar property that links to an object. The two properties are linked together in the vocabulary by the dc:relation property.
This design pattern would be bad in a closed world environment. To design a relational database model in which many foreign key fields had an associated literal field would be madly redundant and a nightmare to maintain. In an open world model this pattern is not crazy primarily because the data may reside not only on different machines but be owned by different organizations. This spreading of data is a desirable situation and should be encouraged. For LSIDs to be effective it is important that data suppliers reuse the LSIDs of other suppliers as often as possible.
Take the TaxonName::authorship example. A server may provide both a string in the TaxonName::authorship property and a link to an AuthorTeam in the TaxonName::authorTeam property, but the link may be to another service run by a separate organization. The server is saying that it considers the string held in its database to be a representation of the data object supplied by another authority.
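From the client's side, the Literal and Link pattern could be sketched as below. The class name, field names, LSIDs and the resolve callable are all hypothetical. They stand in for however a real client stores parsed metadata and resolves LSIDs. The point of the sketch is that the common case reads the literal with no further resolution, while the link is followed only when the fuller object is wanted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaxonNameMetadata:
    """Client-side view of a resolved TaxonName (hypothetical structure)."""
    lsid: str
    authorship: str                    # literal: the author string, ready to display
    author_team: Optional[str] = None  # link: LSID of a fuller object, possibly held elsewhere


def author_display(name: TaxonNameMetadata) -> str:
    """Common case: use the literal; nothing else needs resolving."""
    return name.authorship


def author_details(name: TaxonNameMetadata,
                   resolve: Callable[[str], dict]) -> Optional[dict]:
    """Rarer case: follow the link to fetch the structured author data."""
    if name.author_team is None:
        return None
    return resolve(name.author_team)


# A stand-in resolver that records which LSIDs were looked up.
resolved = []

def fake_resolve(lsid: str) -> dict:
    resolved.append(lsid)
    return {"lsid": lsid, "members": ["Carl Linnaeus"]}


name = TaxonNameMetadata(
    lsid="urn:lsid:example.org:names:12345",
    authorship="L.",
    author_team="urn:lsid:other.example.net:teams:77",
)
print(author_display(name))    # uses the literal only
print(resolved)                # still empty: no resolution was needed
print(author_details(name, fake_resolve))
```

The linked LSID here deliberately belongs to a different authority from the name itself, matching the case where the string and the fuller object are held by separate organizations.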
Tagging Rather Than Extending
When modeling relationships within a domain the first decision that has to be made is whether it is an 'is a' relationship or a 'has a' relationship. It is good to have a policy on this matter rather than having to make the decision afresh every time a new relationship is created. A description and discussion of this policy is on the SubclassOrNot page.