Life Science Identifiers (LSIDs)
Links
What is an LSID?
From http://lsid.sourceforge.net/:
The Life Sciences Identifier (LSID) is an I3C and OMG Life Sciences Research (LSR) Uniform Resource Name (URN) specification in progress. The LSID concept introduces a straightforward approach to naming and identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of naming schemes in use today. Almost every public, internal, or department-level data store today has its own way of naming individual data resources, making integration between different data sources a tedious, never-ending chore for informatics developers and researchers. By defining a simple, common way to identify and access biologically significant data, whether that data is stored in files, relational databases, in applications, or in internal or public data sources, LSID provides a naming standard underpinning for wide-area science and interoperability.
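To make the naming scheme concrete: an LSID is a URN of the form urn:lsid:authority:namespace:object, with an optional trailing revision. The following is a minimal Python sketch of splitting one into its parts. The class and function names, and the example identifier urn:lsid:my.org:specimens:12345, are made up for illustration and do not come from the LSID software.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of an LSID URN (field names are illustrative)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.split(":")
    # Five fields without a revision, six with one.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn:lsid' prefix is matched case-insensitively here.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    # Treat an empty trailing revision field as "no revision".
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    # Hypothetical identifier, used only as an example.
    print(parse_lsid("urn:lsid:my.org:specimens:12345"))
```

The authority field is the part that the discussion below is mostly about.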
Discussion
Projects using LSIDs
The following projects are actively using LSIDs or experimenting with their use:
Issues with LSIDs
These are some thoughts based on my (RodericPage) experience. They are based on notes I sent to Chris Rawlings, who is part of the Brassica genome consortium (they are investigating LSIDs).
Setting up the software to support LSIDs is trivial for anybody with any experience using Perl. There are a couple of other key steps, some not so trivial.
- You MUST be able to add SRV records to the DNS. This means having a system administrator who is happy to add these records (it's a trivial task). This would only be an issue if the person/organisation serving the data didn't have complete control over the machines it's using (for example, basic server packages provided by commercial internet service providers might not support this). In practice, what you want is control over the domain name from which you serve the LSIDs.
- The best way to serve LSIDs seems to be setting up virtual servers on Apache. This is pretty straightforward (cut and paste a template from http://lsid.sourceforge.net, make a few minor edits, then restart Apache). You'd also want to add a record to the DNS, for example mapping lsid.my.org to the same IP as my.org.
- Then you need to serve the metadata and data, and this is basically a case of writing some Perl to talk to whatever database you are using, and deciding what is metadata and what is data. This will probably be driven by who will be using the data, for example whether you will be using technologies like MyGRID or BioMoby, which make explicit use of LSIDs (note that the current version of IBM's Perl code has trouble with BioMoby LSIDs -- I haven't checked how the more recent code in CVS performs).
This is all fairly easy (as in, easy once you know how), and any programmer with Perl/CGI experience should be able to get something working in an afternoon (I mean, if I can do it, it can't be that hard...).
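As a rough illustration of the SRV step above, here is a small Python sketch that builds the DNS name a client would query for an authority and a matching zone-file line. It assumes the _lsid._tcp service label, which is my understanding of the convention in the LSID spec and should be checked against it. The port, priority, weight and TTL values are placeholder assumptions, and my.org and lsid.my.org are the example names from the list above.

```python
def srv_query_name(authority: str) -> str:
    """DNS name to query for an authority's LSID resolution service.

    Assumes the '_lsid._tcp' service label (verify against the spec).
    """
    return f"_lsid._tcp.{authority.rstrip('.')}."


def srv_zone_line(authority: str, target: str, port: int = 80,
                  priority: int = 1, weight: int = 0,
                  ttl: int = 86400) -> str:
    """Format a standard SRV zone-file line: name TTL IN SRV prio weight port target."""
    return (f"{srv_query_name(authority)} {ttl} IN SRV "
            f"{priority} {weight} {port} {target.rstrip('.')}.")


if __name__ == "__main__":
    # Example names taken from the text; all numeric values are assumptions.
    print(srv_zone_line("my.org", "lsid.my.org"))
```

The generated line is what you would hand to whoever administers the DNS for the domain. The separate record mapping lsid.my.org to an IP address is an ordinary address record and is not shown here.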
Perl is probably adequate for most stuff. I've not done any benchmarking, but it seems to work OK at Glasgow. The LSID metadata that I serve is almost always generated by calling web services on remote machines, hence any performance hit is likely to be the overhead in talking to these machines. If you plan to serve very large data sets (e.g., people would routinely download large chunks of the genome using LSIDs), then you might need to look at streaming data, or using FTP as the protocol to serve data (LSIDs can support HTTP and SOAP, and I think also FTP). I gather the reason LTER used Java and a commercial company was that they were going to serve very large datasets. I might be naive, but my guess is that most LSIDs will be assigned to things where the size of data is actually fairly small (a few kb).
RodPage also raised this on the mailing list:
LSID seems to be bound to DNS.
BobMorris differs:
Some cite the appearance of a URL in an LSID, and the discussion in Sec 13 of the spec (DDNS), as evidence that LSIDs are bound to the DNS and so not futureproof. This seems wrong to me.
First, the URL (actually a URN) is the "authority" part of an LSID. It is not about resolvers; it identifies the issuer. The issuer is an eternal entity whether or not it still exists. You can't change the fact that mobot.org issued some particular LSID. There is no special connection between the issuer and the resolvers except as may arise incidentally for administrative reasons.
Second, the DDNS service described in the spec is not about resolution. It is about locating resolution services. If a resolution service happens to exist in 2030 but the DNS does not, this is utterly without impact on LSID resolution. It only impacts how you find resolvers.
This is rather akin to the fact that most IP addresses are given out by non-authoritative servers. There is always only one authoritative server at any given time, and its IP address can be acquired from a non-authoritative DNS server or from a phone call to your friend. Finding this authoritative IP address is the resolution service location problem. If you happen to know the IP address of the authoritative server for a domain, you don't need any DNS servers at all, except that authoritative server, to find IP addresses for names for which it is authoritative. All the rest is about the discovery of that authoritative resolution service or its proxies. It is a (very important, scalable) performance issue that the other DNS servers near you in the network offer you something you are willing to rely on. (Even that is not such a great idea if you can't trust the chain all the way from that server to the authoritative one. It's technically easy for me to spoof the entire internet if I control all the DNS servers and routers you can connect to. Cf. the Chinese internet.)
I raise this to argue more specifically, and I think slightly more relevantly, in support of Chuck Miller's position against Rod's arguments about future proofing. The analogous situation, I think, is this: imagine in 2030 that IPv6 is in place, but nothing else about the internet is. In particular, imagine nothing like the DNS is in place. The LSID resolvers will all still work. You just won't find them through the DNS. This is not a problem, because the authority part of an LSID is not about resolution. By the way, this is not such a far-fetched scenario, because there are very strong gathering forces internationally to centralize control of the internet, at least on a country by country basis. Controlling discovery of IP addresses is probably the first step, and is why there are arguments about it right now.
Finally, I think there is nothing about GUIDs that implies that the world is entitled to resolve them. We have unlisted numbers for POTS, and while it may violate the spirit of GBIF, and public resolution may be a requirement for GUIDs in the biodiversity community, "unlisted" resolvers would not violate the LSID spec and would probably not violate any others. Public resolvability is likely to be what software engineers call a non-functional requirement of any biodiversity GUID system. Non-functional requirements are genuine requirements for a project which are not requirements about the underlying problem. Ricardo raised universal access as one such in a posting to the mailing list. The widely accepted(?) requirement that GUID issuance should be free of monetary cost is another. Availability of open-source support components might be another. Hopefully the workshop will identify both the functional and non-functional requirements for GUIDs.
BobMorris January 30 2006
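The distinction drawn above, between the authority that names the issuer and the mechanism used to find a resolver, can be sketched in a few lines of Python. This is only an illustration of the argument, not part of any LSID library: find_resolver, the known table and the dns_lookup callback are hypothetical names, and the address 192.0.2.10 is from the range reserved for documentation.

```python
from typing import Callable, Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (host or IP address, port)


def find_resolver(
    authority: str,
    known: Dict[str, Endpoint],
    dns_lookup: Optional[Callable[[str], Optional[Endpoint]]] = None,
) -> Endpoint:
    """Locate a resolution service for an LSID authority.

    The authority is only a key naming the issuer. How the endpoint is
    found (a local table, a phone call, the DNS) is a separate question.
    """
    key = authority.lower()
    # 1. An endpoint learned by any out-of-band means needs no DNS at all.
    if key in known:
        return known[key]
    # 2. Otherwise fall back to whatever discovery service is available.
    if dns_lookup is not None:
        found = dns_lookup(key)
        if found is not None:
            return found
    raise LookupError(f"no resolver known for authority {authority!r}")


if __name__ == "__main__":
    # Hypothetical table: the endpoint is an assumption for illustration.
    table = {"mobot.org": ("192.0.2.10", 80)}
    print(find_resolver("mobot.org", table))
```

In this sketch, removing the dns_lookup argument changes how a resolver is discovered but not what the identifier means, which is the point of the comment above.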
I am looking for arguments why we need LSIDs in the first place, instead of using simple (P)URL-GUIDs with the same social contract about permanence of identifier (not resource) as in LSIDs. The contract for LSIDs is essentially social as well. I fail to see any advantage of LSIDs over community-driven PURLs (persistent URLs). PURLs seem to be standard technology, not raising any of the issues discussed here. I am unconvinced by arguments that LSIDs are not so bad if they are not required in the first place. Please add to TechnologyComparison, but perhaps management or social advantages and disadvantages need to be discussed as well.
-- GregorHagedorn - 14 May 2007
Categories
CategoryLSID