sequence name variable in links.rb?

Hi there,

I’m looking to customize links.rb and see: "whichdb is slow. Alternative is to encode db info (a short name) in the sequence id, and use regex matching to decide which database a hit came from.", which I’ve done, but I don’t see a reference in that help, nor the sample links.rb file as to what variable(s) to use to access that name?

Is it https://www.rubydoc.info/github/wurmlab/sequenceserver/SequenceServer/Sequence#accession-instance_method? Perhaps the documentation could be clarified, maybe even with some links to the rdoc? Or perhaps add some comments in the example links.rb file?

Thanks!
Matt

Hi Matt,

I agree that the documentation can use an update in this regard. I guess once the release sequence for version 2.0 is done, I will turn my attention to documentation and other maintenance activities. In the meantime I have a few pointers for you below.

You want to be looking at the Hit class for custom link related functionality, specifically the accession, id, and title methods. id is everything up to the first space character in the FASTA sequence header; title is everything after. accession is typically the same as id, except when you use an NCBI-like format for id. The important thing here is that id + title gives you the whole FASTA sequence header.
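To make the id/title split concrete, here is a minimal standalone sketch in plain Ruby. It only mirrors the description above; `split_header` is a hypothetical helper written for illustration and is not part of SequenceServer, and the header string is made up.

```ruby
# Hypothetical helper (not part of SequenceServer): split a FASTA header
# into id (up to the first space) and title (everything after).
def split_header(header)
  # Drop the leading '>' if present, then split on the first whitespace run.
  id, title = header.sub(/\A>/, '').split(' ', 2)
  [id, title.to_s] # title is empty when the header contains only an id
end

id, title = split_header('>Sinv_contig0001 Solenopsis invicta contig 1')
puts id    # the part a link generator would usually match against
puts title # free text, which may also carry encoded information
```

Inside a link generator you would not need such a helper, since id and title are already available as methods.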

Using a regular expression you can derive any data encoded in the FASTA sequence header, be it in the id or in the title. For example, we add a four letter species identifier to all sequence ids on antgenomes.org, so ‘contig0001’ becomes ‘Sinv_contig0001’ and ‘mRNA0001’ becomes ‘Cflo_mRNA0001’. Here Sinv and Cflo stand for the ant species Solenopsis invicta and Camponotus floridanus respectively.

Now in a link generator function (any non-private function in module SequenceServer::Links) you can get the first part as follows:

match_parts = id.match(/^([a-zA-Z]{4})_/)
species_identifier = match_parts && match_parts[1]  # nil if the id has no such prefix

And act on it using if-else, case-when, etc:

case species_identifier
when 'Sinv'
  # build the link for Solenopsis invicta hits
when 'Cflo'
  # build the link for Camponotus floridanus hits
end

This can be as simple or as complex as you need, depending on how well structured and consistent your sequence headers are and which external URLs you are linking to.
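Putting the regex match and the dispatch together, a rough standalone sketch could look like the following. Everything named here is an assumption for illustration: `species_link`, the `SPECIES_URLS` table, and the example.org URLs are hypothetical, and the keys of the returned hash are only a guess at what a link generator returns, so check the sample links.rb for the exact shape. In a real link generator, id would be the method available in the module rather than an argument.

```ruby
# Hypothetical lookup table: species prefix => URL template.
# The example.org URLs are placeholders, not real endpoints.
SPECIES_URLS = {
  'Sinv' => 'https://example.org/sinv/%s',
  'Cflo' => 'https://example.org/cflo/%s'
}.freeze

# Hypothetical generator: returns a hash describing the link, or nil when
# the id does not carry a known four letter species prefix.
def species_link(id)
  match_parts = id.match(/^([a-zA-Z]{4})_(.+)$/)
  return nil unless match_parts # id has no species prefix at all

  species_identifier = match_parts[1]
  original_id = match_parts[2]

  template = SPECIES_URLS[species_identifier]
  return nil unless template # prefix present, but not one we link for

  { title: "#{species_identifier} genome browser",
    url: format(template, original_id) }
end

p species_link('Sinv_contig0001')
p species_link('contig0001')
```

A lookup table keeps the per-species differences in one place; a case-when works just as well when each species needs different logic rather than just a different URL.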

Here’s a real example. It does use whichdb, but it also illustrates the aspects I discussed above: https://github.com/yeban/sequenceserver/blob/fourmidable/lib/sequenceserver/links.rb (sorry, not the simplest code I have written). You want to be looking at the Hymenopterabase and Ensemble bits.

The NCBI link generator from the above links is now part of the main SequenceServer code base.

The distinction between accession and id is better documented in the Sequence class which you linked to, but read the source file directly rather than the rdoc. This is simply because the distinction became important in the Sequence class, which is used for sequence retrieval functionality. We are quite consistent with the terminology, so the concepts apply back to Hit as well.

Priyam

Perfect, thanks so much, including the example. All well understood.

Cheers,
Matt