Determine which database a hit came from in order to use it in custom links

Volodymyr_Zavidovych · October 9, 2012, 4:43pm

hello

On a certain query I receive hits in 2 different databases:

/path/to/chicken_database.fa
/path/to/zebrafinch_database.fa

I need to link each hit to the corresponding genome browser.

However, in customisation.rb the “options” hash passed to “construct_custom_sequence_hyperlink(options)” looks something like this for a given hit:

{:sequence_id=>"lcl|Un_AADN03021941", :databases=>["/path/to/chicken_database.fa", "/path/to/zebrafinch_database.fa"], :hit_coordinates=>[3329, 3434]}

i.e. it lists ALL databases that had hits, rather than the specific database THIS hit comes from. Hence I cannot link each hit to the corresponding genome browser, because I don’t know which database to link to.

Is there a way around this issue? I apologize if I misunderstood something.

Thanks!
Volodymyr

Ben_Woodcroft · October 9, 2012, 11:04pm

Hi Volodymyr,

Good question. You seem to be understanding the situation just fine.

The problem is that the BLAST HTML output does not assign a database to each hit. Therefore SequenceServer cannot parse out the exact database and give it to you in the options hash as you are requesting (at least when just using the HTML output).

I can see two solutions for this problem. The first, probably the simplest, is to rename all the sequences so that you can derive the name of the database from the sequence identifier, e.g. instead of “Un_AADN03021941” call it “chicken_Un_AADN03021941” or something.
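Something along these lines might work for the renaming approach. This is only a rough sketch: the helper name database_for_hit is made up, and it assumes every sequence has been renamed to "<organism>_<original id>" and that each database file is called "<organism>_database.fa", as in the paths above.

```ruby
# Hypothetical helper (not part of SequenceServer).
# Assumes sequence ids look like "chicken_Un_AADN03021941" and database
# files are named "<organism>_database.<ext>".
def database_for_hit(sequence_id, databases)
  # Drop a BLAST-style "lcl|" prefix and any stray whitespace.
  bare_id = sequence_id.strip.sub(/\Alcl\|/, '')

  databases.map(&:strip).find do |db|
    # "/path/to/chicken_database.fa" -> "chicken"
    organism = File.basename(db, '.*').sub(/_database\z/, '')
    bare_id.start_with?("#{organism}_")
  end
end

options = {
  :sequence_id => 'lcl|chicken_Un_AADN03021941',
  :databases   => ['/path/to/chicken_database.fa',
                   '/path/to/zebrafinch_database.fa']
}
source_db = database_for_hit(options[:sequence_id], options[:databases])
```

The helper returns nil when no database name matches the prefix, so the link code can fall back to a plain, unlinked identifier in that case.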

The second is to use the NCBI tool blastdbcmd to check whether your sequence is contained in each database, i.e. run it once for each database until you find it. The syntax escapes my memory but I’m pretty sure it can be done. If you do end up taking this route, it’d be great if you could share the code!
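Roughly, the blastdbcmd route could look like the sketch below. Treat it as untested guesswork rather than the real thing: find_source_database and BLASTDBCMD_LOOKUP are made-up names, and it assumes blastdbcmd is on the PATH, that the databases were built so that entries can be looked up by id, and that blastdbcmd exits with a non-zero status when the entry is not in the database.

```ruby
require 'open3'

# Assumption: "blastdbcmd -db DB -entry ID" exits non-zero when ID is not
# present in DB. Output is captured and discarded; only the status is used.
BLASTDBCMD_LOOKUP = lambda do |database, sequence_id|
  _output, status = Open3.capture2e('blastdbcmd', '-db', database,
                                    '-entry', sequence_id, '-outfmt', '%a')
  status.success?
end

# Hypothetical helper: returns the first database that contains the hit,
# or nil if none does. The lookup is injectable so it can be stubbed out.
def find_source_database(sequence_id, databases, lookup = BLASTDBCMD_LOOKUP)
  id = sequence_id.strip
  databases.map(&:strip).find { |db| lookup.call(db, id) }
end
```

Inside construct_custom_sequence_hyperlink this would be called as find_source_database(options[:sequence_id], options[:databases]). It runs at most one blastdbcmd process per database for each hit, so caching the result per sequence id may be worth considering if the same hit is rendered more than once.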

Thanks,
ben

Volodymyr_Zavidovych · October 10, 2012, 3:53pm

Hi Ben

Thanks for the response! I imagine that using blastdbcmd to search all relevant databases for each hit would be very computationally expensive, as sometimes there are dozens of hits across several databases, and some of the databases contain more than 30k contigs. So as a short-term solution I’ll rename all sequence identifiers to contain database information as you suggested. I’ll also contact the BLAST people to see if they are planning to include database information for each hit in the HTML output.

Thanks
Volodymyr

Volodymyr_Zavidovych · October 12, 2012, 3:45am

Actually, it seems like blastdbcmd is pretty fast. I implemented the method you suggested on our server and it didn’t slow down the search at all. I’ve submitted a pull request in case you’re interested in adding this feature to the main repo:

https://github.com/yannickwurm/sequenceserver/pull/99

Thanks!
Volodymyr