Multi-part database volume

Hi Yannick, Anurag, Cedric and Ben,

Many thanks for developing and maintaining SequenceServer. It’s been very nice to have an alternative to the www-blast package.

Got a quick question. I’ve set up a custom .nal file using the directions in the SequenceServer FAQ to aggregate ~45 individual BLAST DBs that were created via “sequenceserver format-databases”.

The structure of the all_dbs.nal file is:

TITLE all_dbs
DBLIST /path/to/db1 /path/to/db2 …

After restarting Apache and revisiting my SequenceServer page, the alias isn’t shown and I get the following in my httpd/error_log:

I, [2012-08-27T14:19:05.170846 #27568] INFO -- : Found a multi-part database volume at /path/to/all_dbs - ignoring it

All the single DBs are correctly identified and on the page, so that should be OK.

I updated SequenceServer less than a week ago, so I don’t think the issue is a dated installation.

I see that Mark Anthony Gibbins added this function and is running a fork of the project. Have his changes been integrated into SequenceServer mainline? If so, any suggestions on what might be the issue?

Thanks in advance,

Halocaridina

Hi Halocaridina,

Would you be able to post a few things for me:

  1. The full filenames of the database parts, and their alias file.
  2. Contents of the alias file.
  3. Any other relevant lines from the SS log.

I couldn’t determine from your message, but the alias file and database parts must be in the same path. The specifications of the multipart alias file from NCBI force this.

Also, the log message is simply a notification that a multipart database was found so this looks to be a different problem.

Thanks!
Mark

I couldn't determine from your message, but the alias file and database
parts must be in the same path.

Right. But the FAQ[1] entry "can I use preformatted BLAST database"
contradicts this. So most likely Halocaridina's alias file is in a
separate directory, while the volumes are in the database directory. And
probably that is what the issue is: SS ignores the volumes, and never
sees the alias file, which is in a separate directory.

I think we should remove that FAQ entry. For one, the alternative
solution is confusing and less preferable since SS handles it out of
the box (thanks to Mark). Second, the FAQ makes it sound like an edge
case while it is not. Third, let's not suggest to users that modifying
alias files themselves is ok. It's not. It incurs additional
administrative burden. Yannick?

The specifications of the multipart alias
file from NCBI force this.

I don't think so. Alias files are plain text files with pointers to
the volumes. If the pointers are absolute paths, BLAST+ can find them
regardless of whether alias and volumes are in the same directory or
not. Yannick's solution (the FAQ) suggests the same.
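
To make the point about absolute pointers concrete, here is a minimal sketch that assembles the text of an alias file in the TITLE/DBLIST layout shown earlier in the thread. The helper name `build_alias_text` and the `/data/blast/...` paths are made up for illustration; they are not part of SequenceServer or BLAST+, and the sketch assumes paths without spaces.

```python
from pathlib import PurePosixPath


def build_alias_text(title, db_paths):
    """Return alias-file text with one TITLE line and one DBLIST line."""
    # Each entry is a database basename: absolute path, no extension.
    for path in db_paths:
        if not PurePosixPath(path).is_absolute():
            raise ValueError(f"expected an absolute path, got {path!r}")
    return f"TITLE {title}\nDBLIST {' '.join(db_paths)}\n"


# Hypothetical database locations, for illustration only.
dbs = ["/data/blast/db1", "/data/blast/db2"]
alias_text = build_alias_text("all_dbs", dbs)
print(alias_text, end="")
```

Writing `alias_text` to a file ending in `.nal` (nucleotide) in any directory should then be enough for BLAST+ to resolve the volumes, since every entry is absolute.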

[1]: http://www.sequenceserver.com/#faq

I couldn’t determine from your message, but the alias file and database
parts must be in the same path.

Right. But the FAQ[1] entry “can I use preformatted BLAST database”
contradicts this. So most likely Halocaridina’s alias file is in a
separate directory, while the volumes are in the database directory. And
probably that is what the issue is: SS ignores the volumes, and never
sees the alias file, which is in a separate directory.

Agreed, it sounds like a path-related problem to me.

I think we should remove that FAQ entry. For one, the alternative
solution is confusing and less preferable since SS handles it out of
the box (thanks to Mark). Second, the FAQ makes it sound like an edge
case while it is not. Third, let’s not suggest to users that modifying
alias files themselves is ok. It’s not. It incurs additional
administrative burden. Yannick?

Indeed - perhaps we could warn users if a multipart database is found without a corresponding alias file? And vice-versa.

And with regard to the FAQ, we could change it to emphasise that if you are using multipart databases,
the alias file must be stored under the SS db path regardless of where the parts are.

What do you think?

The specifications of the multipart alias
file from NCBI force this.

I don’t think so. Alias files are plain text files with pointers to
the volumes. If the pointers are absolute paths, BLAST+ can find them
regardless of whether alias and volumes are in the same directory or
not. Yannick’s solution (the FAQ) suggests the same.

Ah so it does! For some reason I was sure I came across it saying they had to be in the same path as the alias file. Apologies.

Hi Mark and Anurag,

Thanks for the quick reply. Answers to your questions are:

  1. The .nal file is in the same directory as the DBs created using “sequenceserver format-databases”.

  2. Each DB has six files associated with it having extensions: .nhr .nin .nog .nsd .nsi .nsq

  3. For each DB in that directory, there is a symlink to the fasta file that was used to build the database and the symlink shares the same name as each DB. Thinking that having the symlinks and DBs sharing the same name in the same directory might be the problem, I moved the symlinks out of the directory, restarted Apache and revisited the SS page. Same “- ignoring it” message.

  4. The .nal file has the following structure:

DBLIST /home/data_processed/db_4_blast/Abarenicola_pacifica /home/data_processed/db_4_blast/Alciopa_spp [absolute paths to the other 43 DBs, each separated by a space]

  5. I’m using the basenames of each DB (i.e., no extensions) with an absolute path in the .nal file, which should be correct syntax as Anurag pointed out.

Am I just missing something obvious?

Thanks again,

Halocaridina

I think I have an idea of the problem now, but just to confirm, would you be able to either list the databases SS shows in the web interface or upload a screenshot? imgur.com is fine if you don’t mind it being public.

This probably has something to do with the way SS determines what is and isn’t part of a database set.

Hi Mark,

Sure, the list is:

Abarenicola_pacificaAlciopa_sppAncistrosyllis_groenlandicaAphroditida_japonica
Axiothella_rubrocincta
Boccardia_proboscidea
Chaetozona_spp
Chrysopetallid_colormorph1
Clymenella_torquata
Cossura_longicirrata
Delaya_leruthi
Enchytraeus_albidus
Galathowenia_oculata
Galeolaria_caespitosa
Glycera_dibranchiata
Glycinde_armigera
Goniada_brunnea
Halosydna_brevisetosa
Heteromastus_filiformis
Leitoscolopus_robustus
Lumbrineris_crassicephana
Magelona_beckleyi
Myxicola_infundibulum
Nainereis_laevigata
Nephtys_incisa
Nereis_succinea
Ninoe_nigrens
Paramphinome_jeffreysii
Pectinaria_gouldii
Phascolosoma_agassizii
Poeobius_meseres
Pomatoleios_kraussii
Sabella_pacifica
Scalibregma_inflata
Schizobranchia_insignis
Sparganophilus_spp
Sternapsis_scutata
Sthenalanella_uniformis
Syllis_cf_halina
Terebellides_stoemi
Themiste_pyroides
Thysanocardia_nigra
Tomopteris_spp

All 43 listed here are present in the SS web interface in the exact same order and spelling.

Glad we are getting to the source of the issue and hope this helps the project.

Cheers,

Halocaridina

Hi Mark,

Paste error. The three on the first line should have been on three separate lines.

As I suspected, this seems to be a problem with the way SS handles alias files that alias databases
that don’t match the NCBI format for multi-part databases (nr.00, nr.01 etc.). It is obviously an extension of the multi-part
bug, as SS should really be querying any alias file it finds so that it can ignore the individual databases it lists in favour of using the alias file.

This is why you see the ‘ignoring it’ log message - this is where SS is ignoring a db volume in favour of using the alias.
Unfortunately it doesn’t account for using aliases as a way of aggregating a large number of individual databases!
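
A rough sketch of what "querying the alias file" could look like: read the DBLIST line(s), then separate the discovered databases into those the alias already covers and those that stand alone. This is not SequenceServer's code; `parse_dblist`, `split_listing` and the `Some_other_db` path are hypothetical, and the handling of `#` comment lines and double-quoted entries is an assumption about the alias file layout.

```python
import shlex


def parse_dblist(alias_text):
    """Return the database paths named on the DBLIST line(s) of an alias file."""
    entries = []
    for line in alias_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)
        if parts[0] == "DBLIST" and len(parts) == 2:
            # shlex.split also copes with entries wrapped in double quotes.
            entries.extend(shlex.split(parts[1]))
    return entries


def split_listing(discovered, alias_text):
    """Split discovered databases into (covered by the alias, standalone)."""
    aliased = set(parse_dblist(alias_text))
    covered = [db for db in discovered if db in aliased]
    standalone = [db for db in discovered if db not in aliased]
    return covered, standalone


alias_text = (
    "TITLE All Databases\n"
    "DBLIST /home/data_processed/db_4_blast/Abarenicola_pacifica "
    "/home/data_processed/db_4_blast/Alciopa_spp\n"
)
discovered = [
    "/home/data_processed/db_4_blast/Abarenicola_pacifica",
    "/home/data_processed/db_4_blast/Alciopa_spp",
    "/home/data_processed/db_4_blast/Some_other_db",  # hypothetical extra DB
]
covered, standalone = split_listing(discovered, alias_text)
print(covered)
print(standalone)
```

Whether the covered databases should be hidden or listed alongside the alias is a design decision; in this thread the individual databases are wanted in the list as well.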

So I’ll start working on this, but in the meantime, if you want to get around the bug for now and all you want to do is blast
all the databases you listed, you can use this alias file: http://pastebin.com/8J3FTa4J

It’s picked up on my local server at the top of the list, so if you want to blast all databases, just select “All Databases”.
It’s ugly, but it works (or should do) :)

If you have any more problems, let me know.

Thanks,
Mark

Hi Mark,

Appreciate the effort. Unfortunately, error still persists using the alias that you provided. I actually tried the same thing yesterday (removing the absolute paths) and got the “-ignoring it” message.

One interesting behavior I noticed yesterday, which started me down the path of trying to use an alias file and might shed some light on what’s going on.

The individual FASTA files that I’m building the DBs from have anywhere from 40K-200K entries. The DBs created from them using “sequenceserver format-databases” are picked up by SS and work fine from the web interface.

Yesterday morning, I concatenated all 45 FASTA files into a single FASTA file (headers have taxon-specific tags, so easy to track) and ran “sequenceserver format-databases” against that ~4.5GB file to generate a “mega” DB. Once created in the SS DB directory where all of the other DBs are, I restarted Apache and revisited the SS page and “mega” DB wasn’t listed while everything else was.

The error was exactly the same as with the alias file: “Found a multi-part database volume at /path/to/SS/db_directory - ignoring it”

So, two different means of generating the same error. Could it be something size related, either as a single DB or the total of multiple smaller DBs?

Cheers,

Halocaridina

Hi Mark,

Appreciate the effort. Unfortunately, error still persists using the alias that you provided. I actually tried the same thing yesterday (removing the absolute paths) and got the “-ignoring it” message.

One interesting behavior I noticed yesterday, which started me down the path of trying to use an alias file and might shed some light on what’s going on.

The individual FASTA files that I’m building the DBs from have anywhere from 40K-200K entries. The DBs created from them using “sequenceserver format-databases” are picked up by SS and work fine from the web interface.

Yesterday morning, I concatenated all 45 FASTA files into a single FASTA file (headers have taxon-specific tags, so easy to track) and ran “sequenceserver format-databases” against that ~4.5GB file to generate a “mega” DB. Once created in the SS DB directory where all of the other DBs are, I restarted Apache and revisited the SS page and “mega” DB wasn’t listed while everything else was.

This is very odd, as any validly formatted db should be detected by SS, regardless of its size.

The error was exactly the same as with the alias file: “Found a multi-part database volume at /path/to/SS/db_directory - ignoring it”

That message is only printed when SS finds a multi-part database with a filename that matches the regex I created, so it must be finding a pre-formatted database in your db folder.

A copy of the full log from startup would be great, that way I can trace exactly what’s happening.

You may also want to try cloning the latest code from github if you haven’t already.

Hi Mark,

Thanks again for the help. I eventually got everything worked out and the alias file that you sent is now working.

I feel a little stupid since it came down to a simple renaming of the .nal file. Originally, I had just used the same .nal file name and pasted the alias information you provided into it. It dawned on me last night after going through this thread that might be the problem.

Specifically, I changed the filename from “All_taxa_08_27_12.nal” to “All_taxa.nal” and the alias was picked up and added to the SS list of available DBs (I also checked that it indeed worked by doing a BLAST). To verify that the old “All_taxa_08_27_12.nal” name was really the problem, I renamed the now working “All_taxa.nal” back to “All_taxa_08_27_12.nal”, restarted SS and it was ignored/disappeared from the DB list. Renamed it again to “All_taxa.nal”, restarted and again the alias works.

Is this expected behavior?

Cheers, thanks again and apologies for wasting your time,

Halocaridina

This isn’t expected behaviour, but I’m glad you found the cause! It’s exposed a problem in the way SS handles aliases so I’m glad you let us know - no wasted time at all.

I expect the reason it wasn’t being found was something to do with the naming of the alias file. It most likely triggered the regular expression to match it as a volume, rather than an alias.

Once this happens, SS ignores the file completely. This explains why it wasn’t showing up on the web interface and you had that message in your log about ignoring a file, despite you not having any
pre-formatted volumes in your db folder.

Glad we could help and you’ve got it all moving now, and thanks for exposing a pretty annoying bug. I had a feeling this wouldn’t be a complete fix.

Let us know if you need any more help or run into any more problems :)

Cheers,
Mark

Halocaridina and Mark,

Specifically, I changed the filename from "All_taxa_08_27_12.nal" to
"All_taxa.nal" and the alias was picked up and added to the SS list of
available DBs (I also checked that it indeed worked by doing a BLAST). To
verify that the old "All_taxa_08_27_12.nal" name was really the problem, I
renamed the now working "All_taxa.nal" back to "All_taxa_08_27_12.nal",
restarted SS and it was ignored/disappeared from the DB list. Renamed it
again to "All_taxa.nal", restarted and again the alias works.

Did you have the 'nal' extension added to your alias file residing on
your disk? Or are you just using it here for clarity -- to
distinguish alias files from the volumes and other files?

I expect the reason it wasn't being found was something to do with the
naming of the alias file. It most likely triggered the regular expression to
match it as a volume, rather than an alias.

Git says multipart database recognition was last touched by Ben in
commit 647041d9, which causes SS to ignore database files only if they
end in two digits, extension included. So SS won't ignore
'/home/yeban/sequences/All_taxa_08_27_12.nal', unless there is no
'nal' at the end. Here: http://rubular.com/r/TYtj0BJWyl.
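
To illustrate the "ends in two digits, extension included" behaviour described above, here is a small sketch. The pattern is a hypothetical stand-in written for this example; it is not copied from commit 647041d9 or from the rubular link, and SS itself is written in Ruby.

```python
import re

# Hypothetical stand-in for the volume check: "the path ends in two digits".
# This is NOT SequenceServer's actual regular expression.
VOLUME_SUFFIX = re.compile(r"\d{2}\Z")


def looks_like_volume(path):
    """Return True if the path would be treated as a multi-part volume."""
    return VOLUME_SUFFIX.search(path) is not None


for path in [
    "/home/yeban/sequences/All_taxa_08_27_12.nal",  # extension included
    "/home/yeban/sequences/All_taxa_08_27_12",      # extension stripped
    "/home/yeban/sequences/All_taxa",
    "/home/yeban/sequences/nr.00",
]:
    print(path, looks_like_volume(path))
```

Under this assumption, the date-stamped name is only mistaken for a volume when the check sees it without the `.nal` extension.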

Hi Anurag,

First, thanks for the notification regarding the point release. I plan to update at the beginning of next week.

Yes, the alias file ended with the .nal extension during all of my debugging with Mark last week.

Yes, the reason for the alias file is to search all DBs in the list using SS by just clicking a single radio button. The rationale is that I’ve set up SS as part of a collaborative project involving 12+ research groups and the number of DBs will continue to grow as more transcriptomes (used to build the DBs) are sequenced. Having a single option to “Select all” was something requested by the group, so I pursued the route of using an alias file. It is a feature that I see other groups being interested in if they want to search across all DBs in a long list. Those wanting to search a subset of DBs are content just clicking the radio buttons of interest (or I might create alias files for commonly used subsets in our case).

Cheers and thanks again,

Halocaridina

First, thanks for the notification regarding the point release. I plan to
update beginning of next week.

Can you run `gem list sequenceserver` and report the output? I want
to know the version number of SS installed on your system.

Yes, the reason for the alias file is to search all DBs in the list using SS
by just clicking a single radio button. The rationale is that I've set up SS
as part of a collaborative project involving 12+ research groups and the
number of DBs will continue to grow as more transcriptomes (used to build
the DBs) are sequenced. Having a single option to "Select all" was something
requested by the group, so I pursued the route of using an alias file. It is
a feature that I see other groups being interested in if they want to search
across all DBs in a long list. For those wanting to search a subset of DBs,
they are content just clicking the radio buttons of interest (or I might
create alias files for commonly used subsets in our case).

Hmm. I was trying to guess if a 'select all' button would be of help.
So you list databases independently too, so a different combination
could be used. Maybe SS could do something about grouping databases
together (issue #29).

[29]: https://github.com/yannickwurm/sequenceserver/issues/29

Typos courtesy of my iPad

I was actually going to suggest a database management area where you
can create/edit groups of databases. It could use the built-in
BLAST tool for creating aliases to aggregate DBs together, and then
you could list those groups on the main page. What do you think?
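
As a sketch of the "built-in tool" idea: BLAST+ ships `blastdb_aliastool`, which can write an alias file from a list of databases. The wrapper below only assembles the command line; the function name, the `/data/blast/...` paths and the title are placeholders, and nothing here reflects how SS would actually integrate it.

```python
import shlex


def aliastool_command(db_paths, out, title, dbtype="nucl"):
    """Build an argument list for BLAST+'s blastdb_aliastool."""
    return [
        "blastdb_aliastool",
        "-dblist", " ".join(db_paths),  # space-separated database basenames
        "-dbtype", dbtype,              # "nucl" or "prot"
        "-out", out,                    # basename of the alias file to write
        "-title", title,
    ]


# Placeholder paths, for illustration only.
cmd = aliastool_command(
    ["/data/blast/db1", "/data/blast/db2"],
    out="/data/blast/All_taxa",
    title="All taxa",
)
print(shlex.join(cmd))
# To actually create the alias, pass cmd to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell quoting problems when it is eventually handed to `subprocess.run`.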

Mark,

Hi Anurag and Mark,

Anurag, here is the info on SS version #:

[19:42:51] [halocaridina@deathstar ~]$ gem list sequenceserver

*** LOCAL GEMS ***

sequenceserver (0.8.0)

I literally updated to 0.8.0 a day or two before starting this thread.

Ok, I am out of ideas then as to why SS rejects All_taxa_08_27_12.nal :-/.

I would encourage the discussion between the SS developers regarding options
for DB grouping. I think most people would like this type of feature, as well
as not having the redundancy of concatenated FASTA files that take up extra space.

Yep, we are on it :). Though, I will refrain from estimating the
development time for now.