Multi-part database volume

Hi Mark and Anurag,

Thanks for the quick reply. Answers to your questions are:

  1. The .nal file is in the same directory as the DBs created using “sequenceserver format-databases”.

  2. Each DB has six files associated with it having extensions: .nhr .nin .nog .nsd .nsi .nsq

  3. For each DB in that directory, there is a symlink to the fasta file that was used to build the database and the symlink shares the same name as each DB. Thinking that having the symlinks and DBs sharing the same name in the same directory might be the problem, I moved the symlinks out of the directory, restarted Apache and revisited the SS page. Same “- ignoring it” message.

  4. The .nal file has the following structure:

DBLIST /home/data_processed/db_4_blast/Abarenicola_pacifica /home/data_processed/db_4_blast/Alciopa_spp [absolute paths to the other 43 DBs, each separated by a space]

  5. I’m using the basenames of each DB (i.e., no extensions) with an absolute path in the .nal file, which should be correct syntax as Anurag pointed out.
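
A quick way to rule out missing files behind the entries of an alias file is a check like the one below. This is a minimal Python sketch, not part of SS or BLAST: the function names are made up for illustration, the three required extensions are an assumption about the minimum a nucleotide DB needs, and resolving relative entries against the alias file's own directory is also an assumption. Quoted paths on the DBLIST line are not handled.

```python
from pathlib import Path

# Assumption: a nucleotide DB needs at least these three files to be usable.
REQUIRED_EXTENSIONS = (".nhr", ".nin", ".nsq")


def read_dblist(alias_path):
    """Return the database basenames named on the DBLIST line of an alias file."""
    for line in Path(alias_path).read_text().splitlines():
        fields = line.split()
        if fields and fields[0] == "DBLIST":
            return fields[1:]
    return []


def missing_files(alias_path):
    """Map each listed basename to the required extensions that are absent on disk."""
    alias_dir = Path(alias_path).parent
    problems = {}
    for name in read_dblist(alias_path):
        base = Path(name)
        if not base.is_absolute():
            # Assumption: relative entries live next to the alias file.
            base = alias_dir / base
        # Append the extension to the full name; basenames may contain dots.
        absent = [ext for ext in REQUIRED_EXTENSIONS
                  if not Path(str(base) + ext).exists()]
        if absent:
            problems[name] = absent
    return problems
```

Calling `missing_files("/path/to/alias.nal")` returns an empty dict when every listed DB has its three files, which would point the finger away from the DBs themselves.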

I think I have an idea of the problem now, but just to confirm, would you be able to either list the databases SS shows in the web interface or upload a screenshot? imgur.com is fine if you don’t mind it being public.

This probably has something to do with the way SS determines what is and isn’t part of a database set.

Hi Mark,

Sure, the list is:

Abarenicola_pacificaAlciopa_sppAncistrosyllis_groenlandicaAphroditida_japonica
Axiothella_rubrocincta
Boccardia_proboscidea
Chaetozona_spp
Chrysopetallid_colormorph1
Clymenella_torquata
Cossura_longicirrata
Delaya_leruthi
Enchytraeus_albidus
Galathowenia_oculata
Galeolaria_caespitosa
Glycera_dibranchiata
Glycinde_armigera
Goniada_brunnea
Halosydna_brevisetosa
Heteromastus_filiformis
Leitoscolopus_robustus
Lumbrineris_crassicephana
Magelona_beckleyi
Myxicola_infundibulum
Nainereis_laevigata
Nephtys_incisa
Nereis_succinea
Ninoe_nigrens
Paramphinome_jeffreysii
Pectinaria_gouldii
Phascolosoma_agassizii
Poeobius_meseres
Pomatoleios_kraussii
Sabella_pacifica
Scalibregma_inflata
Schizobranchia_insignis
Sparganophilus_spp
Sternapsis_scutata
Sthenalanella_uniformis
Syllis_cf_halina
Terebellides_stoemi
Themiste_pyroides
Thysanocardia_nigra
Tomopteris_spp

All 43 listed here are present in the SS web interface in the exact same order and spelling.

Glad we are getting to the source of the issue and hope this helps the project.

Cheers,

Halocaridina

Hi Mark,

Paste error. The three on the first line should have been on three separate lines.

As I suspected, this seems to be a problem with the way SS handles alias files that alias databases
that don’t match the NCBI format for multi-part databases (nr.00, nr.01 etc.). It is obviously an extension of the multi-part
bug, as SS should really be querying any alias file it finds and ignoring the individual databases it lists in favour of using the alias file.

This is why you see the ‘ignoring it’ log message - this is where SS is ignoring a db volume in favour of using the alias.
Unfortunately it doesn’t account for using aliases as a way of aggregating a large amount of individual databases!

So I’ll start working on this, but in the meantime, if you want to get around the bug for now and all you want to do is blast
all the databases you listed, you can use this alias file: http://pastebin.com/8J3FTa4J

It’s picked up on my local server at the top of the list, so if you want to blast all databases, just select “All Databases”.
It’s ugly, but it works (or should do) :slight_smile:

If you have any more problems, let me know.

Thanks,
Mark

Hi Mark,

Appreciate the effort. Unfortunately, the error still persists using the alias that you provided. I actually tried the same thing yesterday (removing the absolute paths) and got the “- ignoring it” message.

One interesting behavior I noticed yesterday, which started me down the path of trying to use an alias file and might shed some light on what’s going on.

The individual FASTA files that I’m building the DBs from have anywhere from 40K-200K entries. The DBs created from them using “sequenceserver format-databases” are picked up by SS and work fine from the web interface.

Yesterday morning, I concatenated all 45 FASTA files into a single FASTA file (headers have taxon-specific tags, so easy to track) and ran “sequenceserver format-databases” against that ~4.5GB file to generate a “mega” DB. Once created in the SS DB directory where all of the other DBs are, I restarted Apache and revisited the SS page and “mega” DB wasn’t listed while everything else was.

The error was exactly the same as with the alias file: “Found a multi-part database volume at /path/to/SS/db_directory - ignoring it”

So, two different means of generating the same error. Could it be something size related, either as a single DB or the total of multiple smaller DBs?

Cheers,

Halocaridina

This is very odd as any valid formatted db should be detected by SS, regardless of its size.

The error was exactly the same as with the alias file: “Found a multi-part database volume at /path/to/SS/db_directory - ignoring it”

That message is only printed when SS finds a multi-part database with a filename that matches the regex I created, so it must be finding a pre-formatted database in your db folder.

A copy of the full log from startup would be great, that way I can trace exactly what’s happening.

You may also want to try cloning the latest code from github if you haven’t already.

Hi Mark,

Thanks again for the help. I eventually got everything worked out and the alias file that you sent is now working.

I feel a little stupid since it came down to a simple renaming of the .nal file. Originally, I had just used the same .nal file name and pasted the alias information you provided into it. It dawned on me last night after going through this thread that might be the problem.

Specifically, I changed the filename from “All_taxa_08_27_12.nal” to “All_taxa.nal” and the alias was picked up and added to the SS list of available DBs (I also checked that it indeed worked by doing a BLAST). To verify that the old “All_taxa_08_27_12.nal” name was really the problem, I renamed the now working “All_taxa.nal” back to “All_taxa_08_27_12.nal”, restarted SS and it was ignored/disappeared from the DB list. I renamed it again to “All_taxa.nal”, restarted, and again the alias works.

Is this expected behavior?

Cheers, thanks again and apologies for wasting your time,

Halocaridina

This isn’t expected behaviour, but I’m glad you found the cause! It’s exposed a problem in the way SS handles aliases so I’m glad you let us know - no wasted time at all.

I expect the reason it wasn’t being found was something to do with the naming of the alias file. It most likely triggered the regular expression to match it as a volume, rather than an alias.

Once this happens, SS ignores the file completely. This explains why it wasn’t showing up on the web interface and you had that message in your log about ignoring a file, despite you not having any
pre-formatted volumes in your db folder.

Glad we could help and you’ve got it all moving now, and thanks for exposing a pretty annoying bug. I had a feeling this wouldn’t be a complete fix.

Let us know if you need any more help or run into any more problems :slight_smile:

Cheers,
Mark

Halocaridina and Mark,

Specifically, I changed the filename from "All_taxa_08_27_12.nal" to
"All_taxa.nal" and the alias was picked up and added to the SS list of
available DBs (I also checked that it indeed worked by doing a BLAST). To
verify that the old "All_taxa_08_27_12.nal" name was really the problem, I
renamed the now working "All_taxa.nal" back to "All_taxa_08_27_12.nal",
restarted SS and it was ignored/disappeared from the DB list. Renamed it
again to "All_taxa.nal", restarted and again the alias works.

Did you have the 'nal' extension added to your alias file residing on
your disk? Or are you just using it here for clarity -- to
distinguish alias files from the volumes and other files?

I expect the reason it wasn't being found was something to do with the
naming of the alias file. It most likely triggered the regular expression to
match it as a volume, rather than an alias.

Git says multipart database recognition was last touched by Ben in
commit 647041d9, which causes SS to ignore database files only if they
end in two digits, extension included. So SS won't ignore
'/home/yeban/sequences/All_taxa_08_27_12.nal', unless there is no
'nal' at the end. Here: http://rubular.com/r/TYtj0BJWyl.
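
To make the distinction concrete, here is a minimal Python sketch of an "ends in two digits" check. The pattern below is a hypothetical stand-in chosen for illustration; it is not copied from commit 647041d9, and SS itself is written in Ruby. It only shows why the same alias name can be treated differently depending on whether the extension is part of the string being tested.

```python
import re

# Hypothetical pattern: a path whose last component ends in two digits.
VOLUME_LIKE = re.compile(r".+/\S+\d{2}$")


def looks_like_volume(path):
    """Return True when the path would be mistaken for a multi-part volume."""
    return VOLUME_LIKE.match(path) is not None


for candidate in (
    "/home/yeban/sequences/nr.00",
    "/home/yeban/sequences/All_taxa_08_27_12.nal",
    "/home/yeban/sequences/All_taxa_08_27_12",
    "/home/yeban/sequences/All_taxa",
):
    print(candidate, looks_like_volume(candidate))
```

With the extension present the date-stamped name is not flagged; with the extension stripped it ends in "12" and is flagged just like "nr.00".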

Hi Anurag,

First, thanks for the notification regarding the point release. I plan to update beginning of next week.

Yes, the alias file ended with the .nal extension during all of my debugging with Mark last week.

Yes, the reason for the alias file is to search all DBs in the list using SS by just clicking a single radio button. The rationale is that I’ve set up SS as part of a collaborative project involving 12+ research groups and the number of DBs will continue to grow as more transcriptomes (used to build the DBs) are sequenced. Having a single option to “Select all” was something requested by the group, so I pursued the route of using an alias file. It is a feature that I see other groups being interested in if they want to search across all DBs in a long list. For those wanting to search a subset of DBs, they are content just clicking the radio buttons of interest (or I might create alias files for commonly used subsets in our case).

Cheers and thanks again,

Halocaridina

First, thanks for the notification regarding the point release. I plan to
update beginning of next week.

Can you run `gem list sequenceserver` and report the output? I want
to know the version number of SS installed on your system.

Yes, the reason for the alias file is to search all DBs in the list using SS
by just clicking a single radio button. The rationale is that I've set up SS
as part of a collaborative project involving 12+ research groups and the
number of DBs will continue to grow as more transcriptomes (used to build
the DBs) are sequenced. Having a single option to "Select all" was something
requested by the group, so I pursued the route of using an alias file. It is
a feature that I see other groups being interested in if they want to search
across all DBs in a long list. For those wanting to search a subset of DBs,
they are content just clicking the radio buttons of interest (or I might
create alias files for commonly used subsets in our case).

Hmm. I was trying to guess if a 'select all' button would be of help.
So you list databases independently too, so a different combination
could be used. Maybe SS could do something about grouping databases
together (issue #29).

[29]: https://github.com/yannickwurm/sequenceserver/issues/29

Typos courtesy of my iPad

I was actually going to suggest a database management area where you
can create/edit groups of databases. You could have it use the built-in
BLAST tool for creating aliases and aggregating DBs together, then
you could list those groups on the main page. What do you think?

Mark,

Hi Anurag and Mark,

Anurag, here is the info on SS version #:

[19:42:51] [halocaridina@deathstar ~]$ gem list sequenceserver

*** LOCAL GEMS ***

sequenceserver (0.8.0)

I literally updated to 0.8.0 a day or two before starting this thread.

Ok, I am out of ideas then as to why SS rejects All_taxa_08_27_12.nal :-/.

I would encourage the discussion between the SS developers regarding options
on DB grouping. I think most people would like this type of feature as well
as not having the redundancy of concatenated FASTA that take up extra space.

Yep, we are on it :). Though, I will refrain from estimating the
development time for now.

Hi Anurag,

I’ll keep an eye on my system and “play” around with filenames in the meantime to see if I can help shine some light on this.

Thanks, totally understandable regarding development cycles. Now that I know the “trick” for naming alias files, it will be nothing but a little combination of grep/sed/awk to take care of populating them for my production purposes.

Cheers,

Halocaridina
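
For readers who would rather not reach for grep/sed/awk, populating such an alias file could look roughly like this in Python. It is a sketch under stated assumptions: `build_alias` is a made-up helper, DBs are detected by their `.nin` file, entries are written as bare basenames because the alias lives in the same directory as the DBs, and the guard against names ending in two digits is only a precaution based on the renaming experiment reported in this thread.

```python
import re
from pathlib import Path


def build_alias(db_dir, alias_name, title="All databases"):
    """Write <alias_name>.nal in db_dir listing every nucleotide DB found there."""
    # Precaution from this thread: a name ending in two digits was ignored by SS.
    if re.search(r"\d{2}$", alias_name):
        raise ValueError("alias name should not end in two digits: " + alias_name)

    db_dir = Path(db_dir)
    # One basename per DB: strip the ".nin" extension from each index file.
    basenames = sorted(p.name[:-len(".nin")] for p in db_dir.glob("*.nin"))

    alias_path = db_dir / (alias_name + ".nal")
    alias_path.write_text(
        "TITLE {}\nDBLIST {}\n".format(title, " ".join(basenames))
    )
    return alias_path
```

Re-running `build_alias("/path/to/SS/db_directory", "All_taxa")` after adding new DBs regenerates the list, so the alias never drifts out of date.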

Good Day,

Thanks as well for developing and maintaining SequenceServer. I believe I have a similar problem. I am trying to work with the MD5nr database, which also uses an alias file:

`

$ cat md5nr.pal

Hi Bert,

I wonder what stage seqserv is being slow on. What happens if you run a query (and so get the spinning wheel), and then look at the command line where seqserv is running from? Also, if you use the command “top”, what process is taking the most CPU?

ben

Hi Ben,

thanks for the quick reply.

Serverside output:

`

D, [2013-08-29T09:13:12.854989 #10647] DEBUG -- : method: blastp
D, [2013-08-29T09:13:12.855219 #10647] DEBUG -- : sequence: MTGTTGATAWR
D, [2013-08-29T09:13:12.855494 #10647] DEBUG -- : database: ["69d7ff233621b78e5ef844130befbae9"]
D, [2013-08-29T09:13:12.855640 #10647] DEBUG -- : advanced:

`

top:

`
top - 09:15:03 up 1 day, 16:50, 3 users, load average: 0,77, 0,31, 0,15
Tasks: 144 total, 2 running, 142 sleeping, 0 stopped, 0 zombie
%Cpu(s): 20,9 us, 7,7 sy, 0,0 ni, 26,0 id, 45,1 wa, 0,0 hi, 0,3 si, 0,0 st
KiB Mem: 1017684 total, 950224 used, 67460 free, 31272 buffers
KiB Swap: 2074620 total, 370312 used, 1704308 free, 602548 cached

D, [2013-08-29T09:14:38.228354 #10707] DEBUG -- : sequence: MTGTTGATAWR
10742 jfk 20 0 2301m 308m 306m R 50,4 31,1 0:12.15 blastp
23 root 20 0 0 0 0 S 9,3 0,0 0:54.39 kswapd0
9096 root 20 0 0 0 0 S 0,3 0,0 0:01.67 kworker/0:2
10737 jfk 20 0 23300 1644 1136 R 0,3 0,2 0:00.12 top

`

The runtime is so high because the process does not seem to have stopped since my trials yesterday, despite my killing it.

WOW…it suddenly worked with a short query. The output also indicates that the full database is read. Thank you and sorry for bothering you…I simply need more powerful hardware than some old machine I had lying around…
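
The top output above already hints at that conclusion: about 1 GB of RAM, swap in use, kswapd0 busy, and 45,1 wa on the CPU line. A minimal Python sketch for pulling those numbers out of a `%Cpu(s)` line follows. The function names and the 30% threshold are arbitrary choices for illustration, and the parser accepts both comma and dot decimals because the output above uses commas.

```python
import re


def parse_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into {label: percent}."""
    _, _, rest = line.partition(":")
    # Each field looks like '45,1 wa' or '45.1 wa'.
    fields = re.findall(r"(\d+[.,]\d+)\s+(\w+)", rest)
    return {label: float(value.replace(",", ".")) for value, label in fields}


def io_bound(stats, threshold=30.0):
    """Flag a machine that spends at least `threshold` percent waiting on I/O."""
    # Assumption: 30% is an arbitrary cut-off, not a documented limit.
    return stats.get("wa", 0.0) >= threshold


line = "%Cpu(s): 20,9 us, 7,7 sy, 0,0 ni, 26,0 id, 45,1 wa, 0,0 hi, 0,3 si, 0,0 st"
stats = parse_cpu_line(line)
print(stats["wa"], io_bound(stats))
```

A high `wa` value alongside swap activity usually means the process is waiting on disk rather than computing, which fits a large database being read on a machine with little memory.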