SequenceServer on large sequences gives high memory usage?

Hi there
I was wondering if anyone has seen, or has remedies for, a scenario I have run into a couple of times now that ends up causing huge memory usage in SequenceServer.

Today I checked the server and saw this in the logs

9403 me 20 0 88.0g 69g 2636 S 99.9 55.1 104:46.10 ruby

22527 me 18 0 30.6g 29g 14m D 64.7 23.5 257:01.42 blastn

That's a combined ~100GB of memory… I was forced to kill the job, as the server can't really handle that, and I think it was caused by just one bad query.

Has anyone seen this happen?

I have a 30GB file in my tmp files from this event that looks like it might have caused this:

Blast4-archive ::= {
  request {
    ident "2.2.31+",
    body queue-search {
      program "blastn",
      service "plain",
      queries bioseq-set {
        seq-set {
          seq {
            id {
              local str "Query_1"
            },
            descr {
              user {
                type str "CFastaReader",
                data {
                  label str "DefLine",
                  data str ">"
                }
              }
            },
            inst {
              repr raw,
              mol na,
              length 63994,
              seq-data ncbi4na '2228224444428221841228221424484284444121212188


The digits stop after a bit, but there is just a lot more text after that… 30GB worth, evidently. Has anyone seen anything like this?

I am running SequenceServer 1.0.4 under Apache.


That's BLAST's archive file. I was under the impression that we were generating archive files in binary format; I will look into that later.

Since BLAST doesn't store query data anywhere, a huge archive file would imply too many hits. From casual observation, BLAST doesn't properly stream output when generating XML (whether from an archive file or directly): if you watch the file the output is written to, it stays empty almost until BLAST finishes executing, and memory usage keeps rising until then. This is especially noticeable with a sizeable query against, say, the NR db.

There are other cases where BLAST should stream and doesn't, such as when using the -outfmt param with blastdbcmd, but that's unlikely to be the issue here.

Another likely contributor is SequenceServer's XML parsing, but that happens only after BLAST has finished generating the XML, so the two don't add to memory usage at the same time. XML parsing is done in memory; I had assumed XML files would be no bigger than 2GB.
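For what it's worth, the XML could in principle be parsed as a stream rather than loaded whole. Here is a minimal sketch using Ruby's stdlib REXML stream parser (just an illustration of the idea, not SequenceServer's actual parser) that counts Hit elements without building a document tree:

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Stream listener that counts <Hit> elements as they are parsed,
# without holding the whole document in memory.
class HitCounter
  include REXML::StreamListener
  attr_reader :hits

  def initialize
    @hits = 0
  end

  def tag_start(name, _attrs)
    @hits += 1 if name == 'Hit'
  end
end

# Tiny stand-in for a BLAST XML report (real reports can be GBs).
xml = <<~XML
  <BlastOutput>
    <Hit><Hit_num>1</Hit_num></Hit>
    <Hit><Hit_num>2</Hit_num></Hit>
  </BlastOutput>
XML

listener = HitCounter.new
REXML::Parsers::StreamParser.new(xml, listener).parse
puts listener.hits
```

With a stream listener, memory stays roughly constant regardless of report size, at the cost of having to accumulate only the fields you need as the tags go by.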

I have seen extreme memory usage only with the NR database. Are you hosting NR? Did this particular query contain too many sequences? (For a single sequence, BLAST defaults to a maximum of 500 hits.)

Could you check whether you have globally unique sequence IDs across your databases? I have previously seen duplicate IDs cause malformed XML files; maybe they could contribute to output not being flushed as well?
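One quick way to check is to scan the defline IDs across all the FASTA files your databases were built from and report any that repeat. A sketch (the "first word after >" rule for extracting IDs is an assumption; adjust it to match your defline format):

```ruby
require 'tempfile'

# Collect the first word of every defline (">id ...") from the given
# FASTA files and return IDs that occur more than once.
def duplicate_ids(fasta_paths)
  seen = Hash.new(0)
  fasta_paths.each do |path|
    File.foreach(path) do |line|
      next unless line.start_with?('>')
      id = line[1..].split.first
      seen[id] += 1 if id
    end
  end
  seen.select { |_, count| count > 1 }.keys
end

# Demo with two tiny FASTA files that share an ID.
a = Tempfile.new('a.fa'); a.write(">seq1\nACGT\n>seq2\nACGT\n"); a.flush
b = Tempfile.new('b.fa'); b.write(">seq2\nTTTT\n>seq3\nGGGG\n"); b.flush

dupes = duplicate_ids([a.path, b.path])
puts dupes.inspect
```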

— Priyam

I guess this was created by BLASTing against a reference genome. I have always been a little wary of performing this type of search in the first place: the user inputs a 64kb sequence against a reference genome of 2+GB, and with the default of 500 hits it apparently also creates a huge number of HSPs on top of that. Maybe I need to tune the parameters used to build the reference genome database. Do you have any tips for that?

I think I have also noticed the output formatter using lots of resources, and I agree that it would be great if it could stream.

Thanks for the tips


Hmm. Too many HSPs could be a reason.

While creating a BLAST db, only the masking-related options could be relevant, and I think they are not. However, when running BLAST, the -max_hsps option can be used to limit the number of HSPs. I don't know if that's a good idea, though. You could also try limiting the number of hits returned with -max_target_seqs, or with an evalue cutoff of 1e-3 or so. This will be easier with the next release of SS, which will allow setting default params per algorithm and will show the params that were used for BLAST on the results page; the latter aspect is the more important one.
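For illustration, capping the output could look like this when building the blastn command line. A sketch only: the flag values are examples rather than recommendations, and this is not SequenceServer's actual code.

```ruby
# Build a blastn invocation that caps hits and HSPs.
# The specific values below are illustrative, not recommendations.
def blastn_command(query, db, out)
  [
    'blastn',
    '-query', query,
    '-db', db,
    '-out', out,
    '-outfmt', '11',            # BLAST archive (ASN.1) format
    '-max_target_seqs', '100',  # fewer hits than the default 500
    '-max_hsps', '10',          # cap HSPs per subject sequence
    '-evalue', '1e-3'           # stricter than the default of 10
  ].join(' ')
end

cmd = blastn_command('query.fa', 'ref_genome', 'result.asn')
puts cmd
```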

Till then, you can implement this scheme by modifying blast.rb and adding a banner at the top of the results page saying that the number of hits is limited to such and such.

— Priyam