SequenceServer on large sequences gives high memory usage?

Hi there
I was wondering if anyone has seen, or has remedies for, a scenario I have now run into a couple of times that ends up causing huge memory usage in SequenceServer.

Today I checked the server and saw this in top:

```
 9403 me  20   0 88.0g  69g 2636 S 99.9 55.1 104:46.10 ruby
22527 me  18   0 30.6g  29g  14m D 64.7 23.5 257:01.42 blastn
```

That’s a combined 100 GB of memory… I was forced to kill the job, as the server can’t really handle that, and I think it was all from one bad query.

Has anyone seen this happen?

There is a 30 GB file from this event in my tmp files that looks like it might be the culprit:

```
Blast4-archive ::= {
  request {
    ident "2.2.31+",
    body queue-search {
      program "blastn",
      service "plain",
      queries bioseq-set {
        seq-set {
          seq {
            id {
              local str "Query_1"
            },
            descr {
              user {
                type str "CFastaReader",
                data {
                  {
                    label str "DefLine",
                    data str ">"
                  }
                }
              }
            },
            inst {
              repr raw,
              mol na,
              length 63994,
              seq-data ncbi4na '2228224444428221841228221424484284444121212188
184848488421441421448421418448418841212284482822848214414428822828228848142148
888828444412288282842188218412828814481228228214228284144414118448448844418144
441224818228212842124111444124482828814448842882241884824288144222488282844121
888448481142824281142221214414144282848484188214244844442821421214214442214811
124148814221424482142148481848421428441282141211214144848181112221444221444114
112222228844844842811244112841212282128212142182142218148828242218441412821424
241228221242214228414181411428414828428412122121212141842144114184111148884848
248284848822842182818482848418482282141288122882812182112284444844448222118411
882221844141411281841181184482841484844888214282888414212212241444424882282228
821281188412144282814814484214814288284822888444844888812442141814144121211128
111444114288881288811111211882281288418182282142888288482114281841422122141148
881812414412114828882112414141421221812182811441181812141181882228418411811141
142218848411212248288248488814221822282284128482821844112111212812288814882184
111288148418888111821288848828118421842184281482128221482184821422828881221212
814441284814222188144282222848221841118888221442114118128441424448842218822282
282214442184418228484828288121828428818844814484448822881821281424221828842418
```

The numbers stop after a while, but there is just a lot more text after that… 30 GB worth, evidently. Has anyone seen anything like this?

I am running SequenceServer 1.0.4 behind Apache.

-Colin

That’s BLAST’s archive file. I was under the impression that we were generating archive files in binary format; I will look into that later.

Since BLAST doesn’t store query data anywhere else, a huge archive file would imply too many hits. From casual observation, BLAST doesn’t properly stream output when generating XML (whether from an archive file or directly): if you watch the file the output is written to, it stays almost empty until BLAST is nearly done executing, and memory usage keeps rising until then. This is especially noticeable with a sizeable query against, say, the NR database.
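If you still have that 30 GB archive file around, you can observe this yourself by re-formatting it to XML and watching the output file grow. This is only a rough sketch; the paths below are placeholders:

```
# Regenerate XML (outfmt 5) from the leftover archive file; paths are placeholders.
blast_formatter -archive /path/to/tmp/job.asn -outfmt 5 -out /tmp/job.xml &

# Meanwhile, watch the output file and the formatter's memory: the XML stays
# near-empty while resident memory keeps climbing, and the output only
# appears towards the end of the run.
watch -n 10 'ls -lh /tmp/job.xml; ps -o rss= -C blast_formatter'
```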

There are other cases where BLAST should stream but doesn’t, like when using the -outfmt param with blastdbcmd, but that’s unlikely to be the issue here.

The other likely contributor to memory usage is SequenceServer’s XML parsing. But that happens only after BLAST is done generating the XML, so the two don’t contribute to memory usage at the same time. The XML is parsed entirely in memory; I had assumed XML files would be no bigger than 2 GB.

I have seen extreme memory usage only with the NR database. Are you hosting NR? Did this particular query contain too many sequences (for a single query sequence BLAST defaults to a maximum of 500 hits)?

Could you check whether your sequence ids are globally unique across the databases? I have previously seen duplicate ids cause malformed XML files. Maybe that could contribute to output not being flushed as well?
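Something like this should reveal any duplicates (just a quick sketch, with placeholder database names):

```
# Dump every sequence id from each database (names below are placeholders)
# and print any id that occurs more than once.
for db in genome_db transcripts_db; do
  blastdbcmd -db "$db" -entry all -outfmt "%i"
done | sort | uniq -d
```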

— Priyam

I guess this was created when BLASTing against a reference genome. I have always been a little wary of performing this type of search in the first place: the user inputs a 64 kb sequence against a reference genome of 2+ GB, and while it is apparently capped at 500 hits, it creates a huge number of HSPs on top of that. Maybe I have to tune the parameters used for creating the reference genome database… do you have any tips for that?

I think I have also noticed the output formatter using a lot of resources, and I agree that it would be great if it could stream.

Thanks for the tips

-Colin

Hmm. Too many HSPs could be a reason.

When creating a BLAST database, only the masking-related options could be relevant here, and I don’t think they are. When running BLAST, however, the -max_hsps option can be used to limit the number of HSPs saved per hit, though I don’t know if that’s a good idea. You could also try limiting the number of hits returned with -max_target_seqs, or with an -evalue of 1e-3 or so. This will be easier with the next release of SequenceServer, which will allow setting default params per algorithm and will show the params that were used for BLAST on the results page; the latter is the more important part.
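For example, something along these lines (an untested sketch; the query, database name, and cut-offs are placeholders to tune for your setup):

```
# Cap the number of hits and HSPs and require a modest e-value; outfmt 11 is
# the archive format you found in tmp. All values here are illustrative.
blastn -query query.fa -db reference_genome \
       -max_target_seqs 100 -max_hsps 10 -evalue 1e-3 \
       -outfmt 11 -out result.asn
```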

Till then, you can certainly implement this scheme by modifying blast.rb and adding a banner at the top of the results page saying that the number of hits is limited to such and such.

— Priyam