python - Entrez epost + elink returns results out of order with Biopython
I ran into this today and wanted to toss it out there. It appears that, using the Biopython interface to Entrez at NCBI, it's not possible to get results (at least from elink) back in the correct (same as input) order. Please see the code below for an example. I have thousands of GIs for which I need taxonomy information, and querying them individually is painfully slow due to NCBI restrictions.
    from Bio import Entrez

    Entrez.email = "my@email.com"

    ids = ["148908191", "297793721", "48525513", "507118461"]

    # Batch query: post all IDs at once, then link protein -> taxonomy
    search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]

    print(Entrez.read(Entrez.elink(webenv=webenv,
                                   query_key=query_key,
                                   dbfrom="protein",
                                   db="taxonomy")))

    print("-------")

    # Individual queries: one epost + elink round trip per ID
    for i in ids:
        search_results = Entrez.read(Entrez.epost("protein", id=i))
        webenv = search_results["WebEnv"]
        query_key = search_results["QueryKey"]

        print(Entrez.read(Entrez.elink(webenv=webenv,
                                       query_key=query_key,
                                       dbfrom="protein",
                                       db="taxonomy")))
Results:

    [{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}, {u'Id': '81972'}, {u'Id': '32630'}, {u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191', '297793721', '48525513', '507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
    -------
    [{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191'], u'LinkSetDbHistory': [], u'ERROR': []}]
    [{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '81972'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['297793721'], u'LinkSetDbHistory': [], u'ERROR': []}]
    [{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['48525513'], u'LinkSetDbHistory': [], u'ERROR': []}]
    [{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '32630'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
The elink documentation (http://www.ncbi.nlm.nih.gov/books/nbk25499/) at NCBI says this should be possible by passing multiple 'id=' parameters, but that doesn't appear to be possible with the Biopython epost interface. Has anyone else seen this, or am I missing something obvious?

Thanks!
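
For reference, the "multiple 'id='" form mentioned in the question means, at the URL level, repeating the id parameter once per GI instead of sending one comma-joined value. The following is only a minimal sketch that builds such a query string with the standard library and makes no network request. The helper name `elink_query` is hypothetical, and whether a given Biopython version produces this form when handed a list of IDs is an assumption to check against its documentation.

```python
from urllib.parse import urlencode


def elink_query(ids, dbfrom="protein", db="taxonomy"):
    """Build an elink query string with one id= parameter per input ID.

    Hypothetical helper for illustration; it does not contact NCBI.
    """
    # doseq=True expands the list into repeated id=... pairs instead of
    # encoding the whole Python list as a single value.
    return urlencode({"dbfrom": dbfrom, "db": db, "id": list(ids)}, doseq=True)


ids = ["148908191", "297793721", "48525513", "507118461"]
query = elink_query(ids)
print(query)
```

The comma-joined value used with epost in the question is the other form, where all IDs travel in a single id parameter.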
    from Bio import Entrez

    Entrez.email = "my@email.com"

    ids = ["148908191", "297793721", "48525513", "507118461"]

    search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))

    # Fetch the posted records as GenPept XML
    xml = Entrez.efetch("protein",
                        query_key=search_results["QueryKey"],
                        webenv=search_results["WebEnv"],
                        rettype="gp",
                        retmode="xml")

    for record in Entrez.read(xml):
        # Print the GI of the record (strip the leading "gi|")
        print([x[3:] for x in record["GBSeq_other-seqids"] if x.startswith("gi")])

        # Print every db_xref qualifier of the first feature (e.g. "taxon:3332")
        gb_quals = record["GBSeq_feature-table"][0]["GBFeature_quals"]
        for qualifier in gb_quals:
            if qualifier["GBQualifier_name"] == "db_xref":
                print(qualifier["GBQualifier_value"])

        # Or with a list comprehension
        # print([q["GBQualifier_value"] for q in
        #        record["GBSeq_feature-table"][0]["GBFeature_quals"] if
        #        q["GBQualifier_name"] == "db_xref"])

    xml.close()
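I would efetch the query, and then parse the dict-like XML after reading it with Entrez.read(). Things turn messy here, and you have to dive into the XML dict/list structure. I guess there's a nicer way than mine to extract the "GBFeature_quals" entries whose "GBQualifier_name" is "db_xref"... but it works (for now). Output: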
    ['148908191']
    taxon:3332
    ['297793721']
    taxon:81972
    ['48525513']
    taxon:211604
    ['507118461']
    taxon:32630
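
One possibly tidier way to pull out the db_xref qualifier is to wrap the dictionary digging in small helpers and then look the results up by GI, which also hands them back in input order regardless of the order the records arrive in. This is a minimal sketch, assuming the parsed records use the same key layout as the efetch code in this answer. The helper names are hypothetical, and the sample records are hand-built stand-ins that mirror the GI/taxon pairs printed above, not real efetch results.

```python
def gi_of(record):
    """Return the GI number of a parsed GenPept record, or None if absent."""
    for seqid in record.get("GBSeq_other-seqids", []):
        if seqid.startswith("gi|"):
            return seqid[3:]
    return None


def taxon_of(record):
    """Return the taxonomy ID from the first feature's db_xref, or None."""
    features = record.get("GBSeq_feature-table", [])
    if not features:
        return None
    for qual in features[0].get("GBFeature_quals", []):
        value = qual.get("GBQualifier_value", "")
        if qual.get("GBQualifier_name") == "db_xref" and value.startswith("taxon:"):
            return value[len("taxon:"):]
    return None


def taxa_in_input_order(ids, records):
    """Pair each input GI with its taxonomy ID, following the order of ids."""
    by_gi = {gi_of(r): taxon_of(r) for r in records}
    return [(gi, by_gi.get(gi)) for gi in ids]


def fake_record(gi, taxon):
    """Hand-built stand-in for one parsed record (illustration only)."""
    return {
        "GBSeq_other-seqids": ["gi|" + gi],
        "GBSeq_feature-table": [
            {"GBFeature_quals": [
                {"GBQualifier_name": "db_xref",
                 "GBQualifier_value": "taxon:" + taxon},
            ]},
        ],
    }


ids = ["148908191", "297793721", "48525513", "507118461"]
# Deliberately shuffled to show that the lookup restores input order.
records = [
    fake_record("48525513", "211604"),
    fake_record("507118461", "32630"),
    fake_record("148908191", "3332"),
    fake_record("297793721", "81972"),
]

pairs = taxa_in_input_order(ids, records)
for gi, taxon in pairs:
    print(gi, taxon)
```

With real data, `records` would be the list returned by Entrez.read(xml); a GI missing from the response simply maps to None rather than raising a KeyError.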