NCBIx-BigFetch version 0.0.1 This module is useful for downloading very large result sets of sequences from NCBI given a text query. It uses YAML to create a configuration file to maintain project state in case network or server issues interrupts execution, in which case it may be easily restarted after the last batch. Working scripts are included in the script directory: fetch-all.pp fetch-missing.pp fetch-unavailable.pp The recommended workflow is: 1. Copy the scripts and edit them for a specific project. Use a new number as the project ID. 2. Begin downloading by running fetch-all.pp, which will first submit a query and save the resulting WebEnv key in a project specific configuration file (using YAML). 3. The next morning, kill the fetch-all.pp process and run fetch-missing.pp until it completes. 4. Restart fetch-all.pp. 5. If you wish to re-download "not available" sequences, you may run fetch-unavailable.pp. However, they will be downloaded at the end of fetch-all.pp if it completes normally. If your query result set is so large that your WebEnv times out, simply start a new project with that last index of the previous project, and it will pick up the result set from there (with a new WebEnv). Warning: You may lose a (very) few sequences if your download extends across multiple projects. However, our testing shows that the results generated with the same query within a few days of each other are largely in the same order. Note: This module was used to download 11,550,000 fasta-formatted protein sequences over the course of eight days. INSTALLATION To install this module, run the following commands: perl Makefile.PL make make test make install DEPENDENCIES Class::Std Class::Std::Utils LWP::Simple YAML Time::HiRes Bio::SeqIO COPYRIGHT AND LICENSE Copyright (C) 2009, Roger A Hall This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.