$Id: BIODATABASE-ACCESS-HOWTO.txt 2590 2003-03-17 20:46:24Z kdj $

INTRODUCTION

Importing sequences with annotations is a central part of most
bioinformatics tasks.  BioJava supports importing sequences from
indexed flat-files, local relational databases and remote (internet)
databases. Previously, separate programming syntax was required for
each of these types of data access. In addition, if one wanted to
change one's mode of sequence-data acquisition (for example, by
implementing a local relational database version of Genbank when
previously the data had been stored in an indexed flat-file) one had
to rewrite all of the data-access subroutines in one's application
code.

The Open Biological Database Access (OBDA) System was designed so that
one could use the same application code to access data from all three
of the database types by simply changing a few lines in a
"configuration file". This makes application code more portable and
easier to maintain. This document shows how to set up the required
OBDA-registry configuration file and how to access data from the
databases referred to in the configuration file using the BioJava API
as well as from the command line.

Note: accessing data via the OBDA system is optional in BioJava. It is
still possible to access sequence data via the API in the
org.biojava.bio.seq.db package, including direct access to the binary
flatfile indexing system used by EMBOSS (that is, you can instantiate
BioJava SequenceDB objects using EMBOSS indices).

USING THE OBDA BIODIRECTORY REGISTRY SYSTEM TO ACCESS SEQUENCE DATABASES

The OBDA BioDirectory Registry is a platform-independent system for
specifying how BioJava programs find sequence databases. It uses a
site-wide configuration file, known as the registry, which defines one
or more databases and the methods used to access them.

For instance, you might start out by accessing sequence data over the
web, and later decide to install a locally mirrored copy of Genbank.
By changing one line in the registry file, all
Bio{Perl,Java,Python,Ruby} applications will start using the mirrored
local copy automatically - no source code changes are necessary.

INSTALLING THE REGISTRY FILE

The registry file should be named seqdatabase.ini.  By default, it
should be installed in one of the following locations:

   $HOME/.bioinformatics/seqdatabase.ini
   /etc/bioinformatics/seqdatabase.ini

The Bio{Perl,Java,Python,Ruby} registry-handling code will initialize
itself from the registry file located in the home directory first,
followed by the system-wide default in /etc.

If no local registry file can be found, the registry-handling code
will take its configuration from the file located at this URL:

   http://www.open-bio.org/registry/seqdatabase.ini
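
The default lookup order described above can be sketched as a simple
ordered list of candidates. This is only an illustration: the class
name DefaultRegistryLocations and its methods are hypothetical and are
not part of the BioJava API, and the sketch only lists the candidates
rather than opening them.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (an assumption for illustration, not BioJava API). */
final class DefaultRegistryLocations {

    /** Fallback URL quoted in this document. */
    static final String FALLBACK_URL =
            "http://www.open-bio.org/registry/seqdatabase.ini";

    /** Returns the default candidates in the order they are consulted. */
    static List<String> candidates(String homeDir) {
        return Arrays.asList(
                homeDir + "/.bioinformatics/seqdatabase.ini", // per-user file
                "/etc/bioinformatics/seqdatabase.ini",        // system-wide file
                FALLBACK_URL);                                // remote default
    }

    public static void main(String[] args) {
        for (String candidate : candidates(System.getProperty("user.home"))) {
            System.out.println(candidate);
        }
    }
}
```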

MODIFYING THE SEARCH PATH

The registry file search path can be modified by setting the
environment variable OBDA_SEARCH_PATH. This variable is a "+"
delimited string of files and URLs, for example:

 OBDA_SEARCH_PATH=/home/lstein/seqdatabase.ini+http://foo.org/seqdatabase.ini

The search order proceeds from left to right. The first file or URL
that is found ends the search.
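
As a rough sketch of the rule above, the following code splits a
"+"-delimited search path and returns the first entry that is
available. The class ObdaSearchPath and its methods are hypothetical
names chosen for this example; they are not the BioJava
implementation. The availability check is passed in as a predicate,
and the main method assumes, for simplicity, that only local files are
probed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper (an assumption for illustration, not BioJava API). */
final class ObdaSearchPath {

    /** Splits a "+"-delimited search path into entries, left to right. */
    static List<String> split(String searchPath) {
        List<String> entries = new ArrayList<>();
        if (searchPath == null) {
            return entries;
        }
        for (String entry : searchPath.split("\\+")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    /** Returns the first available entry, or null if none is available. */
    static String firstAvailable(String searchPath, Predicate<String> isAvailable) {
        for (String entry : split(searchPath)) {
            if (isAvailable.test(entry)) {
                return entry; // the first hit ends the search
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String path = System.getenv("OBDA_SEARCH_PATH");
        // Simplification: URL entries are not fetched here, so they never match.
        String found = firstAvailable(path, e -> new File(e).isFile());
        System.out.println(found == null ? "no registry file found" : found);
    }
}
```

Passing the check as a predicate keeps the left-to-right rule separate
from the question of how a file or URL is actually probed.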

Warning! The search path selects an entire file (seqdatabase.ini),
not a single entry (e.g. 'genbank'). Any settings you want to keep
from the (old) default configuration file must therefore be copied
into your new configuration file.

For example, say you have been using biofetch with the default
configuration file http://www.open-bio.org/registry/seqdatabase.ini
for all your sequence-data retrieval.  If you now install a local copy
of genbank, your local seqdatabase.ini must not only have a "stanza"
indicating that 'genbank' is local but it must have stanzas
configuring the web access for all the other databases you use, since 
http://www.open-bio.org/registry/seqdatabase.ini will no longer be
found in a registry-file search.

============================================================================

FORMAT OF THE REGISTRY FILE

The registry file is a simple text file, as shown in the following
example:

----------------- example starts --------------
 VERSION=1.00

 [embl]
 protocol=biofetch
 location=http://www.ebi.ac.uk/cgi-bin/dbfetch
 dbname=embl

 [swissprot]
 protocol=biofetch
 location=http://www.ebi.ac.uk/cgi-bin/dbfetch
 dbname=swall
------------------ example ends ---------------

The first line is the registry format version number in the format
VERSION=X.XX.  The current version is 1.00.

The remainder of the file consists of stanzas in the following format:

  [database-name]
  tag=value
  tag=value

  [database-name]
  tag=value
  tag=value

Each stanza starts with a symbolic database service name enclosed in
square brackets.  Service names are case insensitive.  The remainder
of the stanza is a series of tag=value pairs that configure access to
the service.

Database-name stanzas can be repeated, in which case the client should
try each service in turn from top to bottom.
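
One way a client might handle repeated stanzas is sketched below. It
assumes that "try each service in turn" means moving on to the next
stanza when the current one fails; the class StanzaFallback is a
hypothetical name and each Supplier merely stands in for one
configured service.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper (an assumption for illustration, not BioJava API). */
final class StanzaFallback {

    /**
     * Calls each accessor in list order (top to bottom in the registry
     * file) and returns the first result obtained without an exception.
     */
    static <T> T firstSuccessful(List<Supplier<T>> accessors) {
        RuntimeException last = null;
        for (Supplier<T> accessor : accessors) {
            try {
                return accessor.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure, then try the next stanza
            }
        }
        throw new IllegalStateException("no stanza could be used", last);
    }

    public static void main(String[] args) {
        List<Supplier<String>> accessors = new ArrayList<>();
        accessors.add(() -> { throw new IllegalStateException("first stanza unreachable"); });
        accessors.add(() -> "result from second stanza");
        System.out.println(firstSuccessful(accessors));
    }
}
```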

Each stanza must contain the following two mandatory tag=value
lines:

  protocol=<protocol-type>
  location=<location-string>
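
To make the format concrete, here is a minimal parser sketch for the
rules described in this section: a leading VERSION line, bracketed
service names that are compared case-insensitively, tag=value pairs,
repeated stanzas kept in file order, and the two mandatory tags. The
class RegistrySketch is hypothetical and BioJava's own registry parser
may well behave differently, for example in how it treats malformed
lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical parser sketch (an assumption, not the BioJava parser). */
final class RegistrySketch {

    /** One stanza: a lower-cased service name plus its tag=value pairs. */
    static final class Stanza {
        final String name;
        final Map<String, String> tags = new LinkedHashMap<>();
        Stanza(String name) { this.name = name; }
    }

    /** Parses registry text into stanzas, preserving top-to-bottom order. */
    static List<Stanza> parse(String text) {
        List<Stanza> stanzas = new ArrayList<>();
        Stanza current = null;
        boolean versionSeen = false;
        for (String raw : text.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!versionSeen) {
                if (!line.startsWith("VERSION=")) {
                    throw new IllegalArgumentException("expected VERSION=X.XX, got: " + line);
                }
                versionSeen = true;
            } else if (line.startsWith("[") && line.endsWith("]")) {
                requireTags(current); // the previous stanza is now complete
                String name = line.substring(1, line.length() - 1).trim();
                current = new Stanza(name.toLowerCase(Locale.ROOT));
                stanzas.add(current);
            } else {
                int eq = line.indexOf('=');
                if (current == null || eq < 1) {
                    throw new IllegalArgumentException("unexpected line: " + line);
                }
                current.tags.put(line.substring(0, eq).trim(),
                                 line.substring(eq + 1).trim());
            }
        }
        requireTags(current);
        return stanzas;
    }

    /** Checks the two mandatory tags, protocol and location. */
    private static void requireTags(Stanza stanza) {
        if (stanza == null) {
            return;
        }
        for (String tag : new String[] {"protocol", "location"}) {
            if (!stanza.tags.containsKey(tag)) {
                throw new IllegalArgumentException("[" + stanza.name + "] lacks " + tag);
            }
        }
    }

    public static void main(String[] args) {
        String text = "VERSION=1.00\n[embl]\nprotocol=biofetch\n"
                + "location=http://www.ebi.ac.uk/cgi-bin/dbfetch\ndbname=embl\n";
        for (Stanza s : parse(text)) {
            System.out.println(s.name + " " + s.tags);
        }
    }
}
```

Stanzas are returned as a list rather than a map so that repeated
service names keep their order in the file.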

The Protocol Tag
----------------

The protocol tag specifies what access mode to use.  Currently it can be
one of:

  flat
  biofetch
  biosql

"flat" is used to fetch sequences from local flat files that have been
indexed using binary search indexing.

"biofetch" is used to fetch sequences from web-based databases. Due to
restrictions on the use of these databases, this is recommended only
for lightweight applications.

"biosql" fetches sequences from BioSQL databases. To use this protocol
you will need to set up an SQL database using the API in the
org.biojava.bio.seq.db.biosql package.

The Location Tag
----------------

The location tag tells the BioJava sequence fetching code where the
database is located. Its interpretation depends on the protocol
chosen. For example, it might be a directory on the local file system,
or a remote URL.

Other Tags
----------

Any number of additional tag values are allowed. The number and nature
of these tags depends on the access protocol selected. Some protocols
require no additional tags, whereas others will require several.

 Protocol       Tag        Description
 --------       ---        -----------

 flat           location   Directory in which the index is stored.
                           The "config.dat" file generated during
                           indexing must be found in this location.

                dbname     Name of database.

 biofetch       location   Base URL for the web service.  Currently
                           the only compatible biofetch service is
                           http://www.ebi.ac.uk/cgi-bin/dbfetch

                dbname     Name of the database.  Currently recognized
                           values are "embl" (sequence and protein),
                           "swall" (SwissProt + TREMBL), and "refseq"
                           (NCBI RefSeq entries).

 biosql         location   <host:port>
                dbname     <database_name>
                driver     mysql|postgres|oracle|sybase|sqlserver|access
                           |csv|informix|odbc|rdb
                user       <username>
                passwd     <passwd>
                biodbname  <biodatabase name>


============================================================================

INSTALLING LOCAL DATABASES

If you are using the biofetch protocol, you're all set. You can start
reading sequences immediately. For the flat and biosql protocols, you
will need to create and initialize local databases. See the following
documentation on how to do this:

   flat protocol:    See FLAT-DATABASES-HOWTO.txt (in the docs/howto subdirectory)
   biosql protocol:  See BIOSQL-HOWTO.txt (this doc is still being developed)

============================================================================


WRITING CODE TO USE THE REGISTRY

Once you've set up the OBDA registry file, accessing sequence data
from within a Java program is simple. The following example shows how;
note that nowhere in the program do you explicitly specify whether the
data is stored in a flat file, a local relational database or a
database on the internet.

To use the registry from a Java program, use the following idiom:

    1 import org.biojava.bio.seq.Sequence;
    2 import org.biojava.bio.seq.db.SequenceDBLite;
    3 import org.biojava.bio.seq.io.SeqIOTools;
    4 import org.biojava.directory.Registry;
    5 Registry registry = Registry.instance();
    6 SequenceDBLite db = registry.getDatabase("embl");
    7 Sequence seq = db.getSequence("J02231");
    8 SeqIOTools.writeFasta(System.out, seq);

In lines 1 to 4, we import the classes used by the example. In line 5,
we obtain a reference to the singleton Registry object. In line 6, we
ask the registry to return a database accessor for the symbolic data
source "embl", which must be defined in an [embl] stanza in the
seqdatabase.ini registry file.

The returned accessor is a SequenceDBLite object (see the appropriate
JavaDoc page), which has amongst its methods:

   db.getSequence(id);

This method returns a Sequence object by searching for its primary
ID, as in line 7. In line 8, we call the static writeFasta method of
the SeqIOTools utility class to print out the DNA or protein sequence.

============================================================================

USING BIOGETSEQ TO ACCESS REGISTRY SEQUENCES FROM THE COMMAND LINE

As a convenience, the BioJava distribution includes a program
'BioGetSeq' that enables one to have OBDA access to sequence data from
the command line.

The program 'BioGetSeq' is located in the apps directory of the
BioJava distribution. Move it into your path, or add it to your
path, to run it.

You can display its help text by running it with no arguments:

Usage: org.biojava.app.BioGetSeq --dbname embl --format embl \
        --namespace id [ id ... ]*

       dbname defaults to embl
       format defaults to embl
       namespace defaults to 'id' ['id' being the only supported namespace]
       rest of the arguments is a list of ids in the given namespace

If you have a set of ids you want to fetch from the EMBL database, you
just give them as space-separated parameters:

  % java org.biojava.app.BioGetSeq J02231 A21530 A10516

The output is directed to standard out, so it can be redirected to a
file. The options can be given in long "double hyphen" format or
abbreviated to one letter format:

  % java org.biojava.app.BioGetSeq -f fasta --namespace id J02231 \
     A21530 A10516 > file.seq


---------------------------------------------------------------------------- 

ChangeLog

$Log$
Revision 1.1  2003/03/14 14:58:30  kdj
Imported from BioPerl, appropriate modifications

