Bio::DB::Fasta(3pm) User Contributed Perl Documentation Bio::DB::Fasta(3pm)

NAME

Bio::DB::Fasta - Fast indexed access to fasta files

SYNOPSIS

  use Bio::DB::Fasta;
  # Create database from a directory of Fasta files
  my $db       = Bio::DB::Fasta->new('/path/to/fasta/files/');
  my @ids      = $db->get_all_primary_ids;
  # Simple access
  my $seqstr   = $db->seq('CHROMOSOME_I', 4_000_000 => 4_100_000);
  my $revseq   = $db->seq('CHROMOSOME_I', 4_100_000 => 4_000_000);
  my $length   = $db->length('CHROMOSOME_I');
  my $header   = $db->header('CHROMOSOME_I');
  my $alphabet = $db->alphabet('CHROMOSOME_I');
  # Access to sequence objects. See Bio::PrimarySeqI.
  my $seq     = $db->get_Seq_by_id('CHROMOSOME_I');
  $seqstr     = $seq->seq;     # reuses $seqstr declared above
  my $subseq  = $seq->subseq(4_000_000 => 4_100_000);
  my $trunc   = $seq->trunc(4_000_000 => 4_100_000);
  $length     = $seq->length;  # reuses $length declared above
  # Loop through sequence objects
  my $stream  = $db->get_PrimarySeq_stream;
  while (my $seq = $stream->next_seq) {
    # Bio::PrimarySeqI stuff
  }
  # Filehandle access
  my $fh = Bio::DB::Fasta->newFh('/path/to/fasta/files/');
  while (my $seq = <$fh>) {
    # Bio::PrimarySeqI stuff
  }
  # Tied hash access
  tie my %sequences, 'Bio::DB::Fasta', '/path/to/fasta/files/';
  print $sequences{'CHROMOSOME_I:1,20000'};

DESCRIPTION

Bio::DB::Fasta provides indexed access to a single Fasta file, several files, or a directory of files. It provides persistent random access to each sequence entry (either as a Bio::PrimarySeqI-compliant object or as a string), and to subsequences within each entry, so you can retrieve portions of very large sequences without bringing the entire sequence into memory. Bio::DB::Fasta is based on Bio::DB::IndexedBase; see that module's documentation for details.
The Fasta files may contain any combination of nucleotide and protein sequences; during indexing the module guesses the molecular type. Entries may have any line length up to 65,536 characters, and different line lengths are allowed in the same file. However, within a sequence entry, all lines must be the same length except for the last. An error will be thrown if this is not the case.
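A fixed line length within an entry is presumably what makes random access cheap: the byte position of any residue can be computed arithmetically instead of by scanning the file. The following is a minimal sketch of that arithmetic. The helper name byte_offset and its parameters are hypothetical and chosen for illustration; the code is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of the 1-based residue $pos in an entry.
#   $data_start - byte offset of the entry's first residue in the file
#   $line_len   - residues per full line of the entry
#   $term_len   - bytes in the line terminator (1 for "\n", 2 for "\r\n")
sub byte_offset {
    my ($data_start, $line_len, $term_len, $pos) = @_;
    my $n          = $pos - 1;              # 0-based residue index
    my $full_lines = int($n / $line_len);   # complete lines before the residue
    return $data_start
         + $full_lines * ($line_len + $term_len)
         + $n % $line_len;
}

# Demo on an in-memory entry: 10 residues per line, "\n" terminators.
my $fasta      = ">seq1 demo\nACGTACGTAC\nGTNACGTACG\nGG\n";
my $data_start = index($fasta, "\n") + 1;   # first byte after the header line

# Residue 13 is the third character of the second sequence line.
my $residue = substr($fasta, byte_offset($data_start, 10, 1, 13), 1);
print "Residue 13 is $residue\n";
```

If one line in the middle of the entry were shorter or longer than the others, every offset computed past that line would be wrong, which is consistent with the module refusing such files.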
The module uses /^>(\S+)/ to extract the primary ID of each sequence from the Fasta header. See -makeid in Bio::DB::IndexedBase to pass a callback routine to reversibly modify this primary ID, e.g. if you wish to extract a specific portion of the gi|gb|abc|xyz GenBank IDs.
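As an illustration of such a callback, the sketch below keeps only the accession field of a gi|number|db|accession| style header. The routine name make_my_id, the regular expressions, and the sample header are hypothetical, and the sketch assumes the callback is handed the full header line including the leading ">".

```perl
use strict;
use warnings;

# Hypothetical -makeid callback: takes the Fasta header line and returns
# the ID to index the entry under.
sub make_my_id {
    my ($header) = @_;
    # Keep the accession (fourth field) of "gi|number|db|accession|" headers.
    return $1 if $header =~ /^>?gi\|\d+\|\w+\|([^|\s]+)/;
    # Otherwise fall back to the first whitespace-delimited word.
    return $1 if $header =~ /^>?(\S+)/;
    return;
}

# Try the callback on made-up headers before indexing anything.
print make_my_id('>gi|12345|gb|AB000123.1| Example sequence'), "\n";
print make_my_id('>CHROMOSOME_I some description'), "\n";

# Passing the callback at indexing time (the path is a placeholder):
#   my $db = Bio::DB::Fasta->new('/path/to/fasta/files/',
#                                -makeid => \&make_my_id);
```

Because the IDs are fixed when the index is built, an existing index has to be rebuilt for a changed callback to take effect.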

DATABASE CREATION AND INDEXING

The object-oriented constructor is new(), the filehandle constructor is newFh(), and the tied hash constructor is tie(). All three can index a single Fasta file, several files, or a directory of files. See Bio::DB::IndexedBase.
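The following end-to-end sketch indexes a single file with new() and reads back a subsequence. The file name demo.fa and its contents are made up for the demonstration, and the script skips the indexing step if Bio::DB::Fasta is not installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Scratch directory holding one small Fasta file (made-up demo data).
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'demo.fa');
open my $out, '>', $file or die "Cannot write $file: $!";
print {$out} ">seq1 first demo entry\nACGTACGTAC\nGTNACGTACG\nGG\n";
print {$out} ">seq2 second demo entry\nTTTTGGGGCC\nCCAA\n";
close $out or die "Cannot close $file: $!";

my ($subseq, $len);
if (eval { require Bio::DB::Fasta; 1 }) {
    # Object-oriented constructor: builds the index on first use.
    my $db = Bio::DB::Fasta->new($file);
    $subseq = $db->seq('seq1', 11 => 14);   # residues 11 to 14 of seq1
    $len    = $db->length('seq2');          # total length of seq2
    print "seq1 11..14: $subseq\n";
    print "seq2 length: $len\n";
}
else {
    print "Bio::DB::Fasta is not installed; nothing was indexed\n";
}
```

The same path could be handed to newFh() or tie() instead, as shown in the SYNOPSIS.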

SEE ALSO

Bio::DB::IndexedBase
Bio::DB::Qual
Bio::PrimarySeqI

AUTHOR

Lincoln Stein <lstein@cshl.org>.
Copyright (c) 2001 Cold Spring Harbor Laboratory.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See DISCLAIMER.txt for disclaimers of warranty.

APPEN