The pbcore.io package provides a number of lightweight interfaces to PacBio data files and other standard bioinformatics file formats. Preferred usage is to import classes directly from the pbcore.io package, e.g.:
>>> from pbcore.io import CmpH5Reader
The classes within pbcore.io adhere to a few conventions, in order to provide a uniform API:
Each data file type is treated as a container of records of a single Record type; all Reader classes support streaming access, and CmpH5Reader and BasH5Reader additionally provide random access to alignments/reads.
The constructor argument needed to instantiate Reader and Writer objects can be either a filename (which can be suffixed by ".gz" for all but the h5 file types) or an open file handle. In either case, the reader/writer classes handle the argument transparently.
The reader/writer classes all support the context manager idiom. That is, if you write:
>>> with CmpH5Reader("aligned_reads.cmp.h5") as r:
...     print(r[0].read())
the CmpH5Reader object will be automatically closed after the block within the “with” statement is executed.
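The three conventions above (container of records, filename-or-handle constructor, context manager support) can be illustrated with a small stand-alone sketch. ToyLineReader is a hypothetical class written for this illustration and is not part of pbcore; it treats each non-empty line of a text file as one record and only approximates how a real Reader might behave.

```python
import gzip
import io


class ToyLineReader(object):
    """Hypothetical reader (not part of pbcore): one record per non-empty line."""

    def __init__(self, filename_or_file):
        if isinstance(filename_or_file, str):
            # A filename: open it, decompressing on the fly if it ends in ".gz".
            if filename_or_file.endswith(".gz"):
                self._file = gzip.open(filename_or_file, "rt")
            else:
                self._file = open(filename_or_file, "r")
        else:
            # Anything else is assumed to be an already-open file handle.
            self._file = filename_or_file
        self._records = None  # loaded lazily, only if random access is used

    def __iter__(self):
        # Streaming access: yield records one at a time.
        for line in self._file:
            line = line.rstrip("\n")
            if line:
                yield line

    def __getitem__(self, index):
        # Random access: this toy version simply loads every record on first use.
        if self._records is None:
            self._file.seek(0)
            self._records = list(iter(self))
        return self._records[index]

    def close(self):
        self._file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised inside the block


# Usage: an in-memory handle stands in for a real data file.
handle = io.StringIO("read_a\nread_b\nread_c\n")
with ToyLineReader(handle) as reader:
    streamed = list(reader)   # streaming access
    first = reader[0]         # random access
is_closed = handle.closed     # the handle is closed once the block exits
```

Loading every record into memory for indexing is a simplification; the real random-access readers rely on the structure of their file formats instead.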
If you have an application that uses the CmpH5Reader and you want to start using BAM files, your best bet is to use the following generic factory functions:
Note
Since BAM files contain a subset of the information that was present in cmp.h5 files, you will need to provide these functions with an indexed FASTA file for your reference. For full compatibility, you need the openIndexedAlignmentFile function, which requires the existence of a bam.pbi file (PacBio BAM index companion file).
The bas.h5/bax.h5 file formats are container formats for PacBio reads, built on top of the HDF5 standard. Originally there was just one bas.h5 file, but when “multistreaming” was introduced the file was split into three bax.h5 parts and one bas.h5 file containing pointers to the parts. Use BasH5Reader to read any kind of bas.h5 file, and BaxH5Reader to read a bax.h5 file.
Note
In contrast to GFF, for example, the bas.h5 read coordinate system is 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.
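To make the difference between the two conventions concrete, here is a minimal sketch of converting between GFF-style coordinates (1-based, both ends inclusive) and the 0-based, start-inclusive/end-exclusive convention described in the note. The function names are hypothetical helpers for illustration and are not part of pbcore.

```python
def gff_to_half_open(start, end):
    """Convert 1-based inclusive (GFF) coordinates to 0-based half-open."""
    return start - 1, end


def half_open_to_gff(start, end):
    """Convert 0-based half-open coordinates to 1-based inclusive (GFF)."""
    return start + 1, end


sequence = "ACGTACGTAC"

# GFF-style interval covering the 3rd through 6th bases.
gff_start, gff_end = 3, 6
start, end = gff_to_half_open(gff_start, gff_end)

# Half-open coordinates can be used directly as a Python slice,
# and the interval length is simply end - start.
subsequence = sequence[start:end]
length = end - start
round_trip = half_open_to_gff(start, end)
```

Only the start coordinate shifts; the end value is numerically the same in both systems because an inclusive 1-based end and an exclusive 0-based end point at the same boundary.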
The BAM format is a standard format for describing aligned and unaligned reads. PacBio is transitioning from the cmp.h5 format to the BAM format. For basic functionality, one should use BamReader; for full compatibility with the CmpH5Reader API (including alignment index functionality) one should use IndexedBamReader, which requires the auxiliary PacBio BAM index file (bam.pbi file).
The cmp.h5 file format is an alignment format built on top of the HDF5 standard. It is a simple container format for PacBio alignment records.
Note
In contrast to GFF, for example, all cmp.h5 coordinate systems (reference, read) are 0-based and start-inclusive/end-exclusive, i.e. the same convention as Python and the C++ STL.
FASTA is a standard format for sequence data. We recommend using the FastaTable class, which provides random access to indexed FASTA files (using the conventional SAMtools “fai” index).
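As background on how an "fai" index enables random access, the following sketch uses the five standard index columns (name, length, byte offset, bases per line, bytes per line) to seek directly to a subsequence. It is an illustration of the index arithmetic only, not the FastaTable implementation; parse_fai and fetch are hypothetical helper names, and the FASTA data is a tiny in-memory example.

```python
import io
from collections import namedtuple

FaiEntry = namedtuple("FaiEntry", "name length offset linebases linewidth")


def parse_fai(fai_text):
    """Parse the text of an fai index into a dict keyed by sequence name."""
    entries = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name, length, offset, linebases, linewidth = line.split("\t")[:5]
        entries[name] = FaiEntry(name, int(length), int(offset),
                                 int(linebases), int(linewidth))
    return entries


def fetch(fasta_handle, entry, start, end):
    """Return bases [start, end) of one sequence (0-based, end-exclusive).

    fasta_handle must be opened in binary mode so offsets are byte offsets.
    """
    if not 0 <= start <= end <= entry.length:
        raise ValueError("interval out of range")

    def byte_position(i):
        # Whole lines before base i, plus the position within its line.
        return (entry.offset
                + (i // entry.linebases) * entry.linewidth
                + i % entry.linebases)

    fasta_handle.seek(byte_position(start))
    raw = fasta_handle.read(byte_position(end) - byte_position(start))
    # Drop the line terminators that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Tiny example: two sequences wrapped at 4 bases per line.
fasta = io.BytesIO(b">chr1\nACGT\nTTGG\nCC\n>chr2\nGGGA\n")
index = parse_fai("chr1\t10\t6\t4\t5\nchr2\t4\t25\t4\t5\n")

middle_of_chr1 = fetch(fasta, index["chr1"], 2, 9)
all_of_chr2 = fetch(fasta, index["chr2"], 0, 4)
```

Because the position of any base follows from the index alone, a lookup never needs to scan the sequences that precede it in the file.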
FASTQ is a standard format for sequence data with associated quality scores.
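For readers unfamiliar with the layout of a FASTQ record, here is a minimal parsing sketch. It assumes the common four-lines-per-record form (no line-wrapped sequences) and Phred+33 quality encoding; read_fastq is a hypothetical function for illustration and does not represent the pbcore reader.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ.

    Qualities are decoded assuming Phred+33 encoding (an assumption of this
    sketch).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record")
        name = header[1:].rstrip("\n")
        yield name, sequence, [ord(c) - 33 for c in quality]


# Usage with an in-memory record.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```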
The GFF format is an open and flexible standard for representing genomic features.