The number of biological databases has expanded rapidly as an outcome of the human genome-sequencing project. These databases are created and updated as biologists discover new molecules. Most of them are either unstructured or semi-structured: the data are stored in flat files, which makes it difficult to retrieve a particular record in reasonable time. Computational biology tasks such as multiple sequence alignment, sequence similarity, motif finding, and structure prediction have attracted many researchers. In our view, computational biologists are often not interested in every field present in a database; rather, they are concerned with particular fields, depending on the problem being addressed. We have developed utilities to extract and index the UniRef100 database for fast sequential and indexed random access, to normalize the occurrences of pair, trio, and quad substrings of amino acids in the database, and to generate a programmatically mutated database for testing sequence similarity algorithms. This work should help new researchers in computational biology customize an existing database to their algorithmic needs and thereby accelerate their operations.
Index Terms—UniRef100 protein database, customized database, substring frequency, sequence similarity, structure prediction. (Year:2010)
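As a rough illustration of the two ideas summarized above (indexed random access into a flat file, and normalized substring frequencies), the following is a minimal sketch rather than the utilities described in this work. It assumes FASTA-formatted input, and it assumes that "normalizing" means dividing each substring count by the total number of substrings of that length; the function names `build_offset_index`, `read_sequence`, and `normalized_kmer_frequencies` are hypothetical.

```python
import io
from collections import Counter


def build_offset_index(handle):
    """Map each record ID to the byte offset of its header line.

    Assumes FASTA-style input opened in binary mode (an assumption,
    not a detail taken from the paper).
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def read_sequence(handle, offset):
    """Seek to a stored offset and return that record's sequence."""
    handle.seek(offset)
    handle.readline()  # skip the header line
    parts = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # next record starts here
        parts.append(line.strip().decode("ascii"))
    return "".join(parts)


def normalized_kmer_frequencies(sequences, k):
    """Count overlapping substrings of length k and divide by the total.

    k = 2, 3, 4 corresponds to pairs, trios, and quads of amino acids.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Tiny in-memory stand-in for a flat file (made-up records).
data = b">UniRef100_A example\nMKV\nLA\n>UniRef100_B\nMKV\n"
handle = io.BytesIO(data)
index = build_offset_index(handle)
sequences = [read_sequence(handle, off) for off in index.values()]
pair_freqs = normalized_kmer_frequencies(sequences, 2)
```

Storing only byte offsets keeps the index small compared with the database itself, so a single record can be fetched with one seek instead of a scan of the whole file.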