The Intelligent Systems Engineering Laboratory (ISEL) University of Ulster

Project Proposals for William Flynn Scholarship: Index page

Project Number: 5
Project Title: Database Design in Bioinformatics
Project Supervisor: Mr John Mc Gregor plus another member of staff

The development of fully automated DNA sequencing technologies has resulted in an information deluge in the field of molecular biology. The so-called 'sequence structure deficit' - the exponentially increasing availability of sequence data, which far outstrips the resultant information relating to actual three-dimensional structures - represents a very significant problem (arguably getting worse) which can only be addressed using ever more sophisticated computing technology. This general field has become known as Bioinformatics.

The formulation of hypotheses as to (for example) protein structure depends on ready access to sequence data presented in a useful way. Two distinct analytical approaches are now common. The first uses pattern recognition techniques to detect similarities between sequences and thus to deduce related structure and function; the second attempts to predict 3D structure directly from the linear structure and so infer function.
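As a rough illustration of the first approach, the sketch below scores the similarity of two sequences by the overlap of their short subsequences (k-mers). This is a deliberately simple stand-in for the pattern recognition techniques mentioned above, not a method prescribed by this project; the function names, the choice of k = 3, and the toy sequences are all assumptions made for the example.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # both sequences are shorter than k: nothing to compare
    return len(ka & kb) / len(ka | kb)


# Hypothetical toy sequences, not real data.
query = "ACGTACGTGA"
candidates = {
    "seq_a": "ACGTACGTGA",  # identical to the query
    "seq_b": "ACGTTCGTGA",  # one substitution
    "seq_c": "TTTTGGGGCC",  # unrelated
}

# Rank the candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: kmer_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

A database that must answer such similarity queries at scale would normally precompute and index features of this kind rather than compare raw sequences pair by pair, which is one way the analytical approach feeds back into database design.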

Both of these approaches, and others, require sensible data organisation. As more is discovered about sequence data and the genetic information it encodes, the underlying database design has to be reworked accordingly. In addition, as geneticists and other researchers refine the nature of their interactions with the data, this too has implications for database design.

As well as databases of sequence data, there have evolved secondary databases containing metadata. These have arisen because multiple alignments have been found to contain regions that vary little between the constituent sequences. These regions constitute identifying motifs which have some specific biological function and which can be classified. The structures of these databases have evolved in markedly different directions, posing yet further computational challenges.
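The idea of a conserved region can be sketched in a few lines: given a multiple alignment, score each column by how many sequences agree on the most common residue, then report runs of highly conserved columns as candidate motifs. This is a minimal sketch under stated assumptions; the function names, the threshold and minimum length, and the toy alignment are hypothetical and do not reflect how any particular secondary database derives its entries.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps never count as agreement
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(alignment))
    return scores


def conserved_regions(
    alignment: list[str], threshold: float = 1.0, min_length: int = 3
) -> list[tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of conserved runs."""
    scores = column_conservation(alignment)
    regions, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new conserved run begins here
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Hypothetical toy alignment ('-' marks a gap), not real data.
alignment = [
    "MKTGHEWLA",
    "MRTGHEWIA",
    "MKSGHEWLA",
    "M-TGHEWLG",
]
regions = conserved_regions(alignment)
motifs = [alignment[0][start:end] for start, end in regions]
print(regions, motifs)
```

A secondary database then has to decide how to store such regions (as exact patterns, position-specific scores, or sets of aligned fragments), which is one reason their structures have diverged.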

The ever-increasing quantities of data, and the wide variety of data enquiry systems and algorithmic techniques, have inevitably dictated that access to and delivery of information be carried out over the Internet. Web technology, distributed database systems, object orientation, and intelligent interfaces represent just some of the areas to be looked at, linked by the common theme of enhancing understanding of genetic sequence data.

Hand in hand with database issues will come new algorithms and methods for computational biology, especially those aimed at addressing the efficiency, scalability, and cost issues associated with high-performance computing. Areas such as sequence analysis, structure and function prediction, neural information theory, whole genome analysis, pharmacogenomics, expression microarrays, and large structure and in-vivo imaging will benefit as the appropriate database design becomes better understood.

This project will examine the database structures used for these large datasets and attempt (based on analysis and experiment) to propose some sensible future developments.

[1] Attwood, TK, Parry-Smith, DJ, Introduction to Bioinformatics, Prentice Hall, 1999.
[2] Schulze-Kremer, S, Molecular Bioinformatics: Algorithms and Applications, Walter de Gruyter, 1996.

If you are interested in being considered for a studentship, please contact the Group Director, Professor T.M. McGinnity, by email: tm.mcginnity@ulst.ac.uk or telephone: +44-(0)28-71375417.

See the current research section of this website for details of research projects pursued by existing PhD students.