UNIVERSITY OF HERTFORDSHIRE
COMPUTER SCIENCE RESEARCH COLLOQUIUM

"DISCRIMINATING CODING, NON-CODING AND REGULATORY REGIONS IN GENOMES USING RESCALED RANGE AND DETRENDED FLUCTUATION ANALYSIS"

Dr. Rene te Boekhorst
School of Computer Science
University of Hertfordshire

7 December 2005 (Wednesday)
Lecture Theatre E350
Hatfield, College Lane Campus
3 - 4 pm

Coffee/tea and biscuits will be available. [Catering Permitting]
Everyone is Welcome to Attend [Space Permitting]

Abstract:

In this talk I will explain some of the methods we use to characterize and differentiate two important constituents of the genome: regulatory regions and exons. Exons are those sequences of nucleotides (nucleotides are the building blocks of the DNA molecule) that are involved in linking up amino acids, which in turn are the fundamental components of proteins. Together with pieces of DNA that separate exons from each other ('introns'), exons form genes.

Proteins play various roles in the life of organisms. They may have structural functions (such as keratin, the building material of nails and hair) or act as regulators. Apart from their (auto-)catalytic role as enzymes in metabolism and development, a very important function of some regulatory proteins is the control of gene expression. A particular class of these expression-regulating proteins, the so-called transcription factors, currently attracts much research interest. By binding to certain regions of the genome (the transcription factor binding sites), these proteins determine which genes are activated or inhibited and at what time. Transcription factor binding sites are clustered in so-called regulatory regions, and much of current bioinformatics research is devoted to the development of tools that recognize these regions.

The algorithms applied to identifying the position and function of genomic elements can broadly be categorized as supervised and non-supervised methods.
Supervised methods are based on a training set of data (nucleotide sequences of known functional elements, such as exons or regulatory regions) from which machine learning methods are to abstract recognizable patterns. Non-supervised techniques start without such knowledge, but use, for instance, the over- (or under-) representation of certain sequences to infer their possible function.

The methods we have been working on are a little bit of both. We used sequences of known regulatory regions, exons and non-regulatory regions (not exons) to search for statistical properties that discriminate between these three types of genomic elements. The properties I will focus on are entropy and sequential 'persistence' (long-range dependency in the sequence of nucleotides), and I will show statistical evidence for the differentiation these measurements allow for. If time permits, I will also discuss how these statistics could be used in a sliding window technique (in which the size of the window adapts to the local structure of the DNA) to localize exons and binding sites.

This is joint work of the speaker with Irina Abnizova (MRC-BSU Cambridge) and Chrystopher Nehaniv (Computer Science, University of Hertfordshire).

--
Hertfordshire Computer Science Research Colloquium
http://homepages.feis.herts.ac.uk/~nehaniv/colloq