Abstract:
This work is a significant extension of earlier research that was conducted in fulfilment of the requirements of an MSc degree in Computer Science at the University of Pretoria. The MSc research verified the hypothesis:
Finite automata can detect microsatellites effectively in deoxyribonucleic acid (DNA).
The purpose of this thesis is to extend the above hypothesis to minisatellites and satellites in particular with regards to accuracy. Microsatellites, minisatellites and satellites are subsets of tandem repeats (TRs). TRs refer to consecutive motifs contained by genomic sequences. A perfect TR is defined as a string of nucleotides which is consecutively repeated at least twice. An approximate TR is a string of nucleotides repeated at least twice, allowing for some differences between the instances.
Recent articles dealing with TR detection report that current implemented TR detector algorithms are not exhaustive. Consequently TRs from several TR detectors should be combined to ensure reliable TR annotation.
In order to verify the hypothesis Counting Finite Automata can detect minisatellites and satellites accurately in DNA four theoretically related algorithms, collectively referred to as FireSat, are proposed. Of these FireSat1, FireSat2 and FireSat3 have been implemented. There is reported on the latter whenever FireSat is compared against rivalry software. The underlying principles of FireSat2' have been implemented illustrating that it would detect the exact same TRs as FireSat3.
Although FireSat has default values, the user has several parameters that can be used to fine-tune her search. The objective being to allow the user to do an exhaustive search within the statistical constraints of her problem domain.
During the endeavour to develop an algorithm, using finite automata for minisatellite and satellite detection in DNA, new types of automata were discovered (coined counting automata) that have the following properties:
• They can be used for tandem repeat detection to the same degree of accuracy as conventional finite automata.
• The languages they define can be classified in the Chomsky hierarchy.
• They can be implemented more efficiently than other finite automata, in terms of both space and time.
Furthermore Counting Prototype Finite Automata (CPFAs) are defined. CPFAs are cascaded to determine a Levenshtein based distance between two genetic strings with less states than the traditional Levenshtein Distance Automaton. Within CPFAs deterministic state reduction and simplification enable the separation of mutation types. Respective CPFAs cater for small substrings of lengths between 1 and 4. Memory usage is decreased in such a manner that a field-programmable gate array implementation becomes viable.