In this study, the regular expressions forming abstract states in Brzozowski’s algorithm are not remapped to sequential state transition table addresses as would be the case in the classical approach, but are hashed to integers. Two regular expressions that are hashed to the same hash code are assigned the same integer address in the state transition table, reducing the number of states in the automaton. This reduction does not necessarily lead to the construction of a minimal automaton: no restrictions are placed on the hash function hashing two regular expressions to the same code. Depending on the quality of the hash function, a super-automaton, previously referred to as an approximate automaton, or an exact automaton can be constructed. When two regular expressions are hashed to the same state, and they do not represent the same regular language, a super-automaton is constructed. A super-automaton accepts the regular language of the input regular expression, in addition to some extra strings. If the hash function is bad, many regular expressions that do not represent the same regular language will be hashed together, resulting in a smaller automaton that accepts extra strings. In the ideal case, two regular expressions will only be hashed together when they represent the same regular language. In this case, an exact minimal automaton will be constructed. It is shown that, using the hashing approach, an exact or super-automaton is always constructed. Another outcome of the hashing approach is that a non-deterministic automaton may be constructed. A new version of the hashing version of Brzozowski’s algorithm is put forward which constructs a deterministic automaton. A method is also put forward for measuring the difference between an exact and a super-automaton: this takes the form of the k-equivalence measure: the k-equivalence measure measures the number of characters up to which the strings of two regular expressions are equal. The better the hash function, the higher the value of k, up to the point where the hash function results in regular expressions being hashed together if and only if they have the same regular language. Using the k-equivalence measure, eight generated hash functions and one hand coded hash function are evaluated for a large number of short regular expressions, which are generated using G¨odel numbers. The k-equivalence concept is extended to the average k-equivalence value in order to evaluate the hash functions for longer regular expressions. The hand coded hash function is found to produce good results. Copyright
Dissertation (MEng)--University of Pretoria, 2010.