Abstract:
Data analysis requires data to be of a high quality. Unfortunately this is not always the
case, especially when data is extracted from different data sources. In the case where
there is no unique identifier to match data records from multiple data sources, alternative
methods need to be developed to match the records. Record linkage attempts to do this
primarily with deterministic and probabilistic approaches. Deterministic models require
certain corresponding fields in a record pair to be identical before the pair is declared
a match. Probabilistic methods use a set of equations called the Fellegi-Sunter
formulae to calculate decision-making weights, which are used to score a record pair
on how well they match. If the matching score is above a certain threshold, the record
pair is considered to be a match. This project investigates whether the development of a
learning algorithm that refines the weights will improve the probabilistic model's matching
accuracy. The dataset that was used to train and test the record linkage models was a set
of 92650 record pairs, some of which were matches and some of which were non-matches. It
was found that a learning algorithm did improve the matching accuracy of the probabilistic
model, although it is likely that increasing the number of input features would improve
the matching performance even further.
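
As an illustration of the probabilistic scoring described above, the following is a minimal sketch of the standard Fellegi-Sunter weight calculation, not the project's own implementation. It assumes that each field has an m-probability (the chance that the field agrees given a true match) and a u-probability (the chance that it agrees given a non-match). The function names, field names, and probability values are hypothetical and chosen only for this example.

```python
import math


def field_weights(m, u):
    """Return the (agreement, disagreement) weights for one field.

    m: assumed probability that the field agrees when the pair is a true match.
    u: assumed probability that the field agrees when the pair is a non-match.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def match_score(agreements, m_probs, u_probs):
    """Sum the per-field weights for one record pair.

    agreements maps each field name to True if the two records agree on it.
    """
    score = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(m_probs[field], u_probs[field])
        score += agree_w if agrees else disagree_w
    return score


def is_match(score, threshold):
    """Declare a match when the score is above the chosen threshold."""
    return score > threshold


# Hypothetical probabilities for two illustrative fields.
m_probs = {"surname": 0.9, "date_of_birth": 0.8}
u_probs = {"surname": 0.1, "date_of_birth": 0.2}

# A record pair that agrees on surname but disagrees on date of birth.
pair = {"surname": True, "date_of_birth": False}
score = match_score(pair, m_probs, u_probs)
print(score, is_match(score, threshold=0.0))
```

A learning algorithm of the kind investigated here would adjust these weights from labelled record pairs rather than leaving them fixed at their initial estimates.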