MOTIVATION: The recognition and normalization of cell line names in text is an important task in biomedical
text mining research, facilitating, for instance, the identification of synthetically lethal genes
from the literature. While several tools have previously been developed to address cell line recognition,
it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage
applications such as extracting synthetically lethal genes from the cancer literature. In
this study, we revisit the cell line name recognition task, evaluating both available systems and
newly introduced methods on various resources to obtain a reliable tagger not tied to any specific
subdomain. In support of this task, we introduce two text collections manually annotated for cell
line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
RESULTS: We find that the best performance is achieved using NERsuite, a machine learning system
based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary
of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98%
on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated
articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755
unique cell line database identifiers.