XML as a Tool for Data Conversion
National Library of Medicine
Old MEDLINE Project
In 2000, ATLIS was awarded a contract by the National Library of Medicine (NLM) to electronically capture biomedical citation information that was available only in 8,000 printed pages of Index Medicus dating from the late 1950s. The data was provided as printed pages, and ATLIS was tasked with keying and structuring it so that it would ultimately reside as card-image records in an electronic repository built to NLM's specifications.
There were two types of information. The first was a listing of articles chronologically arranged by journal name. Each article had a supposedly unique identifier that was printed with the text of the article. The second was a multi-level subject index that referred to the articles by their ID numbers. Part of the task was to link the two types of information into one record, including the subject headings.
Two things made the work more complex: 1) the hard-copy manuscript was often damaged and portions were illegible; and 2) the identifiers had been applied by hand, so there was no guarantee of an accurate match.
Unconventional Data Conversion
It was clear to us from the beginning that we would need the data adequately fielded so that we could generate multiple QC and QA reports and process the data electronically for export into the card-image format. It was also clear that, given the source material, the initial capture of the data could not be clean. We recognized that it would require a fairly high level of manual cleanup, which would be performed most efficiently before import into a relational structure.
The approach we took to converting the data was a bit unconventional. We started by having the data keyed offshore. The data was entered into plain text files with a minimal set of tagging information to identify the elements.
ATLIS wrote two fairly simple DTDs that reflected the structure of each type of information. We also wrote a simple translation filter that converted the minimal tags into XML markup, and we ran it on each batch of electronic data as it was delivered. At that point we were able to use a standard text editor and XML parsing tools to validate the data.
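The translation step described above might look something like the following minimal sketch. The minimal tag set (`@id`, `@jn`, `@ti`, `@au`), the XML element names, and the sample records are all hypothetical, since the project's actual keying conventions and DTDs are not given here. The sketch only confirms that the output is well-formed; validating against a DTD would need a validating parser, which the Python standard library does not include.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from minimal keying tags to XML element names.
# These are illustrative only, not the project's real tag set.
TAG_MAP = {"@id": "id", "@jn": "journal", "@ti": "title", "@au": "author"}


def minimal_to_xml(text):
    """Convert minimally tagged lines into a <citations> XML document."""
    root = ET.Element("citations")
    current = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        tag, _, value = line.partition(" ")
        if tag not in TAG_MAP:
            raise ValueError(f"line {lineno}: unknown tag {tag!r}")
        if tag == "@id":
            # Assumption: every record begins with its identifier.
            current = ET.SubElement(root, "citation")
        elif current is None:
            raise ValueError(f"line {lineno}: field before first @id")
        ET.SubElement(current, TAG_MAP[tag]).text = value.strip()
    # Serialization escapes characters such as '&' automatically.
    return ET.tostring(root, encoding="unicode")


sample = """\
@id 1001
@jn J Hypothetical Med
@ti An example article title
@au Doe J
@id 1002
@jn J Hypothetical Med
@ti A second example & more
@au Roe R
"""

xml_text = minimal_to_xml(sample)
parsed = ET.fromstring(xml_text)  # raises ParseError if not well-formed
print(len(parsed.findall("citation")), "citations converted")
```

Rejecting unknown tags at this stage, rather than passing them through, means keying errors surface in the text file where they are cheapest to fix.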
Moving XML into a Relational Database
Once the data was in a standardized XML format, a program was developed to read the XML and import the data into the relational database. Because the program only had to deal with valid XML, the conversion required very little programming logic; XML-aware tools handled most of the burden.
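As an illustration of how little logic such an import needs once the XML is known to parse, here is a short sketch using Python's built-in `sqlite3` and `xml.etree.ElementTree` modules. The table layout, element names, and sample records are assumptions made for the example; the original database and schema are not described in this document.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, already-validated XML input.
XML_DOC = """<citations>
  <citation><id>1001</id><journal>J Hypothetical Med</journal><title>First title</title></citation>
  <citation><id>1002</id><journal>J Hypothetical Med</journal><title>Second title</title></citation>
</citations>"""


def import_citations(conn, xml_text):
    """Load every <citation> element into the citation table; return the row count."""
    # No uniqueness constraint on id: duplicates are loaded so that
    # later QA reports can find them, rather than failing the import.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation (id TEXT, journal TEXT, title TEXT)"
    )
    rows = [
        (c.findtext("id"), c.findtext("journal"), c.findtext("title"))
        for c in ET.fromstring(xml_text).iter("citation")
    ]
    conn.executemany("INSERT INTO citation VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
imported = import_citations(conn, XML_DOC)
print(imported, "rows imported")
```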
After the validated (that is, successfully parsed) data had been imported into the database, various validation programs were run on it to identify content anomalies and errors. Using simple text editors, we were able to quickly correct the XML-tagged input data. The corrected data was then re-imported, and the cycle was repeated until it passed all of our quality assurance procedures.
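Two content checks of the kind described above can be sketched as SQL queries: one finds identifiers that appear on more than one article, and one finds subject-index entries that point at an identifier with no matching article. Both follow from the hand-applied identifiers mentioned earlier, but the table names, columns, and sample rows are hypothetical, and the project's actual validation programs are not documented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citation (id TEXT, title TEXT);
CREATE TABLE subject_entry (heading TEXT, citation_id TEXT);
-- Illustrative data: id 1002 is duplicated, and 1009 has no article.
INSERT INTO citation VALUES ('1001', 'First'), ('1002', 'Second'), ('1002', 'Third');
INSERT INTO subject_entry VALUES ('HEART', '1001'), ('LUNG', '1009');
""")


def duplicate_ids(conn):
    """Return identifiers that occur on more than one citation."""
    query = "SELECT id FROM citation GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
    return [row[0] for row in conn.execute(query)]


def orphan_subject_entries(conn):
    """Return subject-index entries whose identifier matches no citation."""
    query = """
        SELECT s.heading, s.citation_id
        FROM subject_entry AS s
        LEFT JOIN citation AS c ON c.id = s.citation_id
        WHERE c.id IS NULL
        ORDER BY s.citation_id
    """
    return conn.execute(query).fetchall()


print("duplicate ids:", duplicate_ids(conn))
print("unmatched subject entries:", orphan_subject_entries(conn))
```

Reports like these list the records to fix in the XML source, after which the batch can be re-imported and checked again.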
The Final Product
After all the data had been validated, the XML data was discarded, since it could be recreated at will from the database. In other words, the conversion was now a two-way filter that allowed us to import and export at will. A second export routine, which wrote the format required by NLM, was developed to produce the final deliverable.
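The two export directions can be sketched as follows. The first function regenerates XML from the database; the second writes fixed-width, 80-column records in the spirit of a card-image format. The field positions in the card layout are purely hypothetical, as are the table and element names, because NLM's actual specification is not reproduced in this document.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citation (id TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO citation VALUES (?, ?)",
    [("1001", "First title"), ("1002", "Second title")],
)


def export_xml(conn):
    """Recreate the XML document from the database rows."""
    root = ET.Element("citations")
    for cid, title in conn.execute("SELECT id, title FROM citation ORDER BY id"):
        citation = ET.SubElement(root, "citation")
        ET.SubElement(citation, "id").text = cid
        ET.SubElement(citation, "title").text = title
    return ET.tostring(root, encoding="unicode")


def export_card_images(conn, width=80):
    """Write one fixed-width record per citation.

    Hypothetical layout: columns 1-10 hold the identifier and the
    remaining columns hold the title, truncated or padded to `width`.
    """
    cards = []
    for cid, title in conn.execute("SELECT id, title FROM citation ORDER BY id"):
        cards.append(f"{cid:<10}{title}"[:width].ljust(width))
    return cards


xml_out = export_xml(conn)
cards = export_card_images(conn)
print(xml_out)
for card in cards:
    print(repr(card))
```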



