
XML as a Tool for Data Transferral

National Library of Medicine
MEDLARS Project



ATLIS has been a contractor to the National Library of Medicine (NLM) since the mid-1970s. The contract tasks entail the electronic capture of citation and abstract information from current printed biomedical literature. In the most recent contract negotiations, the deliverable output format changed from a proprietary card-image format to XML-encoded instances. The data is ultimately housed in an Oracle database, but NLM requires that each journal or monograph be delivered as a separate, valid XML-encoded file.

ATLIS developed a methodology similar to that used in the Old MEDLINE project for the initial capture of the data. Using a short-form tagging structure keyed to an editing worksheet, the offshore data entry operators apply brief mnemonic tags to the relevant elements in the article.
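
As a rough illustration of how short-form tags might later be expanded into XML, the sketch below maps a few mnemonic prefixes to element names. The prefixes (ti, au, ab), the element names, and the function short_form_to_xml are all invented for this example. The article does not list the actual worksheet tags or the NLM DTD elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical mnemonic-to-element mapping; the real worksheet tags and
# NLM element names are not given in the article.
TAG_MAP = {"ti": "ArticleTitle", "au": "Author", "ab": "Abstract"}


def short_form_to_xml(text):
    """Convert lines such as 'ti: Some title' into an XML string."""
    root = ET.Element("Article")  # hypothetical root element
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines in the keyed input
        tag, sep, value = line.partition(":")
        tag = tag.strip().lower()
        if not sep or tag not in TAG_MAP:
            # An unrecognized tag is treated as a conversion error.
            raise ValueError(f"line {line_no}: unknown or missing tag")
        ET.SubElement(root, TAG_MAP[tag]).text = value.strip()
    # ElementTree escapes characters such as '&' when serializing.
    return ET.tostring(root, encoding="unicode")
```

Raising an exception on an unknown tag gives a later processing step a clear signal that the file needs manual review.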

Automated Processing

The data is transferred overnight via FTP directly into a "watched" folder on ATLIS's server. As each file is received, it is converted into an XML-tagged instance and spell-checked, and a formatted proof is spooled to a high-speed laser printer. If any errors are found in a file during conversion, the file is moved to a separate ERRORS folder for review and manual correction. All of this processing is unattended.
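
The unattended routing described above could be sketched roughly as follows. This is a minimal illustration under assumptions, not ATLIS's actual software: process_incoming and its convert argument are hypothetical names, and the spell-check and print-spooling steps are omitted.

```python
import shutil
from pathlib import Path


def process_incoming(watched, errors, convert):
    """Run convert() on each file in watched; move failures to errors.

    convert is any callable that takes a Path and raises an exception
    when the file cannot be converted (a hypothetical stand-in for the
    real conversion step). Returns (converted, failed) name lists.
    """
    watched, errors = Path(watched), Path(errors)
    errors.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for path in sorted(watched.iterdir()):
        if not path.is_file():
            continue  # skip subfolders
        try:
            convert(path)
        except Exception:
            # Set the file aside for review and manual correction.
            shutil.move(str(path), str(errors / path.name))
            failed.append(path.name)
        else:
            converted.append(path.name)
    return converted, failed
```

A scheduler or a polling loop would call this function periodically. A production version would also need to confirm that each FTP transfer had finished before touching the file.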

ATLIS staff review the proof prints for content and spelling. Any errors found are corrected using an off-the-shelf XML editing package that enforces conformance to the DTD.

Validation Filters

Although XML allows entity references for special characters as well as the full Unicode character repertoire, NLM has mandated that only a subset of accented letters is allowable in its data, and these characters must be supplied as UTF-8 encoded sequences. In addition, blank lines and structured formatting are perfectly permissible in XML but are not acceptable to NLM's import programs. Because the XML editing software cannot be configured to disallow character sequences or formats that are otherwise valid XML, ATLIS has developed a validation filter that will not allow a file to pass the logout portion of its procedures if any non-standard character or invalid formatting is detected.
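
A filter of this kind might look something like the sketch below. It rests on several assumptions: the allowed accented letters in ALLOWED_ACCENTED are placeholders (NLM's actual subset is not listed here), "structured formatting" is interpreted as leading indentation, and find_violations is a hypothetical name.

```python
# Placeholder subset; NLM's real list of permitted accented letters
# is not given in the article.
ALLOWED_ACCENTED = set("áàâäéèêëíîïñóôöúùûüç")


def find_violations(data: bytes):
    """Return a list of problems found in the raw bytes of an XML file."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8 at byte {exc.start}"]
    lines = text.replace("\r\n", "\n").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single trailing newline is not a blank line
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {line_no}: blank line")
            continue
        if line[0] in " \t":
            # Assumption: indentation counts as "structured formatting".
            problems.append(f"line {line_no}: leading whitespace")
        for col, ch in enumerate(line, start=1):
            printable_ascii = 0x20 <= ord(ch) <= 0x7E
            if not (printable_ascii or ch in ALLOWED_ACCENTED):
                problems.append(
                    f"line {line_no}, column {col}: "
                    f"disallowed character U+{ord(ch):04X}"
                )
    return problems
```

A file would pass logout only when the returned list is empty. Because the check runs on raw bytes and text rather than a parsed tree, it can reject content that an XML parser would accept.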