This is the story of how LORIA (the Lorraine Research Laboratory in Computer Science
and its Applications), a joint department of INRIA, CNRS and the three Nancy
universities, became a national and now international reference site in the
field of natural language processing, and especially in its standardization,
which is the work topic of a dozen researchers.
Computational linguistics emerged at practically the same time as computer science.
Right from the start, the idea arose of using computers for automatic translation,
a feat that remains out of reach today. The spread of the Internet and
the proliferation of electronic documents have only increased the opportunities
and needs in indexing, classification, text search and dialog transcription,
in any language and regardless of how languages evolve over time.
The whole difficulty is to achieve generic solutions that are applicable on
an international scale and that can be parametrized for a given
language or a special need. To get an idea of the complexity of this
task, consider a lexicon (a dictionary) and the notion
of an adjective with all its possible inflected forms: in French, you have the
“masculine” and “feminine” genders and the “singular”
and “plural” numbers, whereas in Japanese you also have to
consider the possibility of negation and tense agreement. It has thus been clear
since the beginning of the 1990s that the only way to achieve long-term
management of the world's linguistic resources is through standardization.
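The adjective example can be made concrete with a minimal sketch of a generic
lexicon model whose grammatical features are supplied as per-language
parameters. The names FeatureSet, paradigm_slots and validate, as well as the
simplified French and Japanese feature inventories, are assumptions made for
illustration only; they are not taken from any ISO standard or LORIA tool.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class FeatureSet:
    """Language-specific parameters: the features an adjective inflects for."""
    language: str
    features: dict  # feature name -> tuple of allowed values


# Hypothetical, deliberately simplified inventories based on the example above.
FRENCH_ADJECTIVE = FeatureSet(
    language="fr",
    features={
        "gender": ("masculine", "feminine"),
        "number": ("singular", "plural"),
    },
)
JAPANESE_ADJECTIVE = FeatureSet(
    language="ja",
    features={
        "polarity": ("affirmative", "negative"),
        "tense": ("non-past", "past"),
    },
)


def paradigm_slots(feature_set):
    """Enumerate every feature combination a lexicon entry has to cover."""
    names = list(feature_set.features)
    combos = product(*(feature_set.features[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def validate(feature_set, form_features):
    """Check that a form only uses features allowed for its language."""
    return all(
        name in feature_set.features and value in feature_set.features[name]
        for name, value in form_features.items()
    )
```

The generic code (paradigm_slots, validate) never mentions a particular
language; only the FeatureSet values change, which is the kind of
parametrization the paragraph above describes.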
A way of speaking the same language
The first international standard in the field concerned terminology, that is
to say the vocabulary that is specific to a given industry, science or
institution. The ISO, the International Organization for Standardization, has
been working on this problem since its inception in 1947, within a dedicated
technical committee (TC 37). Indeed, by definition, any standard uses a specific
terminology and its multiple translations. The value of such an approach is
easy to grasp in view of the 360,000 pages of texts from European Community
institutions that now have to be translated into 25 languages... a titanic
task.
“The ISO called on us in 2000 on the basis of our work in linguistic modeling,
in order to try and find a standardization solution in terminology,” says
Laurent Romary, Head of project Langue et Dialogue at LORIA. “At the time,
two standards were in competition, an American and a European one. Using our
more abstract approach, we were able to unify the two within a common specification
platform, which is now a reference.” Laurent Romary is the editor of this
standard (ISO 16642, or TMF for Terminological Markup Framework) that was published
in 2003, only three years after his first dealings with the ISO.
The value of a standardized platform is that it allows a given
user, say a company, to create their own terminological format, specific to
their activity, and to exchange documents on this basis with, for example, subcontractors,
service providers or clients who adopt the same format. In a sense, it provides
a way of speaking the same language. The ISO 16642 standard was very successful
and has already been implemented many times, including by IBM and by Daimler-Benz.
A white paper published in 2005 describes how to use it, with concrete application
recommendations. This work results from a collaboration between LORIA,
industry partners such as EDF and EADS, and institutional partners such as the INIST
(Institute for Scientific and Technical Information, the CNRS information center),
in the framework of a national INRIA research and development initiative called
SYNTAX.
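The idea of one abstract model serving several company-specific formats can be
sketched as follows. This is an illustration only: the Entry and Term classes,
the serialize function and both sets of element names are hypothetical and are
not taken from ISO 16642.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Term:
    language: str
    text: str


@dataclass
class Entry:
    """Abstract terminological entry, independent of any concrete format."""
    identifier: str
    subject_field: str
    terms: list


def serialize(entry, tags):
    """Render an abstract entry using a format-specific set of element names."""
    root = ET.Element(tags["entry"], {"id": entry.identifier})
    ET.SubElement(root, tags["subject"]).text = entry.subject_field
    for term in entry.terms:
        section = ET.SubElement(
            root, tags["language_section"], {"lang": term.language}
        )
        ET.SubElement(section, tags["term"]).text = term.text
    return ET.tostring(root, encoding="unicode")


# Two hypothetical house formats that share the same underlying structure.
FORMAT_A = {"entry": "termEntry", "subject": "subjectField",
            "language_section": "langSet", "term": "term"}
FORMAT_B = {"entry": "concept", "subject": "domain",
            "language_section": "language", "term": "designation"}

entry = Entry("e1", "energy",
              [Term("en", "power plant"), Term("fr", "centrale électrique")])
xml_a = serialize(entry, FORMAT_A)
xml_b = serialize(entry, FORMAT_B)
```

Because both outputs derive from the same abstract entry, a document written in
one format can in principle be converted to the other by swapping the element
names, without any loss of information.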
Dealing with large volumes of electronic documents
Another international organization, the TEI (Text Encoding Initiative) consortium,
also took an interest in INRIA's work. The TEI brings together international institutional
partners interested in managing large quantities of electronic archives. The
consortium was created in 1987 to define durable text formats
for libraries, universities, museums, publishers and so on. Given the quality
of the work carried out by the consortium, INRIA first drew on
its representation guidelines to publish its own written documents and dialog
transcriptions. “We progressively integrated our own tools to annotate
the texts and prioritize the information, etc.” says Laurent Romary. “Our
work then attracted interest on the part of the consortium and in 2000 we were
asked to participate in its Scientific Board, which we still do today.”
What is even more rewarding is that LORIA and two CNRS units, INIST
and ATILF (Computer Analysis and Processing of the French Language), constitute
one of the four TEI host sites, alongside the University of Virginia and Brown
University (in Providence) in the United States, and Oxford University in Great Britain.
Nancy research scientists will bring their skills in data modeling to define
more generic text formats. On the national scale, this collaboration also makes
it possible to deploy TEI standards in specific contexts, for example to standardize
gray literature (scientific production, activity reports, doctoral dissertations,
and so on).
Standardizing lexicons and contents
It is thus no surprise that a LORIA researcher, Laurent Romary, was
called on to chair the new ISO subcommittee dedicated to the standardization
of linguistic resources, founded in 2002. Its objective is to standardize
all the information needed for language engineering, for example for spelling
or grammar checking, automatic translation, information extraction, etc.
In this framework, a team associating INRIA (Gil Francopoulo) and the U.S. Department
of Defense is developing a standardized platform called LMF (Lexical Markup
Framework), this time to represent broad-coverage lexicons rather than terminologies.
They plan to co-publish a standard that models the representations associated
with words. In France, the work involves a national network of more than fifty
industry and institutional contributors.
The subcommittee is also working on another standard format to represent content,
called MLIF (Multilingual Information Framework). The standard would be suited,
for example, to translation memories (recurring phrases recorded by translators
as they come up frequently), the localization of key messages in software,
and DVD subtitling, among others.
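To give an idea of what a translation memory does, here is a minimal sketch
that stores source and target phrase pairs and retrieves the closest stored
phrase for a new sentence. It illustrates the general concept only and says
nothing about how MLIF represents such data; the TranslationMemory class, its
methods and the 0.8 similarity threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: exact and approximate phrase lookup."""

    def __init__(self):
        self._segments = {}  # source phrase -> stored translation

    def add(self, source, target):
        self._segments[source] = target

    def lookup(self, sentence, threshold=0.8):
        """Return (translation, similarity) for the best match, or None."""
        best_source, best_score = None, 0.0
        for source in self._segments:
            # Case-insensitive character-level similarity between 0 and 1.
            score = SequenceMatcher(
                None, source.lower(), sentence.lower()
            ).ratio()
            if score > best_score:
                best_source, best_score = source, score
        if best_source is not None and best_score >= threshold:
            return self._segments[best_source], best_score
        return None


memory = TranslationMemory()
memory.add("Save the file", "Enregistrer le fichier")
exact = memory.lookup("Save the file")
fuzzy = memory.lookup("Save the files")
missing = memory.lookup("Quit")
```

Here exact and fuzzy both return the stored French phrase, while missing is
None because no stored phrase is similar enough. A shared exchange format
matters precisely because such memories are built up over years and need to
move between tools and translators.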