
-----------------------
International standardization recognition in the field of natural languages
-----------------------

The following describes how LORIA (Lorraine Research Laboratory in Computer Science and its Applications), a joint unit of INRIA, CNRS and the three Nancy universities, became a national and now international reference site in the field of natural language processing, and especially in its standardization, a topic on which a dozen researchers work.

Linguistic computing was invented at practically the same time as computer science. Right from the start, the idea arose of using computers for automatic translation, a feat that is still out of reach today. The deployment of the Internet and the proliferation of electronic documents have since only increased the opportunities and needs in indexing, classification, text search and dialog transcription, in any language and independently of how languages evolve over time.

The whole difficulty is to achieve generic solutions that are applicable on an international scale and that can be parametrized for a given language or a special need. To get an idea of the complexity of this task, take the example of a lexicon (a dictionary) and the notion of an adjective with all its possible inflected forms: in French, you have the “masculine” and “feminine” genders and the “singular” and “plural” numbers, whereas in Japanese you also have to account for negation and tense. It thus became clear from the beginning of the 1990s that the only way to achieve long-term management of the world's linguistic resources was through standardization.
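The adjective example can be sketched in a few lines of Python. This is only an illustration of the idea of a generic model parametrized per language, not a representation taken from any standard: the names `LanguageProfile` and `adjective_paradigm` are hypothetical, and the Japanese profile is deliberately simplified to the two dimensions mentioned above (negation, modeled here as polarity, and tense).

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class LanguageProfile:
    """Hypothetical per-language parametrization of a generic lexicon model."""
    name: str
    # Maps a feature name to the values it may take for adjectives.
    adjective_features: dict


def adjective_paradigm(profile):
    """List every feature combination an adjective entry may have to cover."""
    names = list(profile.adjective_features)
    value_lists = [profile.adjective_features[n] for n in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Illustrative profiles only; real languages need richer descriptions.
FRENCH = LanguageProfile(
    "French",
    {"gender": ("masculine", "feminine"), "number": ("singular", "plural")},
)
JAPANESE = LanguageProfile(
    "Japanese",
    {"polarity": ("affirmative", "negative"), "tense": ("non-past", "past")},
)

for profile in (FRENCH, JAPANESE):
    print(profile.name)
    for cell in adjective_paradigm(profile):
        print("  ", cell)
```

The same generic function serves both languages; only the profile changes, which is the kind of parametrization a standard has to make possible.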

A way of speaking the same language
The first international standard in the field concerned terminology, that is to say the vocabulary that is specific to a given industry, science or institution. The ISO, the International Organization for Standardization, has been thinking about this problem since its inception in 1947, within a dedicated technical committee (TC 37). Indeed, by definition, any standard uses a specific terminology and its multiple translations. The value of such an approach is also easy to understand in view of the 360,000 pages of European Community institutional texts that now have to be translated into 25 languages, a titanic task.
“The ISO called on us in 2000 on the basis of our work in linguistic modeling, in order to try and find a standardization solution in terminology,” says Laurent Romary, head of the Langue et Dialogue project at LORIA. “At the time, two standards were in competition, an American and a European one. Using our more abstract approach, we were able to unify the two within a common specification platform, which is now a reference.” Laurent Romary is the editor of this standard (ISO 16642, or TMF for Terminological Markup Framework), which was published in 2003, only three years after his first dealings with the ISO.

The value of a standardized platform is that it allows a given user, say a company, to create its own terminological format, specific to its activity, and to exchange documents on this basis with, for example, subcontractors, service providers or clients who adopt the same format. In a sense, it provides a way of speaking the same language. The ISO 16642 standard was very successful and has already been implemented many times, including by IBM and by Daimler-Benz. A white paper published in 2005 describes how to use it, with concrete application recommendations. This work results from a collaboration between LORIA, industry partners such as EDF and EADS, and institutional partners such as the INIST (Institute for Scientific and Technical Information, the CNRS information center), in the framework of a national INRIA research and development initiative called SYNTAX.

Dealing with large volumes of electronic documents
Another international organization, the TEI (Text Encoding Initiative) consortium, took an interest in INRIA's work. The TEI is a gathering of international institutional partners interested in managing large quantities of electronic archives. The consortium was created in 1987 to define durable text formats for libraries, universities, museums, publishers and so on. Given the quality of the work carried out by the consortium, INRIA first drew on its encoding guidelines to publish its own written documents and dialog transcriptions. “We progressively integrated our own tools to annotate the texts and prioritize the information, etc.” says Laurent Romary. “Our work then attracted interest on the part of the consortium and in 2000 we were asked to participate in its Scientific Board, which we still do today.”
Even more rewarding, LORIA and two CNRS units, the INIST and the ATILF (Computer Analysis and Processing of the French Language), together constitute one of the four TEI host sites, alongside the University of Virginia and Brown University (Providence) in the United States, and Oxford University in Great Britain. Nancy research scientists will bring their skills in data modeling to define more generic text formats. On the national scale, this collaboration also makes it possible to deploy TEI standards in specific contexts, for example to standardize gray literature (scientific production, activity reports, doctoral dissertations, and so on).

Standardizing lexicons and contents
It is thus no surprise that a LORIA researcher, Laurent Romary, was called on to chair the new ISO subcommittee dedicated to the standardization of linguistic resources, founded in 2002. Its objective is to standardize all the information necessary for linguistic engineering, for example for spelling or grammar checking, automatic translation, information extraction, etc. In this framework, a team associating INRIA (Gil Francopoulo) and the U.S. Department of Defense is developing a standardized platform called LMF (Lexical Markup Framework), this time to represent wide-coverage lexicons rather than terminologies. They plan to co-publish a standard that models the representations associated with words. In France, the work involves a national network of more than fifty industry and institutional contributors.
The subcommittee is also working on another standard format for representing content, called MLIF (Multilingual Information Framework). The standard would be suited, for example, to translation memories (frequently occurring phrases that translators store for reuse), the localization of certain key messages in software, and DVD subtitling, among others.

© INRIA - updated 08/29/2006 - dri-webmaster@inria.fr