WO2004013766A1 - Generator of a corpus specific to a domain - Google Patents
Generator of a corpus specific to a domain
- Publication number
- WO2004013766A1 (PCT/EP2003/050315)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- rank
- distance
- word
- words
- sentence
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
Definitions
- the present invention belongs to the field of automatic natural language processing. More particularly, it addresses the problem of generating a set of texts, or corpus, specific to a given application domain.
- Specific corpora are necessary, in particular, for computerized speech recognition systems to reach a recognition rate acceptable to the user. This is especially true of large-vocabulary systems (typically 20,000 words).
- in the state of the art, the generation of a specific corpus is a processing which still requires long, progressive learning operations, based on selecting the texts of a corpus for a given domain from sentences entered by the user.
- This method has the disadvantage of requiring long and costly user interactions.
- the present invention overcomes this drawback by allowing the user to formulate a specification of the specific corpus in the form of a grammar, or set of syntax rules, specific to the application domain; the generation of the specific corpus is then automatic.
- the invention thus significantly reduces the time required to collect the specific corpus.
- the invention thus provides a program product and a method for collecting a set of texts specific to an application domain from a set of non-specific texts, characterized in that it comprises a module driven by a grammar of the application domain.
- Figure 1 Diagram showing the modules of the product / program according to the invention.
- Figure 2 Diagram showing an example of generation of n-grams of a specific grammar of an application domain.
- Figure 3 Diagram explaining an algorithm for calculating the distances between words according to the invention.
- Figure 4 Example of calculating the distances between words according to the invention.
- Figure 6 Graph showing the distribution of n-grams as a function of the distance to words of the grammar vocabulary.
- Figure 1 shows the sequence of modules and treatments according to the invention. The definitions in the figure are as follows:
- the general corpus 10 is a set of texts, commercially available, not specific to a field, which can contain several million texts.
- n-grams[VCORPUS] 13 is a set of ordered word sequences (n-tuples) extracted from the general corpus, said words being present in the vocabulary. The manner in which these n-grams are constituted is described below.
- the vocabulary of this corpus, VCORPUS 11, is the set of words most frequently encountered in this corpus, i.e. the set of uni-grams. The vocabulary is generally limited to 20,000 words.
- the AEF generator 20 is a module which makes it possible to generate the n-grams of a grammar of the domain λ from said grammar, in a manner also explained in the following description.
- a set n-grams[VCFG(λ)] 33 is generated from the grammar CFG(λ) 30 in a manner explained in the following description.
- the specific corpus of λ, CORPUS(λ) 40, is initialized with the n-grams[VCFG(λ)] 33.
- to CORPUS(λ) 40 are then added the n-grams of VCORPUS 13 which fulfil the condition that their distance to the n-grams of the grammar is below the threshold θ, where:
- θ is the distance threshold, which must be adjusted so as to optimize the constitution of CORPUS(λ) 40 for recognition applications specific to the domain λ.
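The selection step described above can be sketched as follows; this is a purely illustrative implementation, in which the function names, the toy distance function and the threshold handling are assumptions rather than the patent's actual code:

```python
def build_specific_corpus(corpus_ngrams, grammar_ngrams, distance, theta):
    """Sketch of the selection step: CORPUS(lambda) is initialized with
    the grammar n-grams, then each n-gram of the general corpus is kept
    when its distance to the closest grammar n-gram is below theta."""
    specific = list(grammar_ngrams)  # initialization with n-grams[VCFG(lambda)]
    for g in corpus_ngrams:
        if min(distance(g, h) for h in grammar_ngrams) < theta:
            specific.append(g)
    return specific
```

Any word-sequence distance (such as the one defined later in the description) can be passed in as `distance`; the threshold `theta` plays the role of θ above.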
- the n-grams of VCORPUS 13 will be bi-grams or tri-grams.
- a bi-gram is a set of two words belonging to the vocabulary VCORPUS 11, with which are associated their probabilities of occurrence in the general corpus 10.
- tri-grams are sets of three words, in the order in which they appear in the general corpus 10, with which their probabilities of occurrence in the general corpus 10 are associated.
- to generate the n-grams[VCORPUS], one can use commercial tools generally designated under the generic name of statistical language modelling tools. One can for example use the toolkit developed by Carnegie Mellon University, described by Philip Clarkson and Ronald Rosenfeld in a university publication [Rosenfeld 95]: Rosenfeld R., "The CMU Statistical Language Modeling Toolkit and its use", ARPA Spoken Language Technology Workshop, Austin, Texas (USA), pp. 45-50, 1995. This article is incorporated by reference into the present description. Most statistical language models, and in particular the one described in the referenced article, correct the extreme occurrence probabilities so as to eliminate the bias which is classic in this type of statistical analysis: the least observed n-grams have a probability of occurrence biased downwards, and the most observed a probability biased upwards.
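As an illustration of what such a statistical language model computes, a minimal count-based bigram model might look like the sketch below; all names here are hypothetical, and no smoothing is applied (whereas the CMU toolkit additionally discounts the biased extreme counts discussed above):

```python
from collections import Counter

def bigram_model(sentences):
    """Minimal count-based bigram model: maps each bigram (pair of
    adjacent words) to its relative frequency in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for pair in zip(words, words[1:]):
            counts[pair] += 1
    total = sum(counts.values())
    return {bigram: c / total for bigram, c in counts.items()}
```

A real system would add discounting (e.g. Good-Turing or Witten-Bell, as offered by the CMU toolkit) so that unseen or rare bigrams receive non-zero, de-biased probabilities.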
- the grammar CFG(λ) 30 is a context-free grammar, meaning that its production rules apply regardless of the surrounding context. This grammar is, in the state of the art, created manually.
- the n-grams[VCFG(λ)] 33 will typically be tri-grams or quadri-grams. They are created by the AEF generator 20, an example of which is described in Figure 2. The generation of the n-grams of CFG(λ) 30 takes place as follows:
- GRAMMAR: unit (alpha OR bravo) (join OR go to) the unit
- VCFG(λ) = (unit, alpha, bravo, join, go to, unit).
- VCFG(λ) 16
- the uni-grams are: unit, alpha, bravo, join, go to, unit (we fall back on VCFG(λ)). The bi-grams are: alpha unit, bravo unit, alpha join, alpha go to, bravo join, bravo go to, join the unit, go to the unit, the alpha unit, the bravo unit.
- the vocabulary VCFG(λ) 31 is the set of uni-grams.
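The enumeration of sentences and n-grams from a small finite grammar of this kind can be sketched as follows; the slot-based encoding of the grammar is an assumption made for illustration, not the AEF generator's actual internal representation:

```python
from itertools import product

def sentences(slots):
    """Expand a finite grammar given as a sequence of slots, each slot
    being a list of alternatives (each alternative a tuple of words)."""
    for choice in product(*slots):
        yield tuple(word for alternative in choice for word in alternative)

def ngrams(sentence, n):
    """All contiguous n-word sub-sequences of a sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

# Hypothetical encoding of the example grammar:
# unit (alpha OR bravo) (join OR go to) the unit
grammar = [
    [("unit",)],
    [("alpha",), ("bravo",)],
    [("join",), ("go", "to")],
    [("the", "unit")],
]
```

Collecting `ngrams(s, 2)` over all generated sentences yields the bi-gram set of the grammar, and `ngrams(s, 3)` the tri-grams.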
- Figure 3 illustrates the operation of the algorithm for calculating the distance between two words of a dictionary. In the application, the three dictionaries 10a, 12 and 32 of Figure 1 are used.
- the dico-VCORPUS 12 and dico-VCFG(λ) 32 dictionaries are extracted from a general dictionary 10a, which is a commercially available component.
- this general dictionary provides information on the inflected forms of words, such as their pronunciation and the root of the word.
- it also provides semantic information, which can be represented in the form of a graph or of conceptual vectors. The algorithm has three steps:
- the editing distance returns the minimum number of editing operations necessary to transform word a into word b.
- these editing operations are generally the insertion of a letter, the deletion of a letter and the substitution of one letter for another.
- let D(a, b) be the function which returns the editing distance (Levenshtein distance) needed to transform a into b.
- let Dmax be the maximum distance between any two words.
- let a ∈ VCFG(λ) and b ∈ VCORPUS be the two words whose distance we want to measure.
- any distance calculation function between a and b can be used. It is however preferable that the function D be piecewise continuous and increasing as a function of the edit distance.
- An example of the algorithm for calculating the distance between words is given below.
- VCORPUS is defined by VCORPUS = {"show", "horse"}.
- the shortest distance is that of the couple (display, show). Insertion is indeed less costly than deletion: deletion leads to a loss of information, while insertion merely adds noise to it.
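A sketch of the Levenshtein computation with configurable operation costs, so that insertion can be made cheaper than deletion as suggested above; the specific cost values are hypothetical, the patent only requiring some increasing function of the edit distance:

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Levenshtein distance between words a and b, with configurable
    insertion, deletion and substitution costs. Setting ins < dele
    encodes the preference for insertion over deletion noted above."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + dele          # delete all of a
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + ins           # insert all of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j] + dele,           # delete a[i-1]
                M[i][j - 1] + ins,            # insert b[j-1]
                M[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub),
            )
    return M[m][n]
```

With equal unit costs this is the classic Levenshtein distance; with, say, `ins=1, dele=2`, transforming a word by insertion is half as costly as transforming it by deletion.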
- the semantic calculation is made from semantic dictionaries.
- there are several forms of semantic dictionaries, two in particular: those based on graphs and those based on vectors. On the example of colors, if the semantic dictionary is a graph, we can obtain the diagram of Figure 5;
- the distance between colors and red is 2.
- the distance between red and green is 1.
- the distances are integer values, which makes it easier to build the analysis tables that will allow the threshold to be chosen.
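On a graph-based semantic dictionary, the integer distance between two words is the length of the shortest path between their nodes, which can be computed by breadth-first search. The small color graph below is a hypothetical stand-in for the (unreproduced) graph of Figure 5, chosen only so that the distances stated above hold:

```python
from collections import deque

def graph_distance(graph, start, goal):
    """Shortest-path length (in hops) between two nodes of a semantic
    graph, by breadth-first search; the result is always an integer."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, d + 1))
    return None  # the two nodes are not connected

# Hypothetical stand-in for the Figure 5 graph; "warm"/"cold" are
# invented intermediate nodes.
color_graph = {
    "colors": ["warm", "cold"],
    "warm": ["red"],
    "cold": ["green"],
    "red": ["green"],
}
```

On this toy graph the distance between "colors" and "red" is 2, and between "red" and "green" is 1, matching the values given above.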
- the distance between two word sequences is obtained by dynamic programming over a matrix M, with a recurrence of the form M(i, j) = min( M(i-1, j) + D(x_i, ε), M(i, j-1) + D(ε, y_j), M(i-1, j-1) + D(x_i, y_j) ), where x_i and y_j denote the i-th and j-th words of the two sequences and ε the empty word.
- the distance between two n-grams uses the distance between words: the distance between the two sentences M1 M2 M3 M4 and M1 M2 M3 M5 is equal to the distance between the sequences M1 M2 M3 M4 and M1 M2 M3 M5, given the unit distance matrix D(Mi, Mj) calculated previously.
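A dynamic programme of this form over word sequences can be sketched as follows; a single hypothetical `gap` cost stands in for the insertion and deletion of a word, and any word-level distance D can be plugged in:

```python
def sequence_distance(xs, ys, D, gap):
    """Distance between two word sequences xs and ys: the same dynamic
    programme as the letter-level edit distance, but each cell combines
    a word-insertion/deletion cost `gap` with the word distance D."""
    m, n = len(xs), len(ys)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j] + gap,                       # delete xs[i-1]
                M[i][j - 1] + gap,                       # insert ys[j-1]
                M[i - 1][j - 1] + D(xs[i - 1], ys[j - 1]),  # align the two words
            )
    return M[m][n]
```

With a 0/1 word distance, the distance between M1 M2 M3 M4 and M1 M2 M3 M5 reduces to the cost of exchanging M4 for M5, consistent with the sentence example above.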
- the implementation of the invention is possible on a commercial computer of any type, provided with conventional interfaces for data input and output (keyboard, mouse, screen, printer). Integration with a voice recognition system is possible on a common configuration.
- the computer system also has a microphone, speakers, a specialized signal processing card and specialized voice recognition software.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2003262524A AU2003262524A1 (en) | 2002-07-26 | 2003-07-16 | Generator of a corpus specific to a field |
EP03766404A EP1540512A1 (fr) | 2002-07-26 | 2003-07-16 | Generateur d'un corpus specifique a un domaine |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0209531 | 2002-07-26 | ||
FR0209531A FR2842923B1 (fr) | 2002-07-26 | 2002-07-26 | Generateur d'un corpus specifique a un domaine |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004013766A1 true WO2004013766A1 (fr) | 2004-02-12 |
Family
ID=30011528
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2003/050315 WO2004013766A1 (fr) | 2002-07-26 | 2003-07-16 | Générateur d'un corpus spécifique à un domaine |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1540512A1 (fr) |
AU (1) | AU2003262524A1 (fr) |
FR (1) | FR2842923B1 (fr) |
WO (1) | WO2004013766A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5444617A (en) * | 1992-12-17 | 1995-08-22 | International Business Machines Corporation | Method and apparatus for adaptively generating field of application dependent language models for use in intelligent systems |
US5995918A (en) * | 1997-09-17 | 1999-11-30 | Unisys Corporation | System and method for creating a language grammar using a spreadsheet or table interface |
EP1100075A1 (fr) * | 1999-11-11 | 2001-05-16 | Deutsche Thomson-Brandt Gmbh | Procédé de construction d'un dispositif de reconnaissance de la parole |
-
2002
- 2002-07-26 FR FR0209531A patent/FR2842923B1/fr not_active Expired - Lifetime
-
2003
- 2003-07-16 EP EP03766404A patent/EP1540512A1/fr not_active Withdrawn
- 2003-07-16 AU AU2003262524A patent/AU2003262524A1/en not_active Abandoned
- 2003-07-16 WO PCT/EP2003/050315 patent/WO2004013766A1/fr not_active Application Discontinuation
Non-Patent Citations (1)
Title |
---|
TAZINE, C.: "Création automatique de modèle de langage n-grammes depuis Internet par une mesure de distance", TALN, CORPUS ET WEB 2002, XP002238230, Retrieved from the Internet <URL:http://www-lli.univ-paris13.fr/colloques/tcw2002/node16.html> [retrieved on 20030414] * |
Also Published As
Publication number | Publication date |
---|---|
FR2842923B1 (fr) | 2004-09-24 |
FR2842923A1 (fr) | 2004-01-30 |
EP1540512A1 (fr) | 2005-06-15 |
AU2003262524A1 (en) | 2004-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10665226B2 (en) | System and method for data-driven socially customized models for language generation | |
US9886432B2 (en) | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models | |
EP1836651B1 (fr) | Procédé de recherche, reconnaissance et localisation d'un terme dans l'encre, dispositif, programme d'ordinateur correspondants | |
US10176168B2 (en) | Statistical machine translation based search query spelling correction | |
WO2016045465A1 (fr) | Procédé de présentation d'informations sur la base d'une entrée et système associé au procédé d'entrée | |
FR2896603A1 (fr) | Procede et dispositif pour extraire des informations et les transformer en donnees qualitatives d'un document textuel | |
FR2963841A1 (fr) | Systeme de traduction combinant des modeles hierarchiques et bases sur des phases | |
WO2017161899A1 (fr) | Procédé, dispositif et appareil informatique de traitement de texte | |
WO2002067142A2 (fr) | Dispositif d'extraction d'informations d'un texte a base de connaissances | |
CN112256822A (zh) | 文本搜索方法、装置、计算机设备和存储介质 | |
FR3007164A1 (fr) | Procede de classification thematique automatique d'un fichier de texte numerique | |
WO2022183923A1 (fr) | Procédé et appareil de génération d'expressions, et support de stockage lisible par ordinateur | |
US20120239382A1 (en) | Recommendation method and recommender computer system using dynamic language model | |
US8069032B2 (en) | Lightweight windowing method for screening harvested data for novelty | |
Beaufays et al. | Language model capitalization | |
EP1540512A1 (fr) | Generateur d'un corpus specifique a un domaine | |
FR3031823A1 (fr) | Lemmatisateur semantique base sur des dictionnaires ontologiques. | |
FR3030809A1 (fr) | Procede d'analyse automatique de la qualite litteraire d'un texte | |
WO2013117872A1 (fr) | Procede d'identification d'un ensemble de phrases d'un document numerique, procede de generation d'un document numerique, dispositif associe | |
FR2880708A1 (fr) | Procede de recherche dans l'encre par conversion dynamique de requete. | |
Dwivedi et al. | Neural Machine Translation and Detailed Analysis of Impact of Pre-Processing Techniques | |
FR3116355A1 (fr) | Détection d’au moins un thème partagé par une pluralité de documents textuels | |
Djerrad | Analyse des sentiments des tweets liés au Hirak | |
EP4155967A1 (fr) | Procédé d'échanges d'informations sur un objet d'intérêt entre une première et une deuxième entités, dispositif électronique d'échange d'informations et produit programme d'ordinateur associés | |
FR3077148A1 (fr) | Procede et dispositif electronique de selection d'au moins un message parmi un ensemble de plusieurs messages, programme d'ordinateur associe |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REEP | Request for entry into the european phase |
Ref document number: 2003766404 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2003766404 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2003766404 Country of ref document: EP |
|
NENP | Non-entry into the national phase |
Ref country code: JP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: JP |