EP3347830A1 - Method for automatically establishing cross-language queries for a search engine - Google Patents
Method for automatically establishing cross-language queries for a search engine
- Publication number
- EP3347830A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- language
- word
- words
- target
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 47
- 239000013598 vector Substances 0.000 claims abstract description 129
- 238000012549 training Methods 0.000 claims abstract description 11
- 230000006870 function Effects 0.000 claims description 55
- 238000004458 analytical method Methods 0.000 claims description 6
- 238000004364 calculation method Methods 0.000 claims description 6
- 238000013519 translation Methods 0.000 claims description 5
- 238000010606 normalization Methods 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 3
- 230000002123 temporal effect Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3337—Translation of the query language, e.g. Chinese to English
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Definitions
- the invention relates to the field of computer science applied to language. More specifically, the invention relates to a method for automatically establishing cross-language queries for a search engine.
- a known method called Skip-gram (Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781) learns word vectors in a way that allows a very large amount of data to be processed in a short time.
- the Skip-gram method can thus process a set of 1.6 billion words in less than a day.
- however, search engine queries based on word vectors can be made in only one language.
- the aim of the invention is to make it possible to establish, from a query word, queries executable by a search engine in several languages.
- the invention proposes a method for automatically establishing inter-language queries executed by a search engine, characterized in that, from a text file containing a training corpus comprising a set of sentences correspondingly expressed in at least two languages, the words of each of the two languages each being associated with a target vector, said method comprises:
- said method further comprises: i) a step of determining the M target vectors closest to said target vector associated with said query word,
- the aforementioned steps i) to iii) are repeated until the results returned by said search engine are free of the meaning of the query word to be filtered.
- the subtraction step is performed by applying the Gram-Schmidt orthonormalization process.
- each word of said training corpus being associated with a target vector and a context vector
- the step of aligning the target vectors comprises:
- steps for calculating cost functions called intra-language functions, for calculating the target vectors and the context vectors in each of the two languages
- steps of calculating cost functions called inter-language cost functions, respectively for aligning the target vectors of the words of a first language with respect to the context vectors of the words of a second language, as well as for aligning the target vectors of the words of the second language with respect to the context vectors of the words of the first language, and
- the step of calculating each intra-language cost function is performed by an iterative method implementing a sliding window in said training corpus and based on the analysis of a target vector of a word of interest of the window relative to the context vectors of the other words of the window, so-called context words, located around the word of interest and expressed in the same language as the word of interest.
- the intra-language cost function implemented in the Skip-Gram method is expressed as follows:
- [w-1: w + 1] is the window of words corresponding to a sentence of the training corpus centered around the word of interest w
- the steps of calculating the inter-language cost functions of one language with respect to another language are performed by an iterative method implementing a sliding window in the training corpus and based on the analysis of a target vector of a word of interest of the window relative to the context vectors of all the words in the window, including the word of interest, expressed in a language different from that of the word of interest.
- the inter-language cost function is expressed as follows:
- said method further comprises:
- the invention also relates to computer equipment of the computer or server type comprising a memory storing software instructions allowing the implementation of the method as previously defined.
- FIG. 1 shows a diagram of the various steps of the method for automatically establishing inter-language queries according to the present invention;
- FIG. 2 shows a diagram of the steps implemented to determine the aligned target vectors of words in two different languages;
- FIG. 3 is a table illustrating the query words that can be generated, thanks to the method according to the present invention, in 21 languages from a target vector associated with a single query word;
- FIG. 4 is a table illustrating the possibility of disambiguating a query word having several meanings by subtracting a target vector associated with a word of another language corresponding to the sense to be filtered.
- the method according to the present invention is implemented from a text file containing a training corpus C comprising a set of sentences correspondingly expressed in at least two languages, for example the English language "e" and the French language "f".
- the words of each of the two languages are each associated with a target vector w and a context vector c.
- the target vectors w and the context vectors c each have between 50 and 1000 components, for example 300.
- the method comprises, first, a step 100 of determining aligned target vectors w of the words in the two languages, such that the two target vectors w associated with two corresponding words in the two languages are closest to each other.
- once the target vector alignment step 100 has been performed, for a target vector associated with a word in a given first language, there is no other target vector closer than the one associated with the translation of that word in the other language.
- steps 201, 202 of calculating cost functions Je, Jf, called intra-language cost functions, are performed to calculate the target vectors w and the context vectors c in each of the two languages.
- an intra-language cost function Je for the English language and an intra-language cost function Jf for the French language are thus calculated.
- the step 201, 202 of calculating each intra-language cost function Je, Jf is performed by an iterative method implementing a sliding window in the training corpus C and based on the analysis of a target vector w of a word of interest of the window with respect to the context vectors c of the other words of the window, called context words, located around the word of interest and expressed in the same language as the word of interest.
- the word of interest is not taken into account when calculating the target vectors of the context words.
- w being the target vector of the word of interest, c corresponding to the context vector of the context word,
- steps 203, 204 of calculating cost functions Qe,f, Qf,e, called inter-language cost functions, are performed respectively to align the target vectors we of the words of the first language e with respect to the context vectors cf of the words of the second language f, and to align the target vectors wf of the words of the second language f with respect to the context vectors ce of the words of the first language e.
- the step 203, 204 of calculating each inter-language cost function Qe,f, Qf,e of one language with respect to another is performed by an iterative method implementing a sliding window in the training corpus C and based on the analysis of a target vector w of a word of interest of the window with respect to the context vectors c of all the words located in the window and expressed in a language different from that of the word of interest.
- the inter-language cost function Qe,f is expressed as follows:
- we is the target vector of the word of interest, cf corresponding to the context vector of a word in the language other than that of the word of interest,
- for a further language i, the cost functions mentioned above are supplemented by the intra-language cost function Ji as well as the inter-language cost functions Qi,e and Qe,i.
- the invention can thus easily align target vectors in more than 15 different languages.
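The two kinds of cost function described above can be sketched as follows. This is a minimal illustration, not the formulas of the application, which are not reproduced in this text: the negative log-sigmoid term is an assumption borrowed from the usual Skip-gram formulation, negative sampling is omitted, the inter-language window is assumed to be the whole aligned sentence, and all function and parameter names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def intra_language_cost(sentence, target, context, half_width):
    """Cost of one sentence within a single language.

    Each word of interest is compared with the context words around it;
    the word of interest itself is skipped.
    Assumption: each (w, c) pair contributes -log(sigmoid(w . c)).
    """
    cost = 0.0
    for i, word in enumerate(sentence):
        start = max(0, i - half_width)
        stop = min(len(sentence), i + half_width + 1)
        for j in range(start, stop):
            if j == i:
                continue  # the word of interest is not its own context word
            cost += -math.log(sigmoid(dot(target[word], context[sentence[j]])))
    return cost

def inter_language_cost(sentence_a, sentence_b, target_a, context_b):
    """Cost of one pair of aligned sentences across two languages.

    Each target vector of language A is compared with the context vectors
    of all words of the aligned sentence in language B.
    """
    cost = 0.0
    for word in sentence_a:
        for other in sentence_b:
            cost += -math.log(sigmoid(dot(target_a[word], context_b[other])))
    return cost
```

Under these assumptions, aligning two languages e and f amounts to minimising the sum of Je, Jf, Qe,f and Qf,e over the corpus, for example by stochastic gradient descent; the optimiser is likewise an assumption and is not specified in this text.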
- N words are retrieved in each of the languages considered, namely the words whose target vectors w are closest to the target vector w associated with a query word.
- the determination of the closest target vectors w is performed by minimizing the Euclidean distance between the vectors.
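The retrieval of the N closest words per language can be sketched as a brute-force search over aligned vectors. The toy two-component vectors, the word lists and the function names below are hypothetical; real target vectors would have on the order of 300 components.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_words(query_vector, vectors_by_language, n):
    """Return, for each language, the n words whose target vectors
    have the smallest Euclidean distance to the query vector."""
    result = {}
    for language, vectors in vectors_by_language.items():
        ranked = sorted(vectors, key=lambda word: euclidean(query_vector, vectors[word]))
        result[language] = ranked[:n]
    return result

# Toy aligned vectors (hypothetical values).
aligned = {
    "en": {"innovation": [1.0, 0.0], "novelty": [0.9, 0.1], "train": [0.0, 1.0]},
    "fr": {"innovation": [1.0, 0.05], "nouveauté": [0.8, 0.2], "gare": [0.1, 0.9]},
}
nearest = closest_words(aligned["en"]["innovation"], aligned, n=2)
print(nearest)
```

Because the vectors of all languages live in the same aligned space, a single query vector is enough to rank the vocabulary of every language.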
- in a step 103, the queries are then established and executed by a search engine from the N words previously retrieved in the languages in question.
- the method also implements a step 104 of displaying the results returned by the search engine.
- Figure 3 thus highlights that from a single query word, here the word "innovation", it is possible to search using the 10 words per language whose vectors are closest to the vector associated with the word "innovation", i.e. a search based on 210 search words when 21 languages are used.
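The query-building step and the count of 210 search words can be illustrated as below. The language codes, the placeholder words and the `search` callable are hypothetical stand-ins; the text does not specify the search engine interface.

```python
def build_queries(words_by_language):
    """Flatten {language: [words]} into a list of (language, word) queries."""
    return [(language, word)
            for language, words in words_by_language.items()
            for word in words]

def run_queries(queries, search):
    """Execute each query with a caller-supplied search function (hypothetical)."""
    return {query: search(language=query[0], word=query[1]) for query in queries}

# Placeholder data: 21 languages with 10 retrieved words each.
words_by_language = {
    f"lang{i:02d}": [f"word{i:02d}_{j}" for j in range(10)] for i in range(21)
}
queries = build_queries(words_by_language)
results = run_queries(queries, search=lambda language, word: [])
print(len(queries))  # 210
```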
- the invention thus makes it possible to obtain search results relating to the overall meaning of a word considered in a plurality of languages, without necessarily knowing those languages, thanks to the use of target vectors aligned across the different languages.
- the method may further include:
- this subtraction step is preferably performed by applying the Gram-Schmidt orthonormalization process.
- Figure 4 shows the list of Polish words whose target vectors are closest to that of the French word "train", accompanied by their translation into English.
- this list includes notions of vehicle, as well as temporal notions (e.g. being in the process of eating).
- the table shows that, if we subtract the target vector of the Italian word "sta", associated only with the temporal notion, from the target vector of the French word "train", we obtain a list of Polish words containing only words related to the notion of vehicle.
- the subtraction between target vectors in different languages thus eliminates one or more senses of a query word that the user wants to filter out during a search in order to disambiguate a term.
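The subtraction of a sense can be sketched as a single Gram-Schmidt projection step: the component of the query vector lying along the vector of the sense to be filtered is removed. This is only an illustration under assumptions: the final normalisation is omitted, the two-component vectors are invented so that one axis stands for the vehicle notion and the other for the temporal notion, and the Polish entries are placeholders rather than the words of Figure 4.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def remove_sense(query_vector, sense_vector):
    """Subtract from query_vector its projection onto sense_vector,
    leaving a vector orthogonal to the sense to be filtered."""
    scale = dot(query_vector, sense_vector) / dot(sense_vector, sense_vector)
    return [q - scale * s for q, s in zip(query_vector, sense_vector)]

# Hypothetical vectors: axis 0 = vehicle notion, axis 1 = temporal notion.
train_fr = [1.0, 1.0]   # French "train" carries both senses
sta_it = [0.0, 2.0]     # Italian "sta" carries only the temporal sense
filtered = remove_sense(train_fr, sta_it)

# Placeholder Polish vocabulary.
polish = {"vehicle_word": [1.0, 0.1], "temporal_word": [0.1, 1.0]}
closest = min(polish, key=lambda word: euclidean(filtered, polish[word]))
print(filtered, closest)
```

Before filtering, both placeholder words are equally distant from the query vector; after filtering, only the vehicle-related word remains close, which mirrors the behaviour described for Figure 4.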
- the aforementioned steps i) to iii) may be repeated by the user or automatically until results displayed by the search engine are free from the meaning of the query word to be filtered.
- the invention also relates to computer equipment of the computer or server type comprising a memory storing software instructions for implementing the method as previously described.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Machine Translation (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR1558249A FR3040808B1 (fr) | 2015-09-07 | 2015-09-07 | Procede d'etablissement automatique de requetes inter-langues pour moteur de recherche |
PCT/EP2016/070971 WO2017042161A1 (fr) | 2015-09-07 | 2016-09-06 | Procédé d'établissement automatique de requêtes inter-langues pour moteur de recherche |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3347830A1 (fr) | 2018-07-18 |
Family
ID=55542737
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP16766260.0A Pending EP3347830A1 (fr) | 2015-09-07 | 2016-09-06 | Procédé d'établissement automatique de requêtes inter-langues pour moteur de recherche |
Country Status (4)
Country | Link |
---|---|
US (1) | US11055370B2 (fr) |
EP (1) | EP3347830A1 (fr) |
FR (1) | FR3040808B1 (fr) |
WO (1) | WO2017042161A1 (fr) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6705506B2 (ja) * | 2016-10-04 | 2020-06-03 | 富士通株式会社 | 学習プログラム、情報処理装置および学習方法 |
US11100117B2 (en) * | 2019-06-14 | 2021-08-24 | Airbnb, Inc. | Search result optimization using machine learning models |
US11354513B2 (en) * | 2020-02-06 | 2022-06-07 | Adobe Inc. | Automated identification of concept labels for a text fragment |
US11416684B2 (en) | 2020-02-06 | 2022-08-16 | Adobe Inc. | Automated identification of concept labels for a set of documents |
CN113779205B (zh) * | 2020-09-03 | 2024-05-24 | 北京沃东天骏信息技术有限公司 | 一种智能应答方法和装置 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7251637B1 (en) * | 1993-09-20 | 2007-07-31 | Fair Isaac Corporation | Context vector generation and retrieval |
JP2001043236A (ja) * | 1999-07-30 | 2001-02-16 | Matsushita Electric Ind Co Ltd | 類似語抽出方法、文書検索方法及びこれらに用いる装置 |
US8051061B2 (en) * | 2007-07-20 | 2011-11-01 | Microsoft Corporation | Cross-lingual query suggestion |
US9430563B2 (en) * | 2012-02-02 | 2016-08-30 | Xerox Corporation | Document processing employing probabilistic topic modeling of documents represented as text words transformed to a continuous space |
WO2015029241A1 (fr) * | 2013-08-27 | 2015-03-05 | Nec Corporation | Procédé d'acquisition de traduction de mot |
CN104731771A (zh) * | 2015-03-27 | 2015-06-24 | 大连理工大学 | 一种基于词向量的缩写词歧义消除系统及方法 |
-
2015
- 2015-09-07 FR FR1558249A patent/FR3040808B1/fr active Active
-
2016
- 2016-09-06 US US15/757,649 patent/US11055370B2/en active Active
- 2016-09-06 EP EP16766260.0A patent/EP3347830A1/fr active Pending
- 2016-09-06 WO PCT/EP2016/070971 patent/WO2017042161A1/fr unknown
Also Published As
Publication number | Publication date |
---|---|
FR3040808B1 (fr) | 2022-07-15 |
WO2017042161A1 (fr) | 2017-03-16 |
FR3040808A1 (fr) | 2017-03-10 |
US11055370B2 (en) | 2021-07-06 |
US20190026371A1 (en) | 2019-01-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3347830A1 (fr) | Procédé d'établissement automatique de requêtes inter-langues pour moteur de recherche | |
CN107402913B (zh) | 先行词的确定方法和装置 | |
CN110377740B (zh) | 情感极性分析方法、装置、电子设备及存储介质 | |
US9727556B2 (en) | Summarization of a document | |
KR20160060247A (ko) | 자연어 질의응답 시스템과 방법 및 패러프라이즈 모듈 | |
FR2977343A1 (fr) | Syteme de traduction adapte a la traduction de requetes via un cadre de reclassement | |
FR2821186A1 (fr) | Dispositif d'extraction d'informations d'un texte a base de connaissances | |
US11200283B2 (en) | Cohesive visualized instruction representation | |
EP2126735B1 (fr) | Procédé de traduction automatique | |
US20140330792A1 (en) | Application of text analytics to determine provenance of an object | |
CN109670080A (zh) | 一种影视标签的确定方法、装置、设备及存储介质 | |
WO2007116042A1 (fr) | Procede de de-doublonnage rapide d'un ensemble de documents ou d'un ensemble de donnees contenues dans un fichier | |
Paul et al. | An affix removal stemmer for natural language text in nepali | |
US9817808B2 (en) | Translation using related term pairs | |
WO2016116459A1 (fr) | Procédé de lemmatisation, dispositif et programme correspondant | |
Ingason et al. | Context-sensitive spelling correction and rich morphology | |
US9652450B1 (en) | Rule-based syntactic approach to claim boundary detection in complex sentences | |
Trang-Trung et al. | Lifelog Moment Retrieval with Self-Attention based Joint Embedding Model. | |
CN111949767A (zh) | 一种文本关键词的查找方法、装置、设备和存储介质 | |
Tongjing et al. | Intercity relationships between 293 Chinese cities quantified based on toponym co-occurrence | |
Feriel et al. | Automatic extraction of spatio-temporal information from Arabic text documents | |
FR2975553A1 (fr) | Aide a la recherche de contenus videos sur un reseau de communication | |
EP4155967A1 (fr) | Procédé d'échanges d'informations sur un objet d'intérêt entre une première et une deuxième entités, dispositif électronique d'échange d'informations et produit programme d'ordinateur associés | |
CN108376178B (zh) | 一种异常访谈记录文本的确定方法及装置 | |
WO2015132342A1 (fr) | Procédé d'analyse d'une pluralité de messages, produit programme d'ordinateur et dispositif associés |
Legal Events
Code | Title | Description |
---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
17P | Request for examination filed | Effective date: 20180214 |
AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
AX | Request for extension of the european patent | Extension state: BA ME |
DAV | Request for validation of the european patent (deleted) | |
DAX | Request for extension of the european patent (deleted) | |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
17Q | First examination report despatched | Effective date: 20210205 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
RAP1 | Party data changed (applicant data changed or rights of an application transferred) | Owner name: DASSAULT SYSTEMES |
APBK | Appeal reference recorded | Free format text: ORIGINAL CODE: EPIDOSNREFNE |
APBN | Date of receipt of notice of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA2E |
APBR | Date of receipt of statement of grounds of appeal recorded | Free format text: ORIGINAL CODE: EPIDOSNNOA3E |
APAF | Appeal reference modified | Free format text: ORIGINAL CODE: EPIDOSCREFNE |