WO2010050844A1 - Method for computerized semantic indexing of natural language text, method for computerized semantic indexing of collection of natural language texts, and machine-readable media

Method for computerized semantic indexing of natural language text, method for computerized semantic indexing of collection of natural language texts, and machine-readable media

Info

Publication number
WO2010050844A1
Authority
WO
WIPO (PCT)
Prior art keywords
named
text
relations
relation
attributes
Prior art date
Application number
PCT/RU2009/000111
Other languages
English (en)
Inventor
Vladimir Fyodorovich Khoroshevsky
Victor Petrovich Klintsov
Original Assignee
Zakrytoe Aktsionernoe Obschestvo "Avicomp Services"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zakrytoe Aktsionernoe Obschestvo "Avicomp Services"
Priority to EP09823885A (EP2350871A1)
Publication of WO2010050844A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • the present invention relates to the information technologies field, namely, to methods of computerized semantic indexing of natural language texts, as well as to machine-readable media comprising respective programs, and could be used for ordering and accumulating information in specified knowledge areas for the purpose of semantic navigation through the documents and document collections, as well as for the highly-precise and quick search of facts and documents relevant to the user's information needs.
  • the EAPO Patent No. 002016 (published on 2001.01.22) describes a method, where unique information blocks are detected in text document fragments and used for subsequent processing and searching.
  • the RU Patent No. 2268488 (published on 2006.01.20) granted on the basis of the PCT Application published as WO 01/06414 discloses the method wherein words, stable phrases, idioms, sentences, and even ideas are coded for the subsequent processing at the numerical level.
  • the RU Patent No. 2273879 (published on 2006.04.10) describes a method wherein morphological and syntactic text analysis is performed, with subsequent indexing of the detected units.
  • a text similarity is determined by text fragments.
  • the disadvantage of all those methods is that they do not take into account the semantic ambiguity of natural language words and expressions.
  • the US Patent No. 6,189,002 discloses a method wherein a text is divided into paragraphs and words that are converted into vectors of ordered elements. Each vector element corresponding to a paragraph is determined by applying a predetermined function to the number of occurrences, in that paragraph, of the word corresponding to that element.
  • the text vector is considered as the semantic profile of the document.
  • this method requires an enormous volume of stored data and does not resolve the semantic ambiguity of words and expressions.
  • the object of the present invention consists in extending the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis thereof and further using the obtained results for building indices, which ensures semantic navigation through documents and document collections, as well as highly precise and quick search for facts and documents relevant to the user's information needs, particularly in reference to texts in highly inflected languages.
  • Fig. 1 depicts the schematic block-diagram explaining the claimed methods
  • Fig. 2 depicts the fragment of the application domain specification
  • Fig. 3 shows the rule schema for extracting the named entities of the type of "Person"
  • Fig. 4 shows the rule schema for extracting the semantically meaningful relationships of the type of "work"
  • Fig. 5 depicts the fragment of the graphical form for representing the results of the text processing
  • Fig. 6 shows the general diagram for storing the results of processing one text
  • Fig. 7 shows the diagram of the left-hand side of the rule for combining the named entities of the type of "Person" .
  • the proposed methods allow for performing effectively the conceptual indexing of the natural language texts both for the subsequent semantic navigation through the documents and document collections and for search purposes.
  • the method of computerized semantic indexing of natural language text according to the first aspect of the present invention and the method of computerized semantic indexing of collection of natural language texts according to the second aspect of the present invention could be implemented in practically any computing environment, e.g., in a personal computer connected to external databases.
  • the steps of performing those methods are illustrated in Fig. 1.
  • a token could be any text object from the following set: words, each consisting of a series of letters and, possibly, hyphens; a series of spaces; punctuation marks; numbers. Sometimes character series such as A300, il50b, etc. are also included in this set. Token separation is always carried out in accordance with rather simple rules, for example, as in the mentioned US Patent Application No. 2007/0073533. In Fig. 1, this step is conventionally marked with the reference number 2.
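The token classes listed above can be illustrated with a minimal sketch. The regular expression, the class names (`alnum`, `word`, `number`, `space`, `punct`) and the function `tokenize` are assumptions made for illustration; the patent only names the kinds of tokens and refers to "rather simple rules" without giving a grammar.

```python
import re

# Hypothetical token grammar: one named group per token class.
# The "alnum" alternative comes first so that mixed series such as A300 or il50b
# are kept together instead of being split into a word and a number.
TOKEN_PATTERN = re.compile(
    r"(?P<alnum>[^\W\d_]+\d\w*|\d+[^\W\d_]\w*)"   # letters and digits mixed, e.g. A300
    r"|(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"        # letters, possibly with hyphens
    r"|(?P<number>\d+(?:[.,]\d+)*)"               # numbers, e.g. 26.06.08
    r"|(?P<space>\s+)"                            # a series of spaces
    r"|(?P<punct>[^\w\s]|_)"                      # any single punctuation mark
)


def tokenize(text):
    """Return a list of (token_class, token_text) pairs covering the whole text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]
```

For instance, `tokenize("Kiev, A300")` yields a word, a punctuation mark, a space series and a mixed alphanumeric token, in that order. Keeping the space series as tokens means the original text can be restored by concatenating the token texts.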
  • tokens are considered as the first level elementary units.
  • respective second level elementary units named hereinafter morphs are formed based on the morphological analysis.
  • its normalized word form is identified.
  • the normalized word form will be «H ⁇ TH»
  • the normalized word form will be « ⁇ pacHBoro»
  • the normalized word form will be « ⁇ pacnBBiH»
  • the normalized word form will be «c ⁇ eHa».
  • a part of speech to which this word relates and its morphological characteristics are indicated. Of course, for various parts of speech those characteristics differ.
  • RAM random access memory
  • the next step, conventionally marked with the reference number 3 in Fig. 1, consists in identifying stable phrases in the set of derived elementary units of the first two levels, tokens and morphs. This is performed by converting the elementary units, i.e., tokens and morphs, into series that are compared with the series of normalized words and their characteristics in the dictionaries stored in advance in the databases, where words are listed with specification of the grammatical associations between them. Whenever a series being compared coincides with the corresponding dictionary series, that series is recognized as a stable phrase and stored as such in the database as a third level elementary unit.
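The comparison of unit series with dictionary series can be sketched as a longest-match lookup over normalized word forms. This is a simplified illustration: the function name `find_stable_phrases`, the `max_len` limit and the representation of the dictionary as a set of tuples are assumptions, and the sketch ignores the morphological characteristics and grammatical associations that the described dictionaries also carry.

```python
def find_stable_phrases(lemmas, phrase_dict, max_len=4):
    """Scan a list of normalized word forms and return (start, end, phrase) spans.

    phrase_dict is a set of tuples of normalized words (a stand-in for the
    stored dictionaries). At each position the longest matching series wins.
    """
    found = []
    i = 0
    while i < len(lemmas):
        match = None
        # Try the longest candidate first; a stable phrase has at least two words.
        for n in range(min(max_len, len(lemmas) - i), 1, -1):
            candidate = tuple(lemmas[i:i + n])
            if candidate in phrase_dict:
                match = (i, i + n, candidate)
                break
        if match:
            found.append(match)
            i = match[1]      # continue after the recognized phrase
        else:
            i += 1
    return found
```

Each returned span would then be stored as a third level elementary unit alongside the tokens and morphs it covers.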
  • sentences corresponding to the portions of the text being indexed are formed.
  • these are usually real sentences ending with a period, but in some cases it is appropriate to interpret as a sentence some part of a usual sentence, say, an isolated element of an enumeration. Therefore, this step can produce sentences that do not always coincide with the sentences of the text being indexed in the usual sense.
  • this analysis is conventionally divided into steps marked with the reference numbers 5 to 11.
  • Said multistage semantic-syntactic analysis is carried out by addressing the linguistic and heuristic rules formed in the database in the predetermined linguistic environment.
  • Such an environment could be, for example, the linguistic environment mentioned in the above RU Patent No. 2242048, or the environment disclosed in said US Patent Application No. 2007/0073533, or any other linguistic environment defining respective rules that make it possible to eliminate the syntactic and semantic ambiguity of the words and expressions of a real text.
  • the linguistic and heuristic rules in the chosen environments are hereinafter referred to as rules.
  • the semantically meaningful objects hereinafter referred to as named entities (the reference number 5 in Fig. 1), and the attributes thereof (the reference numbers 7 and 9 in Fig. 1) are identified.
  • the identification of the named entities that are considered as the fourth level elementary units is carried out in the sentence in a set of elementary units of the first, second, and/or third levels.
  • the morphological attributes are formed for every named entity using said rules from the morphological attributes of those elementary units of the second and/or third levels (i.e., morphs and/or stable phrases) which constitute this named entity.
  • the semantic attributes are formed for every named entity using said rules from the semantic attributes of the elementary units of the second and/or third levels which constitute this named entity.
  • the step of forming said attributes is conventionally marked with the reference number 7.
  • every named entity is assigned a respective type from the application ontology according to the topics of the application domain to which the text being indexed relates.
  • the application ontology means, in this case, the specification of the particular application domain, which is stored in the respective database.
  • the corresponding anaphoric reference considered as the fifth level elementary unit (if any) is determined.
  • Every identified named entity is stored in the respective memory together with the type assigned thereto and morphologic and semantic attributes determined thereto.
  • the anaphoric reference is stored together with the type and attributes of the named entity which is the antecedent of that anaphoric reference, as well as with the indication of the co-reference between that named entity and the anaphoric reference thereof.
  • the semantically meaningful relations between the named entities hereinafter referred to as named relations are determined based on the elementary units of the first, second, third, fourth and/or fifth levels using said rules (the step 6).
  • the named relations can relate the named entities within both one sentence and the entire text being indexed.
  • the morphological attributes are determined for every named relation using said rules from the second level elementary units (i.e., morphs) constituting this relation, as well as the semantic attributes from the elementary units of the first, second, third and/or fourth levels, constituting this relation.
  • the respective type is assigned to every named relation from the application ontology stored in the database according to the topics of the application domain to which the text being indexed relates. After that, every named relation is stored in the respective memory together with the type assigned thereto and morphologic and semantic attributes determined thereto.
  • the stored named entities and named relations are used for forming the triples.
  • a set of triples of three types is formed within the text being indexed for each of the identified named relations relating certain named entities.
  • the single first type triple corresponds to the relation established by the named relation between two named entities.
  • Each of the set of the second type triples corresponds to a value of particular attribute of one of those entities, and each of the set of the third type triples corresponds to a value of particular attribute of the named relation itself.
  • the first type triple could be represented (depicted) as $O_i \to R_{ij} \to O_j$.
  • Each of the set of the second type triples could be represented as $O_i \to A_{im} \to V_{im}$ or $O_j \to A_{jn} \to V_{jn}$, where $A_{im}$ and $A_{jn}$ are respective attributes, and $V_{im}$ and $V_{jn}$ are, respectively, values of those attributes.
  • each of the set of the third type triples could be represented as $R_{ij} \to A_{ijk} \to V_{ijp}$, where $A_{ijk}$ is a respective attribute, and $V_{ijp}$ is a value of that attribute.
  • the indices i, j, k, m, n, and p are integers.
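The three types of triples can be illustrated with a minimal sketch. The classes `Entity` and `Relation`, the function `triples_for` and the sample identifiers are hypothetical; the patent defines only the shape of the triples, not a data model.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A named entity O with its attribute/value pairs (hypothetical model)."""
    id: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A named relation R relating a source entity to a target entity."""
    id: str
    source: Entity
    target: Entity
    attributes: dict = field(default_factory=dict)


def triples_for(relation):
    """Return the first, second and third type triples for one named relation."""
    # First type: the single triple O_i -> R_ij -> O_j.
    first = [(relation.source.id, relation.id, relation.target.id)]
    # Second type: one triple O -> A -> V per attribute of either entity.
    second = [
        (entity.id, attr, value)
        for entity in (relation.source, relation.target)
        for attr, value in entity.attributes.items()
    ]
    # Third type: one triple R_ij -> A -> V per attribute of the relation itself.
    third = [(relation.id, attr, value) for attr, value in relation.attributes.items()]
    return first, second, third
```

With made-up sample values, a relation `R12` of type "work" between a "Person" entity `O1` and an "Organization" entity `O2` gives one first type triple `("O1", "R12", "O2")`, one second type triple per entity attribute, and one third type triple per relation attribute.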
  • the triples formed at the step 12 and indices obtained at the step 13, together with the reference to the initial text from which those triples have been formed, are stored in the database (the step 15 in Fig. 1; the step 14 is omitted in this case).
  • a convolution is performed (not shown) of the objects related by co-reference relations into a single object whose set of attributes is the combination of the attributes of all objects interrelated by the co-reference relations. This is done in order to reduce the memory volume required in the database for storing such objects, as well as to integrate under one object the information obtained from the entire text.
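The convolution of co-referent objects can be sketched as grouping objects connected by co-reference pairs and taking the union of their attributes. The function `convolve`, the union-find grouping and the data layout (attribute name mapped to a set of values) are assumptions chosen for illustration, not the procedure fixed by the patent.

```python
def convolve(objects, same_pairs):
    """Merge objects linked by co-reference into single objects.

    objects:    dict mapping object id -> dict of attribute name -> set of values
    same_pairs: iterable of (id_a, id_b) pairs related by co-reference
    Returns a dict mapping a representative id -> combined attributes.
    """
    parent = {oid: oid for oid in objects}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    merged = {}
    for oid, attrs in objects.items():
        target = merged.setdefault(find(oid), {})
        for name, values in attrs.items():
            # Sets give the combined attribute set without repetitions.
            target.setdefault(name, set()).update(values)
    return merged
```

Chains of any length collapse into one object, so a pronoun, a surname and a full name that co-refer end up as one record holding all attributes found in the text.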
  • the method of computerized semantic indexing of a collection of natural language texts according to the second aspect of the present invention is carried out exactly as the already discussed method of computerized semantic indexing of natural language text according to the first aspect of the present invention, but in this case, after the step 13 of indexing and prior to the step 15 of storing in the database, one more step is performed. At this step, marked with the reference number 14 in Fig. 1 and performed substantially simultaneously with the step 15, the following is carried out when storing in the database the formed triples and obtained semantic indices of the succeeding text.
  • the newly derived named entities and named relations are compared with the named entities and named relations already existing in the database using the linguistic and heuristic rules in the predetermined linguistic environment that are formed in the database.
  • the duplicated information is not stored in the database, and respective named entities and/or named relations are supplemented with references to the succeeding texts where they are present and references to the text fragments within each of succeeding texts from which they are derived.
  • the step of indexing the text collection proceeds similarly to the indexing of the first text of this collection (or the first text indexed by this method), which makes it possible to simplify significantly the entire indexing procedure, reduce the required memory volume, and integrate the information obtained from different texts within a single object.
  • a representative example of such text is the following message: «L ⁇ eHmpcuibHbi ⁇ ⁇ edep ⁇ ibHbi ⁇ 3a 26.06.08 K) ⁇ eH ⁇ o nopymui TuMotueHKo ⁇ u ⁇ edamb y Uymuua ijeny ⁇ a zo3 26.06.08 14:00 Kue ⁇ , HiOHb 26 (Ho ⁇ bi ⁇ Pezuo ⁇ , Muxaun P ⁇ o ⁇ ) — IIpe3udeHm V ⁇ pauHbi
  • the previously created application domain specification is used, within which the text collection processing and semantic index construction will be carried out.
  • a fragment of such specification is depicted in Fig. 2.
  • Such specifications are developed by human experts, who record, based on their experience and knowledge, a list of object types and a list of typical relations therebetween essential for this application domain.
  • the main types of objects are “Person”, “Organization”, “Location”, and some others.
  • the human experts also build in advance a set of rules, each rule containing, in the left-hand side, a template for searching examples of objects and/or examples of relations therebetween, and in the right-hand side, operators for fixing in the text the examples of objects and/or examples of relations therebetween determined in accordance with the template.
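The two-sided structure of such a rule can be sketched as a template paired with an action. This is only an analogy: here the left-hand side is a regular expression over raw text and the right-hand side is a Python function, whereas the described rules operate on annotated elementary units in a linguistic environment. The class `Rule`, the function `apply_rules` and the sample "Person" rule are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    template: re.Pattern   # left-hand side: what to search for
    action: Callable       # right-hand side: how to fix the found example in the text


def apply_rules(text, rules):
    """Apply every rule to the text and collect the annotations it produces."""
    annotations = []
    for rule in rules:
        for match in rule.template.finditer(text):
            annotations.append(rule.action(match))
    return annotations


# A toy rule: a title followed by one or two capitalized words is a "Person".
person_rule = Rule(
    name="person_after_title",
    template=re.compile(r"\b(?:Mr|Ms|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)?)"),
    action=lambda m: {"type": "Person", "text": m.group(1), "span": m.span(1)},
)
```

Calling `apply_rules("Dr. Sergey Lavrov visited Kiev.", [person_rule])` returns one annotation of type "Person" with the matched name and its character span, which plays the role of the example fixed in the text by the right-hand side.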
  • the specific data corresponding to the domain specification are derived in the texts being processed.
  • common and special lexicons are used.
  • the step of segmenting the text into elementary units, tokens, is performed together with the morphological analysis of the token-words (reference number 2 in Fig. 1).
  • the initial text is transformed into a set of tokens and morphs that are represented in the Table 1 and Table 2, respectively.
  • the step is carried out for deriving the stable phrases (lookups) using the common and special lexicons (reference number 3 in Fig. 1).
  • the initial text is supplemented, besides the first and second level elementary units, with a set of the third level elementary units, lookups.
  • the fragment of this set for the above example is represented in the Table 3.
  • the text being processed is fragmented into sentences (reference number 4 in Fig. 1).
  • the pluralities formed at the above steps are supplemented with a set of sentences, represented in the Table 4.
  • the text being processed will be segmented into the sentences, each of which is marked with a plurality of annotations of the first, second and third level elementary units.
  • the step of deriving the named entities is carried out at the set of the elementary units of the first, second and/or third levels using said rules.
  • the named entities «HOBBIH PeraoH» ["New Region”], «Cepret JlaBpoB» ["Sergey Lavrov”], «Y ⁇ paHHa» ["Ukraine”]
  • the pronouns that could be anaphoric references to the corresponding named entities are determined, and for those pronouns that really are such references, the co-reference between the respective named entity and its anaphoric reference (the fifth level elementary unit) is fixed.
  • the obtained set of the anaphoric references is represented in the Table 6.
  • the semantically meaningful relations between the named entities are determined using the rules.
  • the initial text will be marked with the set of annotations corresponding to the named entities with the attributes thereof and the named relations with the attributes thereof between the named entities.
  • the graphical representation of the text processing results is shown in Fig. 5.
  • the next step, marked with the reference number 12 in Fig. 1, is a technical one and is carried out for forming the triples corresponding to the stored named entities and named relations.
  • the fragment of the set of such triples for the example under consideration is represented in the Table 8.
  • the formed set of triples contains the initial data for the semantic indexing of the text processed at the previous steps.
  • the semantic index is built as follows: first, from the set of the triples obtained at the previous step, the triple subsets are formed, each of which corresponds to one named entity with the attributes thereof, and every obtained triple subset is used as an entry for one of the conventional indexers, for example, the well-known, freeware indexer Lucene, the indexer of the Yandex search machine, the Google indexer, or any other indexer, at whose output an index unique to the given triple subset is obtained.
  • the similar operation series is performed for all subsets of triples corresponding to the pairs of the kind "named entity — named relation” and to the triples of the kind of "named entity - named relation - named entity” taking into account the attributes of the respective named entities and/or named relations, thereby obtaining a set of the corresponding unique indices which constitute, in the aggregate, the semantic index of the text.
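The grouping of triples into subsets and the derivation of one index per subset can be sketched as follows. A SHA-1 digest of a canonical serialization stands in for the output of a conventional indexer; this is an assumption made to keep the example self-contained and is not how Lucene or any other named indexer works. The function names and the key layout are also hypothetical.

```python
import hashlib
from collections import defaultdict


def index_key(triples):
    """Stand-in for an indexer: a digest that depends only on the set of triples."""
    canonical = "\n".join("\t".join(t) for t in sorted(triples))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def build_semantic_index(relation_triples, attribute_triples):
    """Index entities, entity/relation pairs and full triples with their attributes.

    relation_triples:  list of (entity_id, relation_id, entity_id) string tuples
    attribute_triples: list of (entity_or_relation_id, attribute, value) string tuples
    """
    attrs = defaultdict(list)
    for subject, attribute, value in attribute_triples:
        attrs[subject].append((subject, attribute, value))

    index = {}
    for src, rel, dst in relation_triples:
        # One index per named entity with its attributes.
        for entity in (src, dst):
            index[("entity", entity)] = index_key(attrs[entity])
        # One index per "named entity / named relation" pair.
        index[("pair", src, rel)] = index_key(attrs[src] + attrs[rel])
        # One index per "named entity / named relation / named entity" triple.
        index[("triple", src, rel, dst)] = index_key(
            [(src, rel, dst)] + attrs[src] + attrs[rel] + attrs[dst]
        )
    return index
```

Because the serialization is sorted before hashing, the same subset of triples gives the same index regardless of the order in which the triples were produced, which is what later comparison of indices across texts relies on.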
  • the fragment of the semantic index for the example under consideration is represented in the Tables 9 to 11.
  • a set of continuous chains of the triples for the relation "The_same" is formed.
  • the check is performed whether the set of continuous chains of the triples obtained at the previous step is empty. If that set is not empty, then, sequentially, at the next steps (53-56), the set of objects for the next chain is formed (53), this set is convolved into a single object (54) having the combined set of attributes (without repetitions), the obtained single object is stored together with its attributes (55), and the set of the processed objects of that chain is removed (56). But if at the step 52 the set of the triple chains turns out to be empty (initially or as a result of performing the steps 53 to 55), then, at the step
  • the formed overall set of the triples is supplemented with the semantic indices and references to the initial text; after which, at the step 59, the supplemented set is stored in the database.
  • the processing of every subsequent text including its semantic index constructing is carried out by performing just the same steps as for the single text.
  • after the step 13 of indexing and prior to the step 15 of storing in the database, one more step marked with the reference number 14 in Fig. 1 is carried out, the step of combining the results of processing the succeeding text with the results of processing the previous texts already stored in the database, which step is carried out as follows.
  • the named objects and named relations newly derived in the succeeding text being indexed are compared with the named objects and named relations already existing in the database by checking the coincidence of their semantic indices, and, in the case of a positive result of such comparison, the respective objects and relations are excluded from further processing, while the object and/or relation already existing in the database is supplemented with the reference to that text and that fragment of that text where the object and/or relation excluded from further processing was identified.
  • the similarity between the new objects and/or relations and those objects and/or relations that already exist in the database is identified using the linguistic and heuristic rules formed in advance in the database, and in the case of a positive result, the object and/or relation descriptions already existing in the database are widened with the new data, after which the existing semantic indices are reconstituted and the new semantic indices are added as secondary ones to the already existing indices, and, moreover, in the object and/or relation already existing in the database, the reference is stored to that text and that fragment of that text where the new objects and/or relations are identified, and then the respective objects and relations are excluded from further processing. Otherwise, the newly identified named objects and named relations together with their semantic indices are added to the database.
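The combining step can be sketched as a three-way decision for every newly derived object: exact coincidence of semantic indices, similarity under a rule, or a genuinely new object. The function `merge_into_database`, the record layout and the `similar` callback are assumptions for illustration; the sketch records the new index as a secondary one but does not reconstitute the existing indices.

```python
def merge_into_database(database, new_objects, text_id, similar=None):
    """Merge objects derived from one text into the database of previous texts.

    database:    dict mapping semantic index -> record
    new_objects: list of dicts with keys "index", "attributes" (name -> set), "span"
    similar:     optional callback (record, obj) -> bool standing in for the
                 linguistic and heuristic similarity rules
    """
    for obj in new_objects:
        # Case 1: the semantic index coincides with one already in the database.
        record = database.get(obj["index"])
        if record is None and similar is not None:
            # Case 2: no exact coincidence, but a similarity rule fires.
            record = next((r for r in database.values() if similar(r, obj)), None)
            if record is not None:
                for name, values in obj["attributes"].items():
                    record["attributes"].setdefault(name, set()).update(values)
                record["secondary_indices"].add(obj["index"])
        if record is None:
            # Case 3: a new object is added together with its semantic index.
            record = {
                "attributes": {k: set(v) for k, v in obj["attributes"].items()},
                "secondary_indices": set(),
                "refs": [],
            }
            database[obj["index"]] = record
        # In every case, remember the text and fragment where the object was found.
        record["refs"].append((text_id, obj["span"]))
    return database
```

In all three cases the record ends up with a reference to the text and fragment, so a single stored object accumulates pointers to every text in the collection that mentions it.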
  • the object will be identified, particularly, the object «Y ⁇ paHHa» ["Ukraine”], which semantic index fully coincides with the semantic index of the object «y ⁇ paHHa» ["Ukraine”] already existed in the database, and, moreover, the similarity will be identified (by applying the rule which diagram is shown in Fig.
  • the present invention provides for extending the set of methods for indexing natural language texts by employing techniques of computerized linguistic analysis thereof and subsequently using the obtained results for building indices; the main difference of these methods from the known methods of indexing consists in indexing semantically meaningful concepts and relations therebetween rather than key words and lookups, which provides for semantic navigation through documents and document collections, as well as highly precise and quick search for facts and documents, especially in reference to texts in highly inflected languages.
  • Table 1 Results of tokenizing the example text

Abstract

The present invention relates to the field of information technologies, namely, to methods of computerized semantic indexing of natural language texts. The use of the present invention extends the set of methods for indexing natural language texts by employing computerized linguistic analysis techniques and then using the obtained results to build indices, which ensures semantic navigation through documents and document collections relevant to the user's information needs, particularly in reference to texts in highly inflected languages. The method of computerized semantic indexing of natural language text comprises the following steps: segmenting the text in electronic form into tokens; identifying stable phrases; forming sentences; by addressing the linguistic and heuristic rules formed in the database in the predetermined linguistic environment, identifying semantically meaningful objects (named entities) and the semantically meaningful relations between them (named relations); for every named relation, forming a set of triples, wherein a single triple of a first type corresponds to the relation established by the named relation between two named entities, each of the set of triples of a second type corresponds to a value of a particular attribute of one of those entities, and each of the set of triples of a third type corresponds to a value of a particular attribute of the named relation itself; within the set of formed triples, indexing separately all named entities related by the named relations, all pairs of the kind "named entity - named relation", and all triples of the kind "named entity - named relation - named entity", taking into account the attributes of the respective named entities and/or named relations; and storing in the database the formed triples and the obtained indices together with the reference to the initial text from which those triples have been formed.
PCT/RU2009/000111 2008-10-29 2009-03-06 Method for computerized semantic indexing of natural language text, method for computerized semantic indexing of collection of natural language texts, and machine-readable media WO2010050844A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP09823885A EP2350871A1 (fr) 2008-10-29 2009-03-06 Method for computerized semantic indexing of natural language text, method for computerized semantic indexing of collection of natural language texts, and machine-readable media

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2008142648 2008-10-29
RU2008142648/12A RU2399959C2 (ru) 2008-10-29 2008-10-29 Method for automated processing of natural language text by semantic indexing thereof, method for automated processing of a collection of natural language texts by semantic indexing thereof, and machine-readable media

Publications (1)

Publication Number Publication Date
WO2010050844A1 (fr) 2010-05-06

Family

ID=42129031

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2009/000111 WO2010050844A1 (fr) 2008-10-29 2009-03-06 Method for computerized semantic indexing of natural language text, method for computerized semantic indexing of collection of natural language texts, and machine-readable media

Country Status (3)

Country Link
EP (1) EP2350871A1 (fr)
RU (1) RU2399959C2 (fr)
WO (1) WO2010050844A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124964A1 (en) * 2011-11-10 2013-05-16 Microsoft Corporation Enrichment of named entities in documents via contextual attribute ranking
US8997008B2 (en) 2012-07-17 2015-03-31 Pelicans Networks Ltd. System and method for searching through a graphic user interface

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2452002C1 (ru) * 2011-03-04 2012-05-27 Сергей Иванович Колесник Method for creating a multilingual automatic index of electronic digital sailing directions
RU2518946C1 (ru) * 2012-11-27 2014-06-10 Александр Александрович Харламов Method for automated semantic indexing of natural language text
US9772995B2 (en) 2012-12-27 2017-09-26 Abbyy Development Llc Finding an appropriate meaning of an entry in a text
US20140188456A1 (en) * 2012-12-27 2014-07-03 Abbyy Development Llc Dictionary Markup System and Method
RU2538303C1 (ru) * 2013-08-07 2015-01-10 Александр Александрович Харламов Method for automated semantic comparison of natural language texts
RU2538304C1 (ru) * 2013-08-22 2015-01-10 Александр Александрович Харламов Method for automated semantic classification of natural language texts
RU2565473C2 (ru) * 2013-11-01 2015-10-20 Федеральное государственное бюджетное образовательное учреждение высшего профессионального образования "Российский государственный гуманитарный университет" (РГГУ) Method for building a text corpus based on Internet forums
RU2665239C2 (ru) * 2014-01-15 2018-08-28 Общество с ограниченной ответственностью "Аби Продакшн" Automatic extraction of named entities from text
RU2544739C1 (ru) * 2014-03-25 2015-03-20 Игорь Петрович Рогачев Method for transforming a structured data array
EA201700031A1 (ru) * 2014-06-27 2017-05-31 Игорь Петрович РОГАЧЕВ Method for preliminary transformation of a source data array, method for forming a map of links between components of parts of logical constructions of the transformed structured source data array, methods for searching the transformed data array using the component link map, and systems and devices for implementing these methods
RU2618374C1 (ru) * 2015-11-05 2017-05-03 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Identification of collocations in natural language texts
CN107402912B (zh) * 2016-05-19 2019-12-31 北京京东尚科信息技术有限公司 Method and device for parsing semantics
RU2619193C1 (ru) * 2016-06-17 2017-05-12 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Multi-stage recognition of named entities in natural language texts based on morphological and semantic features
RU2646386C1 (ru) * 2016-12-07 2018-03-02 Общество с ограниченной ответственностью "Аби Продакшн" Information extraction using alternative variants of semantic-syntactic parsing
CN106933809A (zh) * 2017-03-27 2017-07-07 三角兽(北京)科技有限公司 Information processing device and information processing method
CN107203511B (zh) * 2017-05-27 2020-07-17 中国矿业大学 Method for recognizing named entities in web text based on neural network probabilistic disambiguation
RU2713568C1 (ru) * 2019-11-10 2020-02-05 Игорь Петрович Рогачев Method for transforming a structured data array
RU2717718C1 (ru) * 2019-11-10 2020-03-25 Игорь Петрович Рогачев Method for transforming a structured data array containing simple propositions
RU2717719C1 (ru) * 2019-11-10 2020-03-25 Игорь Петрович Рогачев Method for forming a data structure containing simple propositions

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2273879C2 (en) * 2002-05-28 2006-04-10 Владимир Владимирович Насыпный Method for synthesizing a self-learning system for extracting knowledge from text documents for search engines
US7191115B2 (en) * 2001-06-20 2007-03-13 Microsoft Corporation Statistical method and apparatus for learning translation relationships among words
US20070073533A1 (en) * 2005-09-23 2007-03-29 Fuji Xerox Co., Ltd. Systems and methods for structural indexing of natural language text
US7305336B2 (en) * 2002-08-30 2007-12-04 Fuji Xerox Co., Ltd. System and method for summarization combining natural language generation with structural analysis
US7346493B2 (en) * 2003-03-25 2008-03-18 Microsoft Corporation Linguistically informed statistical models of constituent structure for ordering in sentence realization for a natural language generation system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130124964A1 (en) * 2011-11-10 2013-05-16 Microsoft Corporation Enrichment of named entities in documents via contextual attribute ranking
US9552352B2 (en) * 2011-11-10 2017-01-24 Microsoft Technology Licensing, Llc Enrichment of named entities in documents via contextual attribute ranking
US8997008B2 (en) 2012-07-17 2015-03-31 Pelicans Networks Ltd. System and method for searching through a graphic user interface

Also Published As

Publication number Publication date
RU2399959C2 (ru) 2010-09-20
EP2350871A1 (fr) 2011-08-03
RU2008142648A (ru) 2010-05-10

Similar Documents

Publication Publication Date Title
WO2010050844A1 (en) Method for computerized semantic indexing of natural language text, method for computerized semantic indexing of a collection of natural language texts, and machine-readable media
Turmo et al. Adaptive information extraction
Bikel et al. An algorithm that learns what's in a name
Faure et al. First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX
US8374844B2 (en) Hybrid system for named entity resolution
US4868750A (en) Collocational grammar system
US20100332217A1 (en) Method for text improvement via linguistic abstractions
Neumann et al. A shallow text processing core engine
US20200311345A1 (en) System and method for language-independent contextual embedding
Feng et al. Probabilistic techniques for phrase extraction
Constant et al. Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields
Ferreira et al. A new sentence similarity assessment measure based on a three-layer sentence representation
Marciniak et al. Terminology extraction from medical texts in Polish
Zhang et al. Natural language generation and deep learning for intelligent building codes
Chen et al. Automated extraction of tree-adjoining grammars from treebanks
RU2563148C2 System and method for semantic search
López-Hernández et al. Automatic spelling detection and correction in the medical domain: A systematic literature review
Spasić et al. Head to head: Semantic similarity of multi–word terms
Agbele et al. Context-aware stemming algorithm for semantically related root words
Panahandeh et al. Correction of spaces in Persian sentences for tokenization
Jafar Tafreshi et al. A novel approach to conditional random field-based named entity recognition using Persian specific features
Mekki et al. Tokenization of Tunisian Arabic: a comparison between three Machine Learning models
Zeller Detecting ambiguity in statutory texts
Bindu et al. Design and development of a named entity based question answering system for Malayalam language
DeVille et al. Text as Data: Computational Methods of Understanding Written Expression Using SAS

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 09823885; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
WWE WIPO information: entry into national phase. Ref document number: 2009823885; Country of ref document: EP