RU2008142648A

RU2008142648A - METHOD FOR AUTOMATED SEMANTIC INDEXING OF A NATURAL-LANGUAGE TEXT, METHOD FOR AUTOMATED SEMANTIC INDEXING OF A COLLECTION OF NATURAL-LANGUAGE TEXTS, AND MACHINE-READABLE MEDIA

Info

Publication number: RU2008142648A
Application number: RU2008142648/12A
Authority: RU
Inventors: Владимир Фёдорович Хорошевский (RU); Виктор Петрович Клинцов (RU)
Original assignee: Закрытое акционерное общество "Авикомп Сервисез" (RU)
Priority date: 2008-10-29
Filing date: 2008-10-29
Publication date: 2010-05-10
Also published as: RU2399959C2; EP2350871A1; WO2010050844A1

Abstract

1. A method of automated semantic indexing of a natural-language text, comprising the steps of: presenting the text to be indexed in electronic form for subsequent automatic and (or) automated processing; segmenting the text in electronic form into elementary units, hereinafter referred to as tokens; identifying set phrases in the text in the course of linguistic analysis; forming sentences corresponding to sections of the text; identifying, in each sentence with identified phrases, semantically significant objects, hereinafter referred to as named entities, and semantically significant relations between named entities, hereinafter referred to as named relations, in the course of multi-stage semantic-syntactic analysis that consults linguistic and heuristic rules, hereinafter referred to as rules, formulated in a predefined linguistic environment and stored in a database; forming, within the indexed text, for each identified named relation linking particular named entities, a set of triads, in which a single triad of the first type corresponds to the link that the named relation establishes between two named entities, each triad of the second type corresponds to the value of a specific attribute of one of those entities, and each triad of the third type corresponds to the value of a specific attribute of the named relation itself; indexing, over the set of triads formed, all named objects linked by named relations individually, all pairs of the form “named entity - name…

Claims

1. A method of automated semantic indexing of a natural-language text, comprising the steps of:

presenting the text to be indexed in electronic form for subsequent automatic and (or) automated processing;

segmenting the text in electronic form into elementary units, hereinafter referred to as tokens;

identifying set phrases in the text in the course of linguistic analysis;

forming sentences corresponding to sections of the text;

identifying, in each sentence with identified phrases, semantically significant objects, hereinafter referred to as named entities, and semantically significant relations between named entities, hereinafter referred to as named relations, in the course of multi-stage semantic-syntactic analysis that consults linguistic and heuristic rules, hereinafter referred to as rules, formulated in a predefined linguistic environment and stored in a database;

forming, within the indexed text, for each identified named relation linking particular named entities, a set of triads, in which a single triad of the first type corresponds to the link that the named relation establishes between two named entities, each triad of the second type corresponds to the value of a specific attribute of one of those entities, and each triad of the third type corresponds to the value of a specific attribute of the named relation itself;

indexing, over the set of triads formed, all named objects linked by named relations individually, all pairs of the form “named entity - named relation”, and all triads of the form “named entity - named relation - named entity”, taking into account the attributes of the corresponding named entities and (or) named relations;

saving the generated triads and the resulting indices in the database together with a reference to the source text from which the triads were formed.
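
The triad-forming and indexing steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not prescribe any data structures, so the `Triad` class, the `build_triads` and `build_index` functions, the `doc_id` source reference and the sample data are all hypothetical. For brevity the index here ignores attributes, whereas the claim indexes "taking into account" the attributes of entities and relations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triad:
    kind: int        # 1 = entity-relation-entity, 2 = entity attribute, 3 = relation attribute
    subject: str
    predicate: str
    obj: str


def build_triads(relation, source, target, entity_attrs, relation_attrs):
    """Form the set of triads for one named relation linking two named entities."""
    triads = [Triad(1, source, relation, target)]           # the single first-type triad
    for entity in (source, target):
        for name, value in entity_attrs.get(entity, {}).items():
            triads.append(Triad(2, entity, name, value))    # one per entity attribute value
    for name, value in relation_attrs.items():
        triads.append(Triad(3, relation, name, value))      # one per relation attribute value
    return triads


def build_index(triads, doc_id):
    """Index single entities, entity-relation pairs and full first-type triads.

    Every index entry maps to the set of source-text references (doc_id).
    """
    index = defaultdict(set)
    for t in triads:
        if t.kind != 1:
            continue  # attribute triads are not indexed in this simplified sketch
        index[("entity", t.subject)].add(doc_id)
        index[("entity", t.obj)].add(doc_id)
        index[("pair", t.subject, t.predicate)].add(doc_id)
        index[("pair", t.obj, t.predicate)].add(doc_id)
        index[("triad", t.subject, t.predicate, t.obj)].add(doc_id)
    return dict(index)


triads = build_triads(
    "works_for", "Ivanov", "Acme",
    entity_attrs={"Ivanov": {"position": "engineer"}},
    relation_attrs={"since": "2008"},
)
index = build_index(triads, "doc-1")
```

A lookup such as `index[("pair", "Ivanov", "works_for")]` then returns the references to every text in which that entity takes part in that relation.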

2. The method according to claim 1, in which the aforementioned tokens, hereinafter referred to as elementary units of the first level, are selected from the group consisting of words in the form of sequences of letters or letters and hyphens, numbers, punctuation marks and sequences of spaces.
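
The four token classes named in claim 2 can be sketched as one regular expression. This is an illustration only: the claim does not say how tokens are recognised, and the names `TOKEN_RE` and `tokenize`, as well as the decision to treat each punctuation mark as a separate token, are assumptions.

```python
import re

# One alternative per token class of claim 2 (group names are hypothetical).
TOKEN_RE = re.compile(
    r"(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"   # letters, or letters joined by hyphens
    r"|(?P<number>\d+)"                      # a run of digits
    r"|(?P<space>\s+)"                       # a sequence of whitespace characters
    r"|(?P<punct>[^\w\s]|_)"                 # any other single character
)


def tokenize(text):
    """Split text into (token_class, token_text) pairs without losing characters."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("text-indexing, 2008")
# [('word', 'text-indexing'), ('punct', ','), ('space', ' '), ('number', '2008')]
```

Because every character falls into exactly one class, joining the token texts reproduces the input, which keeps character offsets usable as references back to the source text.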

3. The method according to claim 1, in which corresponding elementary units of the second level, hereinafter referred to as morphs, are formed for each token representing a word, based on morphological analysis.

4. The method according to claim 1, in which, in the course of said linguistic analysis, when phrases are formed, sequences of elementary units of the first and (or) second levels (i.e. tokens and morphs) in each sentence are transformed into said phrases, hereinafter referred to as elementary units of the third level, by consulting dictionaries stored in the database and taking morphological relationships into account.

5. The method according to claim 1, in which said multi-stage semantic-syntactic analysis comprises the steps of:

identifying said named objects, treated as elementary units of the fourth level, in the sentence over the set of elementary units of the first, second and (or) third levels;

forming, using said rules, morphological attributes for each named entity from the morphological attributes of the elementary units of the second and (or) third levels that make up this named entity;

forming, using said rules, semantic attributes for each named entity from the attributes of the elementary units of the second and (or) third levels that make up this named entity;

assigning to each named entity the corresponding type from the domain ontology, stored in the database, of the subject area to which the indexed text belongs;

storing in memory each named entity together with the type assigned to it and the morphological and semantic attributes found for it.
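
As a rough illustration of the steps of claim 5, the sketch below assembles a fourth-level unit (a named entity) from second-level units (morphs). Everything in it is assumed for the example: the `Morph` and `NamedEntity` classes, the toy `ONTOLOGY` lookup, and the single rule that an entity takes its morphological attributes and its type from its head word. The patent leaves the actual rules and ontology to the database.

```python
from dataclasses import dataclass, field

# Hypothetical toy ontology: lemma of the head word -> entity type.
ONTOLOGY = {"company": "Organization", "university": "Organization", "city": "Location"}


@dataclass
class Morph:
    lemma: str
    features: dict            # e.g. {"pos": "NOUN", "number": "sing"}


@dataclass
class NamedEntity:
    text: str
    type: str
    morph_attrs: dict = field(default_factory=dict)
    semantic_attrs: dict = field(default_factory=dict)


def make_entity(morphs, head_index):
    """Build a named entity from its component morphs (assumed head-word rule)."""
    head = morphs[head_index]
    modifiers = [m.lemma for i, m in enumerate(morphs) if i != head_index]
    return NamedEntity(
        text=" ".join(m.lemma for m in morphs),
        type=ONTOLOGY.get(head.lemma, "Unknown"),     # type assigned from the ontology
        morph_attrs=dict(head.features),              # inherited from the head unit
        semantic_attrs={"head": head.lemma, "modifiers": modifiers},
    )


entity = make_entity(
    [Morph("state", {"pos": "ADJ"}), Morph("university", {"pos": "NOUN", "number": "sing"})],
    head_index=1,
)
```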

6. The method according to claim 5, in which, for each named entity with its assigned type, the corresponding anaphoric reference is treated as an elementary unit of the fifth level and is stored in the database together with the type and attributes of the named entity that is the antecedent of this anaphoric reference, indicating the identity by reference between this named entity and its anaphoric reference;

said named relations, treated as elementary units of the sixth level, are identified using said rules on the basis of elementary units of the first, second, third, fourth and (or) fifth levels;

morphological attributes are found, using said rules, for each named relation from the elementary units of the second level that make up this named relation;

semantic attributes are found, using said rules, for each named relation from elementary units of the first, second, third and (or) fourth levels;

each named relation is assigned the corresponding type from the domain ontology, stored in the database, of the subject area to which the indexed text belongs;

each named relation is stored in memory together with the type assigned to it and the morphological and semantic attributes found for it.

7. The method according to claim 1, in which, before the generated triads and the resulting indices are stored in the database, each group of objects linked by relations of identity by reference is collapsed into a single object whose set of attributes is the union of the attributes of the objects in that group.
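
Collapsing groups of objects linked by identity by reference, as in claim 7, is in effect a connected-components problem. The sketch below uses a union-find structure, which is a swapped-in standard technique and not something the claim names. The function name `merge_coreferent`, the object identifiers and the choice to keep every attribute value in a set are assumptions.

```python
def merge_coreferent(objects, identity_links):
    """Collapse objects linked by identity relations into single objects.

    objects: {object_id: {attribute: value}}
    identity_links: iterable of (object_id_a, object_id_b) pairs
    Returns {representative_id: {attribute: set_of_values}}.
    """
    parent = {oid: oid for oid in objects}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in identity_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The smaller id becomes the representative of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    merged = {}
    for oid, attrs in objects.items():
        group = merged.setdefault(find(oid), {})
        for name, value in attrs.items():
            group.setdefault(name, set()).add(value)   # union of attribute values
    return merged


objects = {
    "e1": {"name": "Ivan Petrov"},
    "e2": {"surface": "he"},           # anaphoric reference to e1
    "e3": {"name": "Moscow"},
}
merged = merge_coreferent(objects, [("e1", "e2")])
```

Here `e1` and `e2` collapse into one object that carries both attributes, while `e3` stays separate.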

8. A method for automated semantic indexing of a collection of natural-language texts, comprising all the steps of the method according to claim 1 as applied to each successive text to be indexed, in which, when the generated triads and the resulting indices of the successive text are stored in the database, the newly identified named objects and named relations are compared with the named objects and named relations already present in the database, using the linguistic and heuristic rules formulated in a predefined linguistic environment and stored in the database, and, if identical named objects and (or) named relations are found, the duplicate information is not stored in the database; instead, references to the successive texts in which they occur, and references to the text fragments within each of those texts from which they were extracted, are added to the corresponding named objects and (or) named relations.
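
The deduplication behaviour of claim 8 can be sketched as a store that keeps each item once and only accumulates source references on repeats. The claim compares items using the linguistic and heuristic rules; this sketch substitutes a much cruder stand-in, exact matching on a case-folded key. The names `CollectionIndex`, `normalize` and `add`, and the `(start, end)` fragment spans, are hypothetical.

```python
def normalize(triad):
    """Crude stand-in for rule-based comparison: trim and case-fold each part."""
    return tuple(part.strip().casefold() for part in triad)


class CollectionIndex:
    """Stores each triad once, with references to every text and fragment it came from."""

    def __init__(self):
        self.sources = {}    # normalized triad -> {doc_id: [fragment spans]}

    def add(self, triad, doc_id, span):
        """Record one occurrence; return True only if the triad was not stored before."""
        key = normalize(triad)
        is_new = key not in self.sources
        self.sources.setdefault(key, {}).setdefault(doc_id, []).append(span)
        return is_new


collection = CollectionIndex()
first = collection.add(("Ivanov", "works_for", "Acme"), "doc-1", (0, 24))
second = collection.add(("ivanov ", "works_for", "ACME"), "doc-2", (10, 31))
```

The second call stores no new triad; it only attaches `doc-2` and its fragment span to the existing entry.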

9. The method of claim 8, in which the aforementioned tokens, hereinafter referred to as elementary units of the first level, are selected from the group consisting of words in the form of sequences of letters or letters and hyphens, numbers, punctuation marks and sequences of spaces.

10. The method of claim 8, in which the corresponding elementary units of the second level, hereinafter referred to as morphs, are formed for each token representing a word, based on morphological analysis.

11. The method according to claim 8, in which, in the course of said linguistic analysis, when phrases are formed, sequences of elementary units of the first and (or) second levels (i.e. tokens and morphs) in each sentence are transformed into said phrases, hereinafter referred to as elementary units of the third level, by consulting dictionaries stored in the database and taking morphological relationships into account.

12. The method according to claim 8, in which in the process of the aforementioned multi-stage semantic-syntactic analysis, the steps are performed in which:

13. The method according to claim 12, in which, for each named entity with its assigned type, the corresponding anaphoric reference is treated as an elementary unit of the fifth level and is stored in the database together with the type and attributes of the named entity that is the antecedent of this anaphoric reference, indicating the identity by reference between this named entity and its anaphoric reference;

14. The method according to claim 8, in which, before the generated triads and the resulting indices are stored in the database, each group of objects linked by relations of identity by reference is collapsed into a single object whose set of attributes is the union of the attributes of the objects in that group.

15. A machine-readable medium intended for direct use in the operation of a computer and containing a program for implementing the method according to claim 1.

16. A machine-readable medium intended for direct use in the operation of a computer and containing a program for implementing the method according to claim 8.