EA002016B1

EA002016B1 - A method of searching for fragments with similar text and/or semantic content in electronic documents stored on data storage devices

Info

Publication number: EA002016B1
Application number: EA200100467A
Authority: EA
Inventors: Лев Лазаревич Матвеев; Александр Иванович Акимов
Original assignee: Лев Лазаревич Матвеев; Александр Иванович Акимов
Priority date: 2001-04-06
Filing date: 2001-04-06
Publication date: 2001-10-22
Also published as: EA200100467A1

Abstract

1. A method of searching for fragments with similar text and/or semantic content in electronic documents stored on data storage devices, comprising: indexing each document that is saved in the archive, dividing said documents into fragments and forming subject groups containing one or more fragments, defining search parameters, conducting a search, and ranking the list of document fragments received as a result of the search, wherein a set of unique information blocks appearing in the selected document fragment is defined as the search parameters and said set is extended by preprocessing each of said unique information blocks, where a unique information block is an information block that appears in the selected document fragment one or more times, and where the preprocessing is an operation of obtaining, from at least one unique information block, one or more information blocks connected with said unique information block by a specified relation.

2. The method of claim 1, wherein an information block is a character string that appears in a document and is delimited by certain characters.

3. The method of claim 1, wherein a document fragment is any selected sequence of information blocks in a document.

4. The method of claim 3, wherein a document fragment is a complete copy of a document.

5. The method of claim 3, wherein rules for dividing documents into fragments are specified.

6. The method of claim 5, wherein a document is divided into fragments using at least one rule.

7. The method of claim 5, wherein documents are automatically divided into fragments.

8. The method of claim 1, wherein the set of document fragments that is searched for fragments similar in text and/or semantic content to the selected one is limited by indicating at least one rule that was used to divide documents into fragments and/or by indicating a subject group.

9. The method of claim 1, wherein users form a subject group.

10. The method of claim 1, wherein a program module for processing documents is used to divide documents and their fragments into a set of information blocks.

11. The method of claim 1, wherein at least one preprocessing operation is selected and users specify the order of the preprocessing operations.

12. The method of claim 11, wherein the preprocessing consists of one logical operation of identity.

13. The method of claim 1, wherein, after the preprocessing operation is performed, at least one initial block is removed from or left in the query.

14. The method of claim 1, wherein all unique information blocks of the selected document fragment are included in the set of unique information blocks defined as the search parameters.

15. The method of claim 1, wherein the set of unique information blocks that is defined as the search parameters is formed using one or more functions.

16. The method of claim 15, wherein a function that makes a list of unique information blocks is used for the selected fragment.

17. The method of claim 15, wherein a function is used that determines the number of occurrences of the information blocks obtained after preprocessing the unique information blocks in the text content of fragments of the selected subject group.

18. The method of claim 15, wherein a function is used that determines the frequency of occurrences of unique information blocks in the text content of fragments of the selected subject group, where the frequency of occurrences is expressed as a percentage of the number of occurrences of the most frequent unique information block in the text content of the selected subject group.

19. The method of claim 1, wherein the set of unique information blocks that is defined as the search parameters includes a specified number of unique information blocks from a set that has been created according to user-defined rules.

20. The method of claim 1, wherein the search for document fragments is carried out on local and/or remote data storage devices.

21. The method of claim 20, wherein a remote device is any information resource or information retrieval system that operates in a computer network and provides, in response to a search query, a list of document fragments relevant to the search parameters.

22. The method of claim 1, wherein the list of document fragments received as a result of the search is ranked according to the number of information block groups that appear in the found document fragments, where each group combines the information blocks obtained after preprocessing a unique information block.

23. The method of claim 22, wherein, while ranking, for document fragments with the same number of information block groups, the number of occurrences of said information block groups in the text content of fragments of the selected subject group is additionally determined.

24. The method of claim 1, wherein the fragments from the list of document fragments received as a result of the search are displayed with the differences between their text content and the text content of the selected fragment highlighted.

25. The method of claim 0, wherein a document that is being added to the archive is saved as a new version of a previously saved document that has a specified measure of similarity in text and/or semantic content to the document being added, and said new version is saved either as a complete copy of the document being added or in the form of differences between the text content of the document being added and the text content of said previously saved document.

26. The method of claim 0, wherein fragments that are saved on a data storage device are automatically classified.

27. A method of searching for electronic documents stored on data storage devices, comprising: indexing each document that is saved in the archive, defining parameters of the search for electronic documents, which includes forming an initial query containing two or more information blocks and specifying the maximal number of information blocks that may appear between said two or more information blocks in the desired document (the interval) and the order of alternation of said two or more information blocks in the desired document within the specified interval, and ranking the list of documents received as a result of the search, said method further comprising: extending the initial query containing two or more information blocks by preprocessing one or more information blocks from the initial query, where the preprocessing of the initial query is an operation of obtaining, from at least one unique information block, one or more information blocks connected with said unique information block by a specified relation, and searching for documents using any specified number of the information blocks defined as search parameters when forming the initial query.

28. The method of claim 27, wherein an information block is a character string that appears in a document and is delimited by certain characters.

29. The method of claim 27, wherein a program module for processing documents is used to divide documents and their fragments into a set of information blocks.

30. The method of claim 27, wherein at least one operation of initial query preprocessing is selected and users specify the order of the initial query preprocessing operations.

31. The method of claim 30, wherein the preprocessing of the initial query consists of one logical operation of identity.

32. The method of claim 27, wherein, after the preprocessing operation is performed, at least one information block is removed from or left in the query.

33. The method of claim 27, wherein the search for documents is carried out on local and/or remote data storage devices.

34. The method of claim 33, wherein a remote device is any information resource or information retrieval system that operates in a computer network and provides, in response to a search query, a list of document fragments relevant to the search parameters.

35. The method of claim 27, wherein the list of document fragments received as a result of the search is ranked in accordance with the number of occurrences of relevant information block sequences in the documents, the length of the interval within which said sequences fall, and the weighting coefficients assigned to the information blocks.

36. The method of claim 35, wherein documents are displayed with the information block sequences that correspond to the search parameters and the preprocessed query highlighted, and navigation through the text content of the documents by information block sequences is provided across the whole list of documents received after the search.
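
As an informal illustration of claims 1, 2 and 18 (not the patented implementation), the sketch below splits a fragment into information blocks, collects the unique blocks, extends each one through a relation table, and computes relative frequencies. The function names, the `RELATED` table and the choice of non-word characters as delimiters are assumptions made only for this example.

```python
import re
from collections import Counter

def information_blocks(text):
    # Assumption: blocks are lowercase runs of word characters,
    # delimited by any non-word character (cf. claim 2).
    return re.findall(r"\w+", text.lower())

def unique_blocks(fragment):
    # Blocks that occur one or more times, kept in first-seen order.
    return list(dict.fromkeys(information_blocks(fragment)))

# Hypothetical relation table standing in for a morphological
# analyzer or thesaurus; it is not taken from the patent.
RELATED = {"car": ["cars", "automobile"], "fast": ["quick"]}

def extend_query(blocks, related=RELATED, keep_initial=True):
    # Each unique block yields a group: the block itself (optionally)
    # plus every block connected to it by the relation.
    return {
        b: ([b] if keep_initial else []) + related.get(b, [])
        for b in blocks
    }

def relative_frequency(blocks, group_fragments):
    # Occurrences of each block in the subject group, expressed as a
    # percentage of the most frequent block's count (cf. claim 18).
    wanted = set(blocks)
    counts = Counter(
        t for frag in group_fragments
        for t in information_blocks(frag) if t in wanted
    )
    top = max(counts.values(), default=0)
    return {b: (100.0 * counts[b] / top if top else 0.0) for b in blocks}
```

For example, `extend_query(unique_blocks("Fast car"))` produces one group per unique block, which a search back end could then match as alternatives.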

Description

The present invention relates to methods for retrieving information stored on local and remote data storage devices. In particular, the invention relates to methods for searching data storage devices for electronic documents and their fragments that are similar in text and/or semantic content to a selected fragment. The invention also relates to methods for searching for documents using queries that consist of two or more words, are extended with semantic analogues, and take into account the interval between the words and the order of their alternation in the query.

Description of the Related Art

Full-text index search methods for electronic documents stored on data storage devices are well known and are used in various information retrieval systems that operate both on local data storage devices and in distributed computer systems, including the Internet. Such systems make it possible to search data archives for the documents that interest the user, in accordance with specified parameters.

One direction in the development of full-text search technologies is the creation of systems that support phrase search. Such systems provide for forming an initial query consisting of two or more words and for defining search parameters, including the interval between the words and the order of their alternation in the desired documents. Existing systems can search for documents that contain, within the specified interval, any number of words from the initial query. The resulting list includes documents that are relevant to the initial query only by the formal criterion of vocabulary overlap between the query phrase and the phrases in those documents. The best of these systems expand the query with morphological word forms and with synonyms from a dictionary predefined by the developers. Expanding queries with other semantic analogues, for example through semantic thesauri, or applying several handlers at once during query expansion, including user-created ones, is not supported. This reduces the effectiveness of search operations.
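
To make the interval and order parameters concrete, here is a minimal sketch of a phrase match over a tokenized document. It is an illustration only: the function name `find_matches`, the representation of each query word as a set of alternatives (the word plus its expansions), and the brute-force enumeration are assumptions, not a description of any existing system.

```python
from itertools import product

def find_matches(tokens, term_groups, interval, ordered=True):
    # term_groups: one set of alternatives per query word.
    # interval: maximum number of other tokens allowed between
    # neighbouring matched words.
    positions = [
        [i for i, t in enumerate(tokens) if t in group]
        for group in term_groups
    ]
    matches = []
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # one token cannot satisfy two query words
        # With ordered=True the query order must be preserved;
        # otherwise only the distances between neighbours matter.
        seq = combo if ordered else tuple(sorted(combo))
        if all(0 < b - a <= interval + 1 for a, b in zip(seq, seq[1:])):
            matches.append(combo)
    return matches
```

With `tokens = "the quick brown fox jumps".split()` and the query `[{"quick", "fast"}, {"fox"}]`, an interval of 1 accepts the match because one token separates the two words, while an interval of 0 rejects it.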

Including in the result list documents that contain phrases similar to the specified ones not only in text content but also in semantic content (through the use of semantic thesauri) is not supported, which reduces the effectiveness of search operations.

Existing systems designed for full-text phrase search do not solve all the problems of making search more efficient. The main reason is the growing amount of information that any computer user has to process. Information that enters an archive on a data storage device from various sources often duplicates previously stored data. As a result, the document lists produced by an ordinary full-text search may contain duplicates, and the user has to spend time processing them again. The problem is especially acute in multi-user operation, when several users simultaneously fill the archive with documents on the same subject.

For this reason, systems designed to search for documents similar in text and/or semantic content have recently become very popular. They are meant to spare the user a number of problems, the main one being the considerable time spent processing redundant (duplicate) information. Such systems optimize the process of filling the archive with documents by screening out duplicates. In addition, automatic classification of the documents saved on a data storage device helps keep the stored archive in order.

However, existing systems designed to search for documents similar in text and/or semantic content are not without drawbacks. The main one is that they cannot use the entire set of words found in the text of the selected document for the search, because processing a query that consists of a large number of words takes considerable time and is therefore unacceptable to the user. The problem is aggravated when the initial query is expanded with semantic analogues, for example by using morphological analyzers or thesauri. Existing systems for finding documents similar in text content therefore allow only a certain number of significant (key) words from the document being searched for to be used in the query. This approach significantly reduces the efficiency of searching for documents similar in text content and does not achieve the desired result.

Characteristics of Analogues

A method and apparatus for retrieving text using document signatures [Patent US 6029167, international class G06F 017/00] was selected as an analogue for the method of searching for documents and fragments similar in text and/or semantic content. The method is designed to search for similar and identical fragments of documents stored in a database.

The method encodes fragments of document text as a sequence of markers, so that each fragment is assigned a marker signature. The encoded fragment is compared with fragments that are stored in the database and encoded in the same way. The comparison is based on the marker sequences (signatures) of the fragments. If fragments similar to the selected one (with identical signatures) are found in the database, the documents containing those fragments are retrieved from the database. The selected fragment is then compared with the retrieved documents either by searching over consecutive character strings or by comparing each word of the source fragment with each word of the retrieved documents.

A disadvantage of invention US 6029167 is that it searches for documents and their fragments only by formal matching of the words in their text content. The invention does not provide for expanding search queries with any analogues (morphological word forms, synonyms, etc.). That is, US 6029167 finds documents that are similar only in text content and cannot find documents that are similar to the selected one in semantic content.

A further disadvantage of invention US 6029167 is that it offers no choice of technique for forming the search query. That is, the search uses only marker sequences formed in one specific way. Searching for similar documents or fragments with other query formation techniques is not implemented, which limits the functionality of the invention to detecting plagiarism in various documents.

A document extraction and comparison method with applications to automatic personalized database searching [Patent US 5926812, international class G06F 017/30] was also selected as an analogue for the method of searching for documents and fragments similar in text and/or semantic content. The method searches data storage devices for documents of interest to the user. It includes the following operations:

- determining the set of words that occur most often in the documents stored in the archive on the user's device, taking into account the number of occurrences of the words in the documents and their importance, as determined by their position in headings and so on,

- sending the resulting set of words to a remote data storage device, searching it for documents that correspond to said set of words, and forming the set of documents that match the query,

- retrieving from the archive stored on the remote device the documents that have the highest degree of similarity to the documents stored on the user's device, and displaying them to the user.
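
The first operation in the list above can be sketched as a weighted word count. This is only an illustration under stated assumptions: the `(title, body)` document shape, the whitespace tokenization and the `title_weight` value are hypothetical and are not taken from the cited patent.

```python
from collections import Counter

def top_keywords(documents, k, title_weight=3.0):
    # documents: iterable of (title, body) pairs (assumed shape).
    # A word scores 1 per occurrence in a body and title_weight per
    # occurrence in a title, so heading words count as more important.
    scores = Counter()
    for title, body in documents:
        for word in body.lower().split():
            scores[word] += 1.0
        for word in title.lower().split():
            scores[word] += title_weight
    return [word for word, _ in scores.most_common(k)]
```

The returned words would then form the query sent to the remote storage device.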

A disadvantage of invention US 5926812 is that it searches for documents only by formal matching of the words in their text content. The invention does not provide for expanding search queries with any analogues (morphological word forms, synonyms, etc.). That is, US 5926812 finds documents that are similar only in text content and cannot find documents that are similar to the selected document in semantic content.

A further disadvantage of invention US 5926812 is that it offers no choice of technique for forming the search query. That is, the search uses only the most significant (key) words of a given set of documents. Searching for similar documents with other query formation techniques is not implemented. For example, the invention cannot search by words that occur in the text content of a given set of documents with a certain frequency, which limits its functionality.

A method and apparatus for indexing and retrieving documents [Patent WO 9959085, international class G06F 17/30] was selected as an analogue for the method of searching for documents by an initial query of two or more words. The method includes the following operations:

- splitting a given document into elements (sequences) of 3 words each, i.e. a document of N words is split into N-2 elements; for example, a five-word document '01234' is split into the three sequences '012', '123' and '234',

- assigning a score to each element obtained for the given document,

- comparing the scores (all of them or some subset) obtained for the given document with the scores in the index of the database of stored documents, determining the IDs of the stored documents with matching scores, and retrieving those documents from the database,

- counting the number of elements present in both the given document and the retrieved documents in order to determine whether they are identical.
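
The operations listed above can be sketched as follows. This is an illustrative reading, not the cited patent's implementation: using a CRC-32 checksum as the element score and a dictionary from score to document IDs as the index are assumptions made for the example.

```python
import zlib
from collections import Counter, defaultdict

def shingles(words, size=3):
    # A document of N words yields N - size + 1 overlapping elements
    # (N - 2 for the 3-word elements described above).
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]

def score(shingle):
    # Assumption: the "score" of an element is a CRC-32 checksum.
    return zlib.crc32(" ".join(shingle).encode("utf-8"))

def build_index(stored):
    # stored: mapping of document ID to text.
    index = defaultdict(set)
    for doc_id, text in stored.items():
        for s in shingles(text.split()):
            index[score(s)].add(doc_id)
    return index

def shared_counts(query_text, index):
    # Count, per stored document, how many distinct elements of the
    # query document also appear in it.
    counts = Counter()
    for s in set(shingles(query_text.split())):
        for doc_id in index.get(score(s), ()):
            counts[doc_id] += 1
    return counts
```

A stored document whose shared count equals its own number of elements and the query's number of elements would be a candidate for an identical copy.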

The method involves the use of a device that performs the operations provided for by invention WO 9959085.

A disadvantage of invention WO 9959085 is that it searches for documents only by formal matching of the word sequences they contain, which makes it impossible to find documents that are semantically similar.

The Internet search engine Delphion [www.delphion.com] was also selected as an analogue for the method of searching for documents by an initial query of two or more words. This search engine supports phrase search for documents (patents) of interest to the user by a query consisting of two or more words, taking into account the distance between the words and the order of their alternation in the desired document [http://www.delphion.com/help/langhelp]. The query is additionally expanded with synonyms for one or more of its words.

A disadvantage of the Delphion search engine is that it can expand the query only with synonyms from a dictionary predefined by the developers. Expanding the query with other semantic analogues, for example through semantic thesauri, or applying a specified sequence of handlers during query expansion, is not supported in Delphion, which reduces the effectiveness of search operations. In addition, Delphion does not let users create their own thesauri.

Problem solved by the invention

The problem solved by the invention is to optimize the search, on data storage devices, for documents and document fragments that are similar not only in text but also in semantic content, and to eliminate duplication of the information stored in the archive. The invention also aims to optimize the search for documents by queries of two or more words, taking into account the interval between the words and the order in which they occur in the sought document. Phrase search is performed not only by formal correspondence to the query but also with semantic analogues taken into account. The problem is solved by:

- using, as parameters for searching for documents similar in text and/or semantic content to the selected one, both the entire set of words in the selected document and a specified number of words from a set formed according to certain rules, including user-defined rules,

- using different techniques for forming the set of words used as the search query, depending on the purpose of the search,

- determining the degree of similarity of documents and classifying them automatically when they are entered into the archive,

- expanding initial queries by adding semantic analogues for the words of the initial query, using preprocessing operations,

- forming lists of documents and document fragments that match the search parameters, and ranking them by relevance to the query, including the query obtained after preprocessing,

- visualizing, in the displayed documents, the results that match the search parameters and the query obtained after preprocessing.

Brief description of the drawings

FIGS. 1 and 2 show the process of preprocessing a search query using handlers of two types.

FIG. 3 shows the preprocessing of a one-word search query using several complementing handlers.

FIG. 4 shows a method of preprocessing a search query.

FIG. 5 shows an algorithm for forming the list of documents for words included in a search query and combined by the Boolean operator 'OR'.

FIGS. 6 and 7 schematically show methods of dividing documents into fragments.

FIG. 8 shows an algorithm for searching data storage devices for document fragments similar in text and/or semantic content to a selected fragment.

FIG. 9 shows an algorithm for searching data storage devices for documents by a query of two or more initial words, taking into account the interval between the words and the order in which they occur in the sought phrases.

Basic provisions and definitions

The computer information processing implemented in the present invention consists of operations such as adding, calling, receiving, transmitting and comparing information, which are often associated with manual operations performed by an operator. The operations described here are machine operations performed in combination with various inputs provided by an operator or user interacting with the computer.

The key point that defines the unity of the inventive concept of the present inventions is the preprocessing of the search query originally formulated by the user.

Query preprocessing

The preprocessing operations are transformations performed according to various rules by which one or more words, related to the initial word by a specified relation, are obtained from the initial word. These rules include the use of various dictionaries and thesauri (morphological, synonym, semantic, bilingual, etc.) as well as various transformation functions. Transformation functions include, for example, replacing lowercase characters with uppercase and vice versa, or replacing Latin letters with Cyrillic and vice versa. The rules used for preprocessing (dictionaries, thesauri and transformation functions) can be either predefined or created by the user. Users form their own rules according to criteria known to them. For example, to make searching for films easier, a user creates a personal thesaurus and expands the search query according to it. That is, a fan of comedy films featuring particular actors compiles a thematic thesaurus that links the word 'comedy' with the actors' surnames 'Richard', 'Murphy', etc., which subsequently allows the search for the required data to be optimized. After the query 'comedy', expanded using this thematic thesaurus, has been processed, the result list will also include documents containing the words 'Richard' and 'Murphy'. In this way any number of thematic thesauri can be formed for later use as handlers. The present invention provides for the selection of at least one query preprocessing operation, and the order in which the preprocessing operations are performed is specified by users.

The query preprocessing mechanism uses two types of handlers: an expanding handler (RO) and a complementing handler (DO).
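The user-defined thematic thesaurus from the comedy example can be pictured with a short sketch. This is only an illustration under stated assumptions: the function names `expand_query` and `matches`, the dictionary layout, and the OR-style matching are hypothetical and are not taken from the description.

```python
def expand_query(words, thesaurus):
    """Return the query words followed by the words the thesaurus links to them."""
    expanded = []
    for word in words:
        # Keep the original word, then add its user-defined analogues.
        for candidate in [word, *thesaurus.get(word, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def matches(document_words, query_words):
    """Assumed OR-style matching: any query word found in the document."""
    return any(word in document_words for word in query_words)


# Hypothetical thematic thesaurus compiled by a fan of comedy films.
comedy_thesaurus = {"comedy": ["Richard", "Murphy"]}

expanded = expand_query(["comedy"], comedy_thesaurus)
```

With this thesaurus, a document that mentions only 'Murphy' is still matched by the expanded query, which is the effect the example describes.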

Each of the words used in query preprocessing belongs to some logical group formed for words that are similar by a certain attribute. For example, the Russian word forms 'зеленый', 'зеленые', 'зеленых' (inflected forms of "green") and so on belong to one group of words united by a morphological attribute (words with a common root). Note that groups of words united by attributes other than the morphological one can also serve as logical groups for the RO. In a particular case, a logical group consists of a single word. Logical word groups are used when the query is processed by the expanding handler (RO). Each word belongs to only one logical group of words similar by a given attribute (for example, to only one morphological group). The RO performs two types of transformation, namely:

- word to group, i.e. a transformation that determines the identifier of the logical group of the required type for each of the processed words;

- group to word, i.e. a transformation that, given the identifier of a logical group, determines all the words in that group.
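The two RO transformations can be sketched as a pair of lookups over disjoint groups. This is a minimal sketch, not the patented implementation: the class name `ExpandingHandler`, the group identifier "green", and the identity fallback for unknown words are assumptions made for illustration.

```python
class ExpandingHandler:
    """Sketch of an expanding handler (RO) over disjoint logical groups."""

    def __init__(self, groups):
        # groups maps a group identifier to the words of that group.
        self._words_by_group = {gid: list(words) for gid, words in groups.items()}
        self._group_by_word = {}
        for gid, words in self._words_by_group.items():
            for word in words:
                # Each word may belong to only one group of a given type.
                if word in self._group_by_word:
                    raise ValueError(f"{word!r} is already in a group")
                self._group_by_word[word] = gid

    def word_to_group(self, word):
        """Word to group: return the identifier of the word's logical group."""
        # Assumption: an unknown word forms its own single-word group.
        return self._group_by_word.get(word, word)

    def group_to_words(self, gid):
        """Group to word: return every word in the group with this identifier."""
        return list(self._words_by_group.get(gid, [gid]))


# Hypothetical morphological group for the word forms used in the text.
morphology = ExpandingHandler({"green": ["зеленый", "зеленые", "зеленых"]})
gid = morphology.word_to_group("зеленые")
forms = morphology.group_to_words(gid)
```

Storing both directions up front keeps each lookup constant-time and lets the constructor enforce the rule that a word belongs to only one group of a given type.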

Various dictionaries and thesauri (synonym, semantic, bilingual, etc.) are used as DOs, as are the transformation functions described above (replacing lowercase characters with uppercase and vice versa, etc.). A DO performs the following transformation:

- group to group, i.e. a transformation that, given the identifier of a particular logical group, determines the identifiers of the groups corresponding to that logical group.

This approach to implementing handlers optimizes query preprocessing: the user does not have to manually establish links between all the words in the related logical groups. This processing is performed automatically.
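The saving described above can be made concrete with a small sketch in which a DO stores links between group identifiers only, and the word-level links follow from group membership. The group identifiers, the sample words and the helper names are hypothetical.

```python
# Hypothetical morphological groups (the RO's data): group id -> word forms.
words_by_group = {
    "car": ["car", "cars"],
    "automobile": ["automobile", "automobiles"],
    "vehicle": ["vehicle", "vehicles"],
}

# Hypothetical synonym thesaurus (a DO): links are stored per group, not per word.
synonym_links = {"car": ["automobile", "vehicle"]}


def group_to_groups(gid, links):
    """Group to group: identifiers of the groups linked to the given group."""
    return list(links.get(gid, []))


def implied_word_links(gid, links, groups):
    """Every (source word, target word) pair implied by the group-level links."""
    return [
        (source, target)
        for linked in group_to_groups(gid, links)
        for source in groups[gid]
        for target in groups[linked]
    ]


pairs = implied_word_links("car", synonym_links, words_by_group)
```

Here two group-level links stand in for eight word-level pairs; with realistic morphological groups the difference grows with the product of the group sizes.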

The entire query preprocessing process using handlers of the two types (RO and DO) is shown schematically in FIG. 1 as a graph of handlers in which the first vertex is an RO (word-to-group transformation) and all subsequent vertices are DOs (group-to-group transformations). The process illustrates the use of an unlimited number (one or more) of DOs at each preprocessing stage. That is, the logical group obtained after the RO transformation is processed using DO-1/1 to DO-1/N; the groups obtained after the DO-1/1 transformation are processed using DO-1/1/1 to DO-1/1/N; the groups obtained after the DO-1/N transformation are processed using DO-1/N/1 to DO-1/N/N; and so on.
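One way to read the graph of handlers is as a tree that is walked recursively: each DO is applied to the groups produced by its parent, and its own output is passed to its children. The sketch below assumes that reading; `HandlerNode`, `apply_handlers` and the group identifiers G0 to G3 are hypothetical names, and each DO is reduced to a dictionary of group links.

```python
from dataclasses import dataclass, field


@dataclass
class HandlerNode:
    """One complementing handler (DO) in the graph, with its child handlers."""
    name: str
    links: dict  # group id -> list of linked group ids (hypothetical data)
    children: list = field(default_factory=list)


def apply_handlers(groups, nodes, collected):
    """Apply each node to `groups`, then its children to the groups it produced."""
    for node in nodes:
        produced = []
        for gid in groups:
            for target in node.links.get(gid, []):
                if target not in produced:
                    produced.append(target)
        for gid in produced:
            if gid not in collected:
                collected.append(gid)
        # Children see only the groups produced by their parent handler.
        apply_handlers(produced, node.children, collected)
    return collected


# The root group is what the RO (word to group) returned for the query word.
root_group = "G0"
graph = [
    HandlerNode("DO-1/1", {"G0": ["G1"]},
                children=[HandlerNode("DO-1/1/1", {"G1": ["G3"]})]),
    HandlerNode("DO-1/2", {"G0": ["G2"]}),
]
all_groups = apply_handlers([root_group], graph, [root_group])
```

The traversal is depth-first, so groups reached through DO-1/1 and its children are collected before DO-1/2 is applied; the set of collected groups does not depend on that order.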

To make the described preprocessing easier to follow, consider a specific example in which a one-word query is preprocessed using three complementing handlers, as shown in FIG. 2. Thematic data are retrieved from a database containing documents in different languages, and the data must match the initial query not only formally but also in meaning. That is, the user searches for information of interest stored not only in the user's native language (for example, Russian) but also in other languages (for example, English and French).

Thus the search is performed in an archive whose data are stored in different languages. To obtain the sought data, the initial query is expanded using preprocessing operations. A specified preprocessing sequence is used, which consists in the staged application of an expanding morphological handler, a complementing synonym handler, and two complementing handlers represented by bilingual translation dictionaries (Russian-English and Russian-French).

The following preprocessing algorithm is used. The user forms an initial query consisting of one word and specifies the type and number of handlers and the order in which they are used. Since a thematic search is being performed, at stage 1, as shown in FIG. 2, it is advisable to use as the RO a morphological transformer that determines the logical group identifier for each word of the query (in this example, a single word). Stage 1 yields a group of words united with the initial word by a morphological attribute (words with a common root). The resulting group (the first group) may contain one or more words. Since the sought data must match the initial query not only formally but also in meaning, at stage 2 it is advisable to use a complementing handler that forms a set of groups related to the first group by synonymy. Thus a dictionary of synonyms is used as the first complementing handler, DO-1, which expands the initial query with semantic analogues (synonyms). DO-1 performs a group-to-group transformation and determines, according to the transformation used, the identifiers of the groups for the group obtained at stage 1 (the first group). The words in the groups obtained at stage 2 form a synonym series. One of the conditions of query preprocessing in this example is that the first group must be present in the set of groups obtained at stage 2, because the words of the first group are needed for the subsequent search and excluding them from further processing would reduce the effectiveness of the search operation. In other cases the first group may be excluded from the set of groups obtained at stage 2 and thus not be processed further.

Next, at stage 3, each of the groups obtained at stage 2 is processed using two complementing handlers: DO-2, a Russian-English dictionary, and DO-3, a Russian-French dictionary. Stage 3 yields a set of groups related to the groups obtained at stage 2 by the corresponding transformations.

All the groups obtained at stages 1-3 are included in the final set of groups. A set of words is then formed from this final set of groups, and the final query is formed from those words for searching an archive that contains information in different languages (Russian, English, French), using semantic analogues for each of these languages. That is, the final preprocessing stage uses the expanding handler (RO) to perform a group-to-word transformation for all the groups in the final set. This transformation determines, from the identifiers of the logical groups, all the words in the respective groups and forms from them the final set of words, which includes every word belonging to the obtained groups. The final query is subsequently formed from these words.

In the example above, the specified sequence of handlers includes one complementing handler of each particular type: one synonym handler for Russian, one Russian-English translation handler and one Russian-French translation handler. This way of forming the sequence of handlers is a particular case. In the general case, any number of complementing handlers of a particular type (several synonym handlers, several bilingual handlers, etc.) can be included in the specified sequence in any order as required. A variant is also possible in which a complementing handler of one type (for example, the synonym handler for Russian) takes part in processing the same group several times.
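The general case can be pictured as a list of stages, each holding any number of complementing handlers, in which the same handler may appear more than once. The sketch below is one possible reading, not the method as claimed: `run_stages`, the `keep_input` flag (modelling the condition that the input groups may be retained or dropped) and the group identifiers A, B, C are assumptions.

```python
def run_stages(start_groups, stages):
    """Run stages in order and return every group seen along the way.

    Each stage is a pair (handlers, keep_input). handlers is a list of dicts
    mapping a group id to linked group ids. keep_input controls whether the
    stage's input groups are passed on together with the groups it produced.
    """
    all_groups = list(start_groups)
    current = list(start_groups)
    for handlers, keep_input in stages:
        produced = list(current) if keep_input else []
        for links in handlers:
            for gid in current:
                for target in links.get(gid, []):
                    if target not in produced:
                        produced.append(target)
        for gid in produced:
            if gid not in all_groups:
                all_groups.append(gid)
        current = produced
    return all_groups


# Hypothetical synonym handler in which B is a synonym of A and C of B.
synonyms = {"A": ["B"], "B": ["C"]}

once = run_stages(["A"], [([synonyms], True)])
twice = run_stages(["A"], [([synonyms], True), ([synonyms], True)])
```

Applying the synonym handler in two consecutive stages reaches group C, which a single pass does not, so repeating a handler is a way of following longer chains of related groups.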

To make the query preprocessing easier to follow, the example above is now described in more detail. Recall that this example describes the preprocessing of a one-word query. If the query consists of several words, the preprocessing algorithm is applied to each word of the initial query. The initial query in this example, as shown in FIG. 3, is the word 'Информация' ("information"). Preprocessing is performed using an RO that unites words into logical groups by a morphological attribute. In addition, three complementing handlers are used: a synonym handler for Russian, a Russian-English translation handler and a Russian-French translation handler. In FIG. 3, RO transformations are shown by dashed lines and DO transformations by bold lines.

For the initial word 'Информация', as shown in FIG. 3, the word-to-group transformation of the expanding handler determines the logical group LG-1, which contains the words that share a morphological attribute with the initial word. The set of words for LG-1 contains 'Информация' and 'Информации' (words united with the initial word 'Информация' by a morphological attribute). The group may of course contain more words; the number is limited in this example to keep the preprocessing easy to follow. Note that RO preprocessing may amount to an identity operation, which is the case when the logical group consists of a single word.

Preprocessing then continues with the complementing handlers, which perform group-to-group transformations. First, in accordance with the specified preprocessing parameters, the synonym handler DO-1 forms the set of groups related to LG-1 by synonymy. These groups are LG-2 and LG-3. In addition, one of the conditions of query preprocessing is that group LG-1 must be present in the set of groups formed for further processing, as shown in FIG. 2. Thus, after the query has been processed by DO-1, the set of groups LG-1, LG-2 and LG-3 is obtained. This set may of course contain more logical groups; the number is limited in this example to keep the description simple. Since groups LG-2 and LG-3 were formed by a synonym transformation, they contain words that form a synonym series with the word 'Информация'. For example, the set of words for LG-2 contains 'Сведения' and 'Сведений', and the set of words for LG-3 consists of 'Сообщение' and 'Сообщения'. As the description shows, the resulting sets of words are related to one another by synonymy (DO-1 processing), while the words within each set are related by morphology (RO processing).

This example provides for subsequent processing of the resulting groups by the complementary handlers DO-2 and DO-3. DO-2 is a Russian-English translation dictionary, and DO-3 is a Russian-French dictionary. Processing by each of these handlers yields new groups related to LG-1, LG-2 and LG-3 by the corresponding transformations. Thus, for LG-1, DO-2 forms a set consisting of the group LG-4, which contains the words 'Information' and 'Informational'. Similarly, processing LG-2 with DO-2 forms a set consisting of the group LG-6, which contains the words 'Material' and 'Materials', and processing LG-3 forms a set consisting of the group LG-8, which contains the words 'Message' and 'Messages'. The relations between the word sets of LG-4 and LG-1, and likewise for the pairs LG-6/LG-2 and LG-8/LG-3, are defined by the use of the Russian-English translation dictionary.

In addition, the use of DO-3 prescribed by the preprocessing algorithm for groups LG-1, LG-2 and LG-3 yields further sets of groups. For LG-1 the group LG-5 is obtained, consisting of the words 'Information' and 'Informationnel'; for LG-2 the group LG-7, consisting of the words 'Renseignement' and 'Renseignements'; and for LG-3 the group LG-9, consisting of the words 'Message' and 'Messages'. The relations between the word sets of LG-5 and LG-1, and likewise for the pairs LG-7/LG-2 and LG-9/LG-3, are defined by the use of the Russian-French translation dictionary.

All logical groups formed during processing are added to the final set of groups, which after all transformations consists of LG-1, LG-2, LG-3, LG-4, LG-5, LG-6, LG-7, LG-8 and LG-9. The final stage of query preprocessing (after all transformations by the complementary handlers) applies a group-to-word transformation to each logical group in the final set, using the expanding handler. This produces a word set for each logical group. As noted above, the word set obtained for each group contains words that share a common morphological feature.

The RO transformation uses the identifiers of the logical groups to determine all the words belonging to each group and forms from them the final set of words, which contains every word of the obtained groups. Redundant information, i.e. duplicate words, is removed from the final set; in this example these are the words 'Information', 'Message' and 'Messages', each of which occurs twice. The final query is then formed from the words remaining in the final set by joining them with the Boolean operator 'OR'.

Thus, for this example, the final query takes the form:

'Информация OR Информации OR Сведения OR Сведений OR Сообщение OR Сообщения OR Information OR Informational OR Informationnel OR Material OR Materials OR Message OR Messages OR Renseignement OR Renseignements'
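By way of illustration only, the formation of the final query from the nine logical groups can be sketched as follows. The dictionary `GROUP_WORDS` and the function `build_query` are hypothetical names introduced for this sketch, and the Latin-script spellings of the group words are assumed; the order of words in the result may differ from the order shown above.

```python
# Word sets of the logical groups LG-1 ... LG-9 from the example
# (Latin-script spellings are assumed for this sketch).
GROUP_WORDS = {
    "LG-1": ["Информация", "Информации"],
    "LG-2": ["Сведения", "Сведений"],
    "LG-3": ["Сообщение", "Сообщения"],
    "LG-4": ["Information", "Informational"],
    "LG-5": ["Information", "Informationnel"],
    "LG-6": ["Material", "Materials"],
    "LG-7": ["Renseignement", "Renseignements"],
    "LG-8": ["Message", "Messages"],
    "LG-9": ["Message", "Messages"],
}


def build_query(group_words):
    """Flatten the group word sets, drop duplicate words, join with OR."""
    # dict.fromkeys keeps the first occurrence of each word, in order.
    unique_words = list(dict.fromkeys(
        word for words in group_words.values() for word in words
    ))
    return " OR ".join(unique_words)


query = build_query(GROUP_WORDS)
terms = query.split(" OR ")
```

Of the eighteen words held by the nine groups, three occur twice, so the sketch yields fifteen query terms.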

Note that the same word may exist in several languages; the word 'Message', for example, is used in both English and French. Its processing by the RO in a word-to-group transformation may therefore involve choosing the RO type. That is, if in the example above 'Message' were the initial query word (instead of 'Информация'), the user building the handler sequence would have to decide which RO type (English or French morphology) to use to process the word 'Message' and form LG-1. The RO may be set by default: for example, when an English-speaking user works with the invention, the English-morphology RO is the default. This does not preclude the explicit selection of another RO type, for example the French-morphology RO. Note that the RO type is selected only for word-to-group transformations. For the inverse, group-to-word transformation, the required RO type is determined automatically from the identifier of the logical group in question.
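The asymmetry between the two transformation directions can be sketched as follows. This is a minimal illustration under stated assumptions: the table `MORPHOLOGY`, the default `DEFAULT_RO`, and the convention of prefixing a group identifier with its RO type are all hypothetical and are not taken from the description.

```python
# Hypothetical morphology tables: RO type -> {word: group identifier}.
# The "en:" / "fr:" prefix in the identifier is an assumed convention.
MORPHOLOGY = {
    "en": {"Message": "en:message", "Messages": "en:message"},
    "fr": {"Message": "fr:message", "Messages": "fr:message"},
}
DEFAULT_RO = "en"  # assumed default for an English-speaking user


def word_to_group(word, ro_type=DEFAULT_RO):
    """Word-to-group: the RO type must be chosen explicitly or by default."""
    return MORPHOLOGY[ro_type][word]


def group_to_words(group_id):
    """Group-to-word: the RO type is recovered from the group identifier."""
    ro_type = group_id.split(":", 1)[0]
    return sorted(w for w, g in MORPHOLOGY[ro_type].items() if g == group_id)
```

Here `word_to_group("Message")` uses the default English morphology, while `word_to_group("Message", "fr")` selects the French one explicitly; `group_to_words` needs no such argument.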

The preprocessing algorithm is presented as a sequence of operations performed by the system in which the present invention operates, as shown in FIG. 4. At step 4 the input data for preprocessing are determined: the initial word to be preprocessed, the number and order of the complementary handlers (DO) to be used, and the expanding handler (RO) for processing the initial word. In a particular case the RO is set by default and is a morphological handler. At step 5 the preprocessing is initialized, which creates the final set of words (empty), the final set of groups (empty), the sets of groups to be processed by each DO (empty), and an empty handler stack (CO). The format of the handler stack provides for storing the identifiers of DOs together with the identifiers of the groups to be processed by those DOs. Next, at step 6, the logical group containing the initial word specified at step 4 is obtained. This operation is performed by the RO and is a word-to-group transformation.

Also at step 6, in accordance with the handler-stack format, the following information is pushed onto the CO: the identifiers of all DOs linked to the RO (as shown in FIG. 3, this is DO-1) and the identifiers of the groups to be processed by those DOs (as shown in FIG. 3, for DO-1 this group is LG-1).

In addition, the obtained group is added to the final set of groups. The final set of groups means the entire set of groups obtained during preprocessing by the complementary handlers through group-to-group transformations.

At step 7 the handler stack is checked. If the stack is empty, then at step 8 the resulting set of words is formed by a group-to-word transformation using the RO (the RO for each group is determined automatically at this step). That is, a set of words is obtained from the final set of groups produced at the preceding steps, and any duplicates are removed from it. Then, at step 9, all words of the final set obtained at step 8 are joined with the Boolean operator 'OR'.

If the CO is not empty, then at step 10 the last handler on the stack is accessed (this handler is given the status of current handler). At step 11 all groups in the set assigned to the current handler are processed using a group-to-group transformation.

Also at step 11, in accordance with the handler-stack format, the following information is pushed onto the CO: the identifiers of all DOs linked to the current handler (as shown in FIG. 3, these are DO-2 and DO-3, linked to the current handler DO-1) and the identifiers of the groups to be processed by each of those DOs (as shown in FIG. 3, the stack will record that groups LG-1, LG-2 and LG-3 are to be processed by DO-2 and that the same groups are to be processed by DO-3).

In addition, the set of groups obtained after processing by the current DO is added to the final set of groups, with any duplicates removed. Next, at step 12, the handler with current status is removed from the handler stack, after which execution returns to step 7.

The operations of steps 7-12 are repeated until the handler stack becomes empty, after which step 8 is performed.
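Steps 4-12 can be sketched, purely as an illustration, for the example of FIG. 2 and FIG. 3. All table names (`WORD_TO_GROUP`, `GROUP_WORDS`, `DO_TABLES`, `LINKS`) and the function `preprocess` are hypothetical, the handlers are modelled as plain lookup tables, and the Latin-script spellings are assumed. As a simplification, the current handler is popped from the stack when it is accessed, which merges steps 10 and 12.

```python
# Illustrative lookup tables for the example (all assumed for this sketch).
WORD_TO_GROUP = {"Информация": "LG-1"}  # RO, word-to-group
GROUP_WORDS = {                          # RO, group-to-word
    "LG-1": ["Информация", "Информации"],
    "LG-2": ["Сведения", "Сведений"],
    "LG-3": ["Сообщение", "Сообщения"],
    "LG-4": ["Information", "Informational"],
    "LG-5": ["Information", "Informationnel"],
    "LG-6": ["Material", "Materials"],
    "LG-7": ["Renseignement", "Renseignements"],
    "LG-8": ["Message", "Messages"],
    "LG-9": ["Message", "Messages"],
}
DO_TABLES = {                            # DO, group-to-group
    "DO-1": {"LG-1": ["LG-1", "LG-2", "LG-3"]},  # synonyms, LG-1 retained
    "DO-2": {"LG-1": ["LG-4"], "LG-2": ["LG-6"], "LG-3": ["LG-8"]},
    "DO-3": {"LG-1": ["LG-5"], "LG-2": ["LG-7"], "LG-3": ["LG-9"]},
}
LINKS = {"RO": ["DO-1"], "DO-1": ["DO-2", "DO-3"], "DO-2": [], "DO-3": []}


def preprocess(initial_word):
    final_groups = []                        # step 5: empty final set of groups
    stack = []                               # step 5: empty handler stack (CO)
    group = WORD_TO_GROUP[initial_word]      # step 6: word-to-group via RO
    final_groups.append(group)
    for handler in LINKS["RO"]:              # step 6: push DOs linked to the RO
        stack.append((handler, [group]))
    while stack:                             # step 7: is the stack empty?
        current, groups = stack.pop()        # steps 10 and 12 (merged here)
        produced = []
        for g in groups:                     # step 11: group-to-group
            for out in DO_TABLES[current].get(g, []):
                if out not in produced:
                    produced.append(out)
        for handler in LINKS[current]:       # step 11: push linked DOs
            stack.append((handler, produced))
        for g in produced:                   # add to final set, no duplicates
            if g not in final_groups:
                final_groups.append(g)
    # Step 8: group-to-word with duplicate removal; step 9: join with OR.
    words = list(dict.fromkeys(w for g in final_groups for w in GROUP_WORDS[g]))
    return final_groups, " OR ".join(words)


groups, query = preprocess("Информация")
```

For the initial word 'Информация' the sketch collects all nine groups and produces a query of fifteen distinct terms; the order of the terms depends on the order in which handlers leave the stack.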

Since the example concerned a single-word query, it should be noted that when more complex, multi-word queries are preprocessed, the results obtained for each word of the original query are combined using Boolean operators: for example, the operator 'AND' when searching for a group of words, or the operator 'NEAR' for phrase search that takes into account the order of the words and the distance between them.
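A minimal sketch of combining the per-word results is given below. The function `combine` is a hypothetical name, and the use of parentheses to delimit each word's OR-expansion is an assumption about the query syntax rather than something the description specifies.

```python
def combine(expanded_terms, operator="AND"):
    """Join the OR-expansion of each query word with a Boolean operator."""
    parts = ["(" + " OR ".join(words) + ")" for words in expanded_terms]
    return f" {operator} ".join(parts)


# Two query words, each already expanded by preprocessing.
query = combine([["information", "details"], ["exchange", "transfer"]])
```

Passing `operator="NEAR"` would give the phrase-search variant under the same assumed syntax.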

Expanding the search query by preprocessing broadens the coverage of the information space searched for the data of interest. The final query is easily built using both standard handlers (dictionaries and transformation functions) and user-created ones. Preprocessing thus extends the search query with semantic analogues, which helps find documents whose textual content is similar not only formally but also in meaning. This capability is especially useful when processing documents on closely related subjects that use different vocabulary, for example documents containing the phrases 'information exchange' and 'data transfer'. In an ordinary search (without preprocessing), these phrases would not be treated as similar despite their evident semantic similarity, because they share no formal similarity in vocabulary. When preprocessing with a synonym dictionary is used, both phrases are identified as similar in meaning, and a search with preprocessing therefore finds documents containing either phrase.

The number of handlers used in this invention is not limited, but using a large number of them is not always advisable, since it may produce result lists containing redundant information.

Terms and definitions frequently used in the description of the invention

Information blocks are sequences of characters in a document delimited by certain characters. Thus, information blocks include words, character sequences denoting dates (for example, 21/03/2001), digit sequences denoting quantities, and so on, separated by spaces, commas, periods and other punctuation marks.
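As an illustration, splitting text into information blocks might be sketched as follows. The delimiter set in `DELIMITERS` is an assumption made for this sketch (the slash is deliberately left out so that a date such as 21/03/2001 stays one block), and `split_into_blocks` is a hypothetical name.

```python
import re

# Assumed delimiter set: whitespace and common punctuation marks.
DELIMITERS = re.compile(r'[\s,.;:!?()"]+')


def split_into_blocks(text):
    """Split text into information blocks, discarding empty pieces."""
    return [block for block in DELIMITERS.split(text) if block]


blocks = split_into_blocks("Report dated 21/03/2001, 15 pages.")
```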

A software module for processing documents of a particular type is intended to split a selected document fragment into source information blocks (words). A given module splits documents of a particular type (format) into information blocks, i.e. it is a converter that extracts information blocks from the textual content of documents of that format. Using such modules makes it possible to compare documents of different formats that have textual content for similarity. For example, the textual content of web pages can be compared with the textual content of Microsoft Word documents, and so on. Such a module may be created by the user. The invention also provides for the use of standard modules for processing documents of a particular type. Standard modules are used to split documents of well-known formats (Microsoft Office, Adobe Acrobat, etc.) into source words. For example, a standard module (a converter developed by Microsoft) will be used to split Microsoft Word documents and their fragments into source words.

Source words are the set of information blocks obtained after a particular document has been split by the software module for processing documents of that type.

The set of unique words of a particular document is formed from the set of source words of that document. For example, if a word occurs several times in the document, it is represented only once in the set of unique words. The set of unique words is thus an optimized set of source words.
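A minimal sketch of deriving the unique words from the source words follows. Exact, case-sensitive matching and preservation of first-occurrence order are assumptions of this sketch; `unique_words` is a hypothetical name.

```python
def unique_words(source_words):
    """Keep each source word once, in order of first occurrence."""
    return list(dict.fromkeys(source_words))


source = ["data", "transfer", "and", "data", "exchange", "and", "storage"]
unique = unique_words(source)
```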

The correspondence group of a source (or unique) word is the set of words obtained by preprocessing that source (or unique) word. The correspondence group contains all words related to the source (or unique) word by the relations defined by the specified preprocessing operations.

An element of a correspondence group is a word belonging to a particular correspondence group formed by preprocessing a source (or unique) word of the search query.

Forming document lists for correspondence groups

Common to the inventions included in this application is the algorithm for forming the list of documents (or document fragments) for the correspondence group associated with a particular word of the search query. For simplicity, only the term 'document' is used in the description of the algorithm below, since a complete copy of a document is a special case of a fragment.

The algorithm is implemented as follows. An intermediate document list (PSD) is obtained for each element of the correspondence group associated with a particular word. The intermediate document lists for the elements of the correspondence group are stored, because preprocessing may yield identical correspondence-group elements for several words of the query. For example, the words 'information' and 'message' both yield the element 'details'. Processing this word again to form its intermediate list, once for 'information' and once for 'message', is unnecessary. This matters most when the search query consists of words that are close in meaning, for example synonyms. Once the intermediate list for an element of a correspondence group has been formed, it is simply looked up thereafter.
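The storing of intermediate lists can be sketched as a simple cache. The inverted index `INDEX`, its contents, and the names `intermediate_list` and `index_lookups` are hypothetical; the description does not specify how an intermediate list is actually computed.

```python
# Hypothetical inverted index: word -> documents containing that word.
INDEX = {
    "information": ["Document1", "Document3"],
    "details": ["Document1", "Document2"],
    "message": [],
}

psd_cache = {}      # stored intermediate document lists (PSD), by element
index_lookups = 0   # counts how often a list is actually formed


def intermediate_list(element):
    """Return the PSD for an element, forming it only on first request."""
    global index_lookups
    if element not in psd_cache:
        index_lookups += 1
        psd_cache[element] = INDEX.get(element, [])
    return psd_cache[element]


# 'details' is requested twice, e.g. once for 'information' and once
# for 'message', but its list is formed only once.
first = intermediate_list("details")
second = intermediate_list("details")
```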

The lists obtained for the elements of the correspondence group (PSD1 to PSDm, where m is the number of elements in the group) are merged into a common document list for that correspondence group. For example, the lists for the words 'information', 'details' and 'message', which belong to one group, are merged. This common document list (SD) for the correspondence group associated with a particular word of the search query is formed using the Boolean operator 'OR':

SD = 'PSD1' OR 'PSD2' OR ... OR 'PSDm'.

Thus, for a particular correspondence group, a list is formed of the documents in which at least one element of the group occurs. Suppose, for example, that the document list is formed for the word 'information'. The correspondence group obtained for this word contains the words 'information', 'details' and 'message'. The resulting list includes the documents that contain at least one of these words. For example, Document1 contains the words 'information' and 'details', Document2 contains the word 'details', and Document3 contains the word 'information'. The document list for the word 'information' then includes all three documents, each of them exactly once (even though Document1 contains two words satisfying the query).
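The merging of intermediate lists by the 'OR' operator amounts to a duplicate-free union, which can be sketched for the three-document example as follows. The names `merge_lists` and `PSD` are hypothetical, and the contents of the intermediate lists are taken from the example above.

```python
def merge_lists(intermediate_lists):
    """OR-merge: each document appears once, however many elements match."""
    merged = []
    for psd in intermediate_lists:
        for document in psd:
            if document not in merged:
                merged.append(document)
    return merged


# Intermediate lists for the elements of the group for 'information'.
PSD = {
    "information": ["Document1", "Document3"],
    "details": ["Document1", "Document2"],
    "message": [],
}
sd = merge_lists(PSD.values())
```

Document1 matches two elements of the group but is listed only once in `sd`.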

The document lists obtained for correspondence groups are stored, because identical correspondence groups may be obtained for several words of the query. For example, the words 'information' and 'message' may yield the same correspondence group: 'information', 'message', 'details' and 'data'. Forming the list for this group a second time is therefore unnecessary.

The algorithm described above is used to form the correspondence groups for each of the words in the search query.

Forming a document list for words joined in the query by the 'OR' operator

The inventions covered by the present application also share an algorithm for forming a document list for a set of words joined in the search query by the Boolean operator 'OR'.

The record format for documents in the final list includes the fields 'Methodology ID' and 'Document ID', plus a field that counts the correspondence groups found in the document, 'Number of correspondence groups'. When a list of fragments is being formed, the record format additionally includes the field 'Fragment ID'. That is, the record format for fragments in the final list can be represented as:

Methodology ID | Document ID | Fragment ID | Number of correspondence groups

If the fragment is a full copy of the document, the fragment ID may coincide with the document ID. The final list (FL) of documents for words joined in the query by the Boolean operator 'OR' is formed as follows.

At step 13, as shown in FIG. 5, the document lists formed for the correspondence groups associated with the words of the search query are taken as input. At step 14, the document list formed for the correspondence group associated with the first word of the search query is entered into the final list. The correspondence-group counter of every document entered into the FL is set to one. At step 15, it is determined whether all words of the search query have been processed. If not, at step 16 the correspondence group associated with the next word of the search query is selected and made current. Then, at step 17, the final list is updated using the rule:

FL = FL OR (document list of the current correspondence group)

That is, at step 17 the documents from the list obtained for the current correspondence group are entered into the final list. Step 17 also increments the correspondence-group counter for documents included in the FL. The operation works as follows. Suppose the query consists of two words:

the first is 'information' and the second is 'necessary'. A correspondence group is obtained for each word: the first group is 'information', 'details' and 'message'; the second group is 'necessary' and 'needed'. The archive holds four documents:

Document ID | Words in the text content of the document
ID Document1 | 'necessary', 'information', 'needed', 'details'
ID Document2 | 'needed', 'details'
ID Document3 | 'information'
ID Document4 | 'necessary'

Words from the correspondence group for the first word, 'information', occur in only three documents. The document list obtained for this group therefore includes Document1, Document2 and Document3, and the final list after step 14 is:

Document ID | Number of correspondence groups
ID Document1 | 1
ID Document2 | 1
ID Document3 | 1

Words from the correspondence group for the second word, 'necessary', likewise do not occur in every document, only in Document1, Document2 and Document4. After step 17 is performed for this group, the final list becomes:

Document ID | Number of correspondence groups
ID Document1 | 2
ID Document2 | 2
ID Document3 | 1
ID Document4 | 1

Because Document1 and Document2 contain elements of the correspondence groups associated with both the first and the second word of the search query, their correspondence-group counters become two at step 17. These documents are not entered into the final list a second time. Document4 is added to the final list as a new entry, with its counter set to one.

Steps 16 and 17 are repeated until step 15 determines that all words of the search query have been processed. Control then passes to step 18, at which the final list of documents for the set of words joined in the search query by the Boolean operator 'OR' is considered formed.
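Steps 13 to 18 can be sketched as a short merge loop. This is a minimal illustration, not the patented implementation: the function name `merge_or` and the use of a dictionary as the final list are assumptions, and the input data reproduce the four-document example above.

```python
def merge_or(document_lists):
    """Merge per-group document lists and count matching groups per document."""
    final_list = {}  # document ID -> number of correspondence groups
    for doc_list in document_lists:  # one list per query word (steps 14, 16, 17)
        # dict.fromkeys removes duplicates within one list while keeping order,
        # so a document is counted at most once per correspondence group.
        for doc_id in dict.fromkeys(doc_list):
            final_list[doc_id] = final_list.get(doc_id, 0) + 1
    return final_list  # step 18: the final list is considered formed


lists = [
    ["Document1", "Document2", "Document3"],  # group for 'information'
    ["Document1", "Document2", "Document4"],  # group for 'necessary'
]
print(merge_or(lists))
# {'Document1': 2, 'Document2': 2, 'Document3': 1, 'Document4': 1}
```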

Method of searching for fragments with similar text and/or semantic content in electronic documents stored on data storage devices

Summary of the invention

The method consists of indexing each document saved in the archive, dividing documents into fragments and forming subject groups of one or more fragments, defining search parameters, conducting the search, and ranking the resulting list of document fragments according to specified parameters.

The search parameters are defined as the set of unique information blocks that occur in the selected document fragment, and this set is extended by preprocessing each of those unique information blocks. A unique information block is an information block that occurs in the selected document fragment one or more times. Preprocessing is an operation that obtains, from at least one unique information block, one or more information blocks connected with that unique block by a specified relation.

An information block is a character string that occurs in a document and is delimited by certain characters. A document fragment is any selected sequence of information blocks in a document, including a full copy of the document.

A document is divided into fragments using at least one specified rule. The set of document fragments searched for fragments similar in text and/or semantic content to the selected one is limited by indicating at least one rule that was used to divide documents into fragments and/or by indicating a subject group. A subject group is a set of fragments united by a certain attribute. A section of a classifier, for example, can serve as a subject group.

Documents can be divided into fragments automatically. A document-processing software module is used to split document fragments into sets of information blocks.

To form a search query from the set of information blocks, at least one preprocessing operation is selected, and users set the order in which the preprocessing operations are performed. The invention also allows a single logical identity operation to be used as query preprocessing. After a preprocessing operation is performed, the original information block is either removed from the resulting query or kept in it.

The search for document fragments similar in text and/or semantic content to the selected one uses either the entire set of unique information blocks in the selected fragment or a specified number of them. Users set the number of unique information blocks used as search parameters and the way they are chosen. The set of unique information blocks used as search parameters is formed using one or more of the following functions:

- a function that forms the list of unique information blocks for the selected fragment,

- a function that counts the occurrences, in the text content of the fragments of the selected subject group, of the information blocks obtained by preprocessing a unique information block,

- a function that determines the frequency of occurrence of unique information blocks in the text content of the fragments of the selected subject group, where the frequency is expressed as a percentage of the number of occurrences of the unique information block that occurs most often in that text content.
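The frequency function in the last item can be illustrated with a short sketch. The function name `relative_frequencies` and the sample word list are hypothetical; the only thing taken from the text is the definition of frequency as a percentage of the count of the most frequent block.

```python
from collections import Counter


def relative_frequencies(subject_group_words):
    """Frequency of each block as a percentage of the most frequent block's count."""
    counts = Counter(subject_group_words)
    if not counts:
        return {}
    top = max(counts.values())  # occurrences of the most frequent block
    return {word: 100.0 * n / top for word, n in counts.items()}


# Hypothetical text content of one subject group, already split into blocks.
words = ["search", "document", "search", "fragment", "search", "document"]
freq = relative_frequencies(words)
print(freq["search"])  # 100.0
```

Here 'search' occurs three times and scores 100 percent, 'document' occurs twice and scores about 66.7 percent, and 'fragment' occurs once and scores about 33.3 percent.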

In addition, the set of unique information blocks used as search parameters is formed according to user-defined rules.

The search for document fragments matching the search parameters is carried out on local and/or remote data storage devices. A remote device may be any information resource or data-search system that operates in a computer network and returns, in response to a search query, a list of document fragments satisfying the search parameters.

The search results (lists of document fragments) are displayed and ranked according to specified parameters. The list of document fragments similar in text and/or semantic content to the selected one is ranked by the number of information-block groups present in each found fragment. Each such group combines the information blocks obtained by preprocessing one particular information block. Fragments with the same number of information-block groups are sorted further: for each fragment, the occurrences of all information-block groups belonging to that fragment are counted in the text content of the fragments of the selected subject group. The fragments in the result list are displayed with their textual differences from the selected fragment highlighted.
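The two-level ranking can be sketched as a sort on a compound key. The `Hit` record and its field names are hypothetical, and sorting the tie-break count in descending order is an assumption, since the text does not state the direction of the refining sort.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    fragment_id: str
    group_count: int  # information-block groups present in the fragment
    occurrences: int  # occurrences of those groups in the subject group's text


def rank(hits):
    """Order by group count, then by occurrence count (both descending, assumed)."""
    return sorted(hits, key=lambda h: (-h.group_count, -h.occurrences))


hits = [Hit("F1", 2, 5), Hit("F2", 3, 1), Hit("F3", 2, 9)]
print([h.fragment_id for h in rank(hits)])
# ['F2', 'F3', 'F1']
```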

If the archive already contains a document with a specified degree of similarity to a document being archived, the incoming document is saved as a new version of the document found in the archive. The new version is stored either as a full copy of the incoming document or as the differences between its text content and that of the document found in the archive. Fragments of archived documents are classified automatically.
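The versioning behaviour can be illustrated with Python's standard `difflib` module. This is a sketch under loud assumptions: the patent measures similarity through its own fragment search, whereas here `difflib.SequenceMatcher.ratio()` is swapped in as a stand-in measure, and the archive layout, the function name `archive_document` and the 0.8 threshold are all hypothetical.

```python
import difflib


def archive_document(archive, doc_id, text, threshold=0.8, as_diff=True):
    """Store text as a new document, or as a new version of a similar one."""
    for existing_id, record in archive.items():
        base = record["base"]
        # Stand-in similarity measure (assumption): difflib ratio in [0, 1].
        if difflib.SequenceMatcher(None, base, text).ratio() >= threshold:
            if as_diff:
                # Keep only the differences from the archived text.
                version = list(difflib.unified_diff(
                    base.splitlines(), text.splitlines(), lineterm=""))
            else:
                version = text  # keep a full copy instead
            record["versions"].append(version)
            return existing_id
    archive[doc_id] = {"base": text, "versions": []}
    return doc_id


archive = {}
first = archive_document(archive, "A", "The method indexes documents.\nIt ranks fragments.")
second = archive_document(archive, "B", "The method indexes documents.\nIt ranks all fragments.")
print(first, second, len(archive["A"]["versions"]))
```

The second call finds the archived document similar enough and records the new text as a version of 'A' rather than as a separate document.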

Description of the invention

The method provides for forming, on a data storage device, an archive of documents and their fragments in which every stored document fragment is indexed. A document fragment is any selected sequence of source words in a document and consists of at least one source word. A fragment may be, for example, a sentence, a paragraph, a clause or a section of the document. A full copy of the document is a special case of a fragment. Documents entered into the archive are indexed according to specified rules. That is, when a document is entered into the archive, the fragments into which it will be divided are determined and, in accordance with the chosen division methodology, the document is indexed for subsequent index search of document fragments. Regardless of whether the document is divided into fragments, its full copy is indexed by default using one methodology common to all documents. A document is therefore stored in the archive as at least one fragment, namely its full copy, and the fragments that are full copies of archived documents are all indexed using the same methodology.

The choice of methodology for dividing a document into fragments does not depend on the number of words in the word sequence that makes up a fragment. For example, in one case the document is divided into sentences, in another into paragraphs, and in a third the fragment is the whole document. A document can be divided into different kinds of fragments at the same time, for example into both paragraphs and sentences. Fragments may overlap. To define a fragment, the region of the document that constitutes the fragment is marked. FIG. 6 schematically shows a document divided into five different fragments. Fragment1 and Fragment2 overlap, and Fragment5 covers Fragment3, Fragment4 and part of Fragment2. A single methodology (Methodology1) was used to divide the document, so all of these fragments will be indexed using that one methodology.

FIG. 7 schematically shows a document divided into four fragments. Fragments 6-8 were defined using Methodology1, and Fragments 8-9 using Methodology2. In this case the fragments of one document will be indexed using different methodologies. One and the same fragment (Fragment8) can be defined and indexed using several methodologies. The format of a document fragment stored in the archive can be represented as follows:

Methodology ID | Document ID | Fragment ID

This format uniquely identifies any fragment stored in the archive when it is searched for using any methodology. The stored fragment format may also take another form; for example, it can be extended with additional fields.

For an index search of document fragments, the methodology to be used for the search is indicated. The search can use one or more methodologies. For example, a search for fragments similar in text and/or semantic content to Fragment8 using Methodology1 will involve Fragments 1-7. A search for fragments similar to Fragment8 using Methodology2 will involve Fragment9. A search for fragments similar to Fragment8 using both methodologies will involve Fragments 1-7 and 9. The invention provides for default methodology settings. By default, the search uses the methodology (or methodologies) with which the selected fragment was formed. Indicating a methodology makes it possible to limit the search scope. The search scope can also be limited by indicating one or more subject groups within which fragments similar in text and/or semantic content to the selected one are sought.
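The stored fragment format and the effect of indicating a methodology can be sketched as follows. The `FragmentKey` record mirrors the three fields named above; the document IDs `DocA` and `DocB` (standing for the documents of FIG. 6 and FIG. 7) and the function name `search_scope` are hypothetical.

```python
from typing import NamedTuple


class FragmentKey(NamedTuple):
    methodology_id: str
    document_id: str
    fragment_id: str


# Fragments of FIG. 6 (Methodology1) and FIG. 7 (Methodology1 and Methodology2).
INDEX = [FragmentKey("Methodology1", "DocA", f"Fragment{i}") for i in range(1, 6)]
INDEX += [FragmentKey("Methodology1", "DocB", f"Fragment{i}") for i in (6, 7, 8)]
INDEX += [FragmentKey("Methodology2", "DocB", f"Fragment{i}") for i in (8, 9)]


def search_scope(selected, methodologies, index=INDEX):
    """Fragments compared with the selected one under the given methodologies."""
    ids = {
        key.fragment_id
        for key in index
        if key.methodology_id in methodologies and key.fragment_id != selected
    }
    return sorted(ids, key=lambda s: int(s[len("Fragment"):]))


print(search_scope("Fragment8", {"Methodology2"}))
# ['Fragment9']
```

With `{"Methodology1"}` the scope is Fragment1 through Fragment7, and with both methodologies it is Fragment1 through Fragment7 plus Fragment9, matching the example in the text.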

Users determine the methodology for dividing documents into fragments. A mode is also provided in which parameters are set for dividing documents into fragments automatically, for example into paragraphs. For this purpose, rules are formed according to which documents will be divided into fragments. To divide a document into sentences, the rule is that a sequence of words in the text ends with the corresponding punctuation mark, that is, a period. To identify paragraphs, the rule may use, for example, the marker that moves the cursor to a new line. The invention provides for a mode in which the system trains itself for subsequent automatic division of documents into fragments according to specified rules. This mode is intended for documents whose format is an ordered sequence of information blocks. An example is the text of a patent application, which includes the mandatory sections 'Description of the invention', 'Claims', 'Abstract' and so on. By opening documents of this type (patent applications) several times and dividing them into fragments using a particular methodology, the user implicitly forms rules for the subsequent automatic division of all patent application texts. The system 'remembers' the division rules (for example, by treating section headings as markers), and patent applications are thereafter divided automatically without user intervention. The invention makes it possible to search for fragments similar in text and/or semantic content to one or more selected fragments at the same time.
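The two explicit division rules mentioned above, a period for sentences and a line break for paragraphs, can be sketched as follows. The rule registry `RULES` and the function names are hypothetical, and the sketch deliberately ignores complications such as abbreviations that contain periods.

```python
def split_sentences(text):
    """Rule: a sentence is a word sequence that ends with a period."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def split_paragraphs(text):
    """Rule: a paragraph ends at a new-line marker."""
    return [part.strip() for part in text.split("\n") if part.strip()]


# Hypothetical registry of division rules; 'full' keeps the whole document.
RULES = {
    "sentences": split_sentences,
    "paragraphs": split_paragraphs,
    "full": lambda text: [text],
}


def divide(text, rule_names):
    """Apply several rules at once; the resulting fragments may overlap."""
    return {name: RULES[name](text) for name in rule_names}


text = "First sentence. Second one.\nNew paragraph."
fragments = divide(text, ["sentences", "paragraphs", "full"])
print(fragments["sentences"])
# ['First sentence.', 'Second one.', 'New paragraph.']
```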

The invention provides for forming subject groups of one or more fragments. The user determines the set of fragments that makes up a particular subject group. A subject group is, for example, a set of fragments linked to a classifier or to a specific section of it. One and the same archived document (and hence its fragment) can be linked to several subject groups.

The method is implemented as follows. At step 19, as shown in FIG. 8, the text of the selected document fragment is split into source information blocks (source words). The selected fragment is the fragment for which document fragments similar to it in text and/or semantic content are sought. A software module for processing documents of a particular type is used to split the selected fragment into source words. From the resulting set of source words, a set of unique words characteristic of the selected fragment is formed. The set of unique words is an optimized set of source words: for example, if a word occurs several times in the selected fragment, it appears only once in the set of unique words.
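Step 19 can be sketched as follows. The tokenization rule (runs of word characters) and the lowercasing are assumptions; the description leaves both to the document-processing module.

```python
import re


def source_words(fragment):
    """Split a fragment into source words (assumed: lowercase, word characters only)."""
    return re.findall(r"\w+", fragment.lower())


def unique_words(fragment):
    """Keep each source word once, in order of first appearance."""
    return list(dict.fromkeys(source_words(fragment)))


words = unique_words("Search the archive, then search again")
# words == ["search", "the", "archive", "then", "again"]
```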

Next, at step 20, each unique word in the set obtained at step 19 for the selected fragment is preprocessed. In this way, step 20 forms a correspondence group for each unique word. Preprocessing the unique words extends the search query with various semantic analogues, which helps to find document fragments that are similar in text content not only formally but also in meaning.
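A minimal sketch of step 20, assuming that preprocessing is a lookup in a dictionary of related words. The `SYNONYMS` table is hypothetical sample data; a real system could equally apply morphological analysis or translation dictionaries.

```python
# Hypothetical relation table: word -> related words.
SYNONYMS = {"information": ["data", "message"]}


def correspondence_group(word, synonyms=SYNONYMS):
    """Return the word followed by the words related to it, without duplicates."""
    group = [word]
    for related in synonyms.get(word, []):
        if related not in group:
            group.append(related)
    return group


def build_groups(words, synonyms=SYNONYMS):
    """Form one correspondence group per unique word."""
    return {w: correspondence_group(w, synonyms) for w in words}


groups = build_groups(["information", "search"])
# groups == {"information": ["information", "data", "message"], "search": ["search"]}
```

A word with no related entries still forms a group containing only itself, so every unique word takes part in the query.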

At step 21, it is determined whether all the unique words obtained at step 19 will be used to search for document fragments similar in text and/or semantic content to the selected one. The list of unique words used for the search may include either all the unique words obtained at step 19 or a certain number of them. If it is determined at step 21 that only a certain number of the unique words will be used, the method proceeds to step 22, where it is determined whether user-defined rules will be used to determine the set of unique words. If so, the method proceeds to step 23, at which the list of unique words for the search is formed according to the rules set by the user. Such a rule may be, for example, to form the search set from those unique words obtained at step 19 whose number of characters does not exceed a number specified by the user. The user can also set more complex rules, for example selecting for the search those words whose frequency of occurrence in the selected fragment lies within a certain interval. In addition, the list of words for the search can be specified by manually selecting words from the list of unique words obtained at step 19.
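The two user-defined rules mentioned for step 23 can be sketched as simple filters. The function names are hypothetical, and the frequency rule is expressed here as a raw occurrence count within the fragment, which is an assumption.

```python
from collections import Counter


def by_max_length(unique_words, max_chars):
    """Keep unique words whose length does not exceed max_chars."""
    return [w for w in unique_words if len(w) <= max_chars]


def by_fragment_count(source_words, low, high):
    """Keep words whose number of occurrences in the fragment lies in [low, high]."""
    counts = Counter(source_words)
    return [w for w in counts if low <= counts[w] <= high]


short_words = by_max_length(["search", "information", "data"], 6)
# short_words == ["search", "data"]
repeated_twice = by_fragment_count(["a", "b", "a", "c", "a", "b"], 2, 2)
# repeated_twice == ["b"]
```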

If it is determined at step 22 that user-defined rules will not be used to form the set of unique words for the search, the method proceeds to step 24. At this step, one or more of the functions provided by the present invention are selected to determine the set of unique words that will be used for the search. The functions used to form the set of unique words for the search are:

- a function for forming a list of unique words for the selected fragment,

- a function for counting the number of occurrences of unique words in the text content of the fragments of the selected subject group,

- a function for determining the frequency of occurrence of unique words in the text content of the fragments of the selected subject group.

These functions are useful when a full copy of a document is used as the selected fragment. If a relatively small fragment is selected and similar fragments are searched for in the archive, using the entire set of unique words from the selected fragment in the query causes no problems. If, however, the query uses the entire set of unique words obtained for a full copy of a sufficiently large document, the final list of fragments is formed more slowly. The speed of forming the final list is even more critical when searching for fragments on remote devices. Therefore, in a number of cases it is much more convenient to form the set of unique words for the search using the following functions.

Function for forming a list of unique words for the selected fragment

This function (F1) forms the set of unique words characteristic of the selected fragment.

Function for counting the number of occurrences of unique words in the selected subject group

This function (F2) is implemented as follows. Given a unique word for which a correspondence group has been formed during preprocessing, the number of occurrences of each element of that correspondence group in the text content of the fragments of the selected subject group (or in the text content of all documents stored in the archive) is determined. The occurrence counts of all elements making up the group are then summed. For example, preprocessing the unique word 'information' yields a correspondence group that includes the elements 'information', 'data' and 'message'. The number of occurrences of each element in the text content of the selected subject group is then counted: 'information', ten; 'data', five; 'message', three. The function thus determines that the total number of occurrences in the selected subject group of all elements of the correspondence group obtained for the unique word 'information' is eighteen.
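The counting function can be sketched in a few lines using the figures from the example above. The name `group_occurrences` and the dictionary of precomputed counts are assumptions; how the per-word counts are obtained from the archive is not shown.

```python
def group_occurrences(group, subject_counts):
    """Sum the occurrences of every element of a correspondence group."""
    return sum(subject_counts.get(element, 0) for element in group)


# Occurrences of each element in the selected subject group (from the example).
subject_counts = {"information": 10, "data": 5, "message": 3}
total = group_occurrences(["information", "data", "message"], subject_counts)
# total == 18
```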

Function for determining the frequency of occurrence of unique words in the selected subject group

This function (F3) is implemented as follows. A unique word is taken, and its frequency of occurrence in the text content of the fragments of the selected subject group is determined. The frequency of occurrence is calculated as a percentage of the number of occurrences of the unique word used most often in the text content of the fragments of the selected subject group. For example, take the unique word 'information', whose frequency of occurrence in the text content of the fragments of the selected subject group is to be determined. It is found that the unique word 'search' occurs most often in that text content, with a total of 100 occurrences. This count is taken as 100%. The unique word 'information' occurs 30 times in the selected subject group, so its frequency of occurrence is 30%.
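The frequency calculation can be sketched as follows, using the figures from the example. The function name and the dictionary of counts are assumptions made for illustration.

```python
def relative_frequency(word, subject_counts):
    """Occurrences of `word` as a percentage of the most frequent word's occurrences."""
    peak = max(subject_counts.values(), default=0)
    if peak == 0:
        return 0.0
    return 100.0 * subject_counts.get(word, 0) / peak


subject_counts = {"search": 100, "information": 30}
freq = relative_frequency("information", subject_counts)
# freq == 30.0
```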

Combining the functions described above allows the user to create various rules (logics) for determining the set of unique words used as search parameters. Three possible logics are given here; each can be used depending on the purpose of the search.

First logic

This logic selects a set number of words from the list of unique words obtained at step 19, ordered by the total number of occurrences in the selected subject group (or the entire archive) of all the words obtained by preprocessing each unique word.

This logic uses the function for forming a list of unique words for the selected fragment and the function for counting the number of occurrences of unique words in the selected subject group.

For the selected fragment, the list of unique words is determined (F1), and for each of these unique words the number of occurrences in the selected subject group is counted (F2). The list is then sorted in ascending order of the number of occurrences of the unique words in the selected subject group, so that the first place in the list is taken by the unique word with the fewest occurrences (the rarest for the selected subject group). After that, the set of words for the search is formed from the ordered list according to user-defined rules. Such a rule specifies an interval for selecting words, and the interval can be arbitrary. If the list contains fifty unique words and twenty are to be selected, the following intervals can be set, for example: from the first to the twentieth word, from the eleventh to the thirtieth, from the thirty-first to the fiftieth, etc. When forming the set of unique words for the search using the first logic, a percentage can also be specified, for example selecting the 30% rarest words from the list.
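The first logic can be sketched as a sort followed by a slice. The helper names are hypothetical, positions are taken as 1-based and inclusive to match the wording above, and rounding the percentage down is an assumption.

```python
def order_by_rarity(words, occurrences):
    """Sort words in ascending order of their occurrences in the subject group."""
    return sorted(words, key=lambda w: occurrences.get(w, 0))


def take_interval(ordered, first, last):
    """Select the words at positions first..last (1-based, inclusive)."""
    return ordered[first - 1:last]


def take_rarest_percent(ordered, percent):
    """Select the given percentage of words from the rare end of the list."""
    count = len(ordered) * percent // 100
    return ordered[:count]


ordered = order_by_rarity(
    ["search", "information", "needed"],
    {"search": 5, "information": 18, "needed": 7},
)
# ordered == ["search", "needed", "information"]
```

With a list of fifty words, `take_interval(ordered, 11, 30)` returns twenty words and `take_rarest_percent(ordered, 30)` returns the fifteen rarest.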

The main purpose of this logic is to select for the search a set number of the unique words that are rarest or most frequently used in the selected subject group.

Second logic

This logic selects, from the set of unique words obtained for the selected fragment, the words that occur in the selected subject group (or the entire archive) with a given frequency. It uses the function for forming a list of unique words for the selected fragment and the function for determining the frequency of occurrence of unique words in the selected subject group.

For the selected fragment, the set of unique words is formed (F1), and for each of these unique words its frequency of occurrence in the selected subject group is determined (F3). Unique words are then selected from the whole set obtained for the selected fragment according to user-defined rules. The rules specify either an interval of the frequency of occurrence, according to which the list of unique words for the search is formed, or a number of words having the lowest or, conversely, the highest frequency of occurrence. For the example considered in the description of the frequency function, setting an interval of 25-30% selects the word 'information' for the search.
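A minimal sketch of the second logic with an inclusive frequency interval. The function name and the dictionary of counts are assumptions; the frequency is computed relative to the most frequent word, as in the description of the frequency function.

```python
def select_by_frequency(words, subject_counts, low, high):
    """Keep words whose frequency (percent of the peak count) lies in [low, high]."""
    peak = max(subject_counts.values(), default=0)
    selected = []
    for word in words:
        freq = 100.0 * subject_counts.get(word, 0) / peak if peak else 0.0
        if low <= freq <= high:
            selected.append(word)
    return selected


chosen = select_by_frequency(
    ["information", "search"], {"search": 100, "information": 30}, 25, 30
)
# chosen == ["information"]
```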

Unlike the first logic, using unique words within a given range of frequency of occurrence allows a more flexible approach to forming the search query. Specifying a number of words (the first logic), even of those occurring most rarely in the selected subject group, does not always produce an optimal query. For example, a set of the twenty rarest words may include words with frequencies of occurrence of 5% (the first word) and 30% (the twentieth word); that is, the gap between the frequencies of the two words is too large. The second logic eliminates this problem.

Third logic

This logic forms for the search a set of unique words that have the largest number of occurrences in the selected subject group and, at the same time, the smallest number of occurrences in the entire stored archive excluding the selected subject group. It uses the function for forming a list of unique words for the selected fragment and the function for determining the frequency of occurrence of unique words in the selected subject group.

A list of unique words is formed for the selected fragment (F1). For each of the unique words obtained, its frequency of occurrence in the text content of the fragments of the selected subject group is then determined (F3). Next, each unique word in the list is taken in turn and its frequency of occurrence in the entire archive excluding the selected subject group is determined (F3). Thus, two values are determined for each unique word: the frequency of occurrence in the selected subject group (Ch1) and the frequency of occurrence in the entire archive excluding the selected subject group (Ch2). The difference between Ch1 and Ch2 (P = Ch1 - Ch2) is then determined for each unique word, after which the list of unique words is sorted in descending order of P. From the resulting list, a set number of unique words with the highest P is selected for the query.

The advantage of this logic is that it allows a search by words that are key to the subject group of interest to the user. For example, suppose the selected subject group is the classifier section 'Investments'. Naturally, the word 'investments' and its word forms will occur quite often in the text content of the selected subject group and have, for example, Ch1 = 80%. The unique word 'can' will have the same value (Ch1 = 80%), but its use in the query is undesirable because it is a commonly used word. The frequency of occurrence of the word 'investments' in the entire archive excluding the selected subject group will be fairly low, for example Ch2 = 1%. The same cannot be said of the commonly used word 'can', whose Ch2 is, for example, 10%. Thus, P for the word 'investments' is 80 - 1 = 79, and P for the word 'can' is 80 - 10 = 70. The word 'investments' is included in the query as having the highest P. If the values of P are equal, a refining sort is carried out according to parameters defined by the user; for example, preference is given to the unique word with the higher Ch1 or to the word with the lower Ch2.

Thus, this logic makes it possible to exclude commonly used words from the query and to form the query from the set of unique words that most accurately characterize the text content of the selected subject group. This logic is useful when the archive stores a large number of documents linked to the corresponding subject groups. Its use is not limited to selecting a set number of unique words with the highest P: unique words whose P lies within a given range can also be selected for the search.
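The third logic can be sketched with the 'investments' and 'can' figures from the example. The function name is hypothetical, and the tie-break order (higher Ch1 first, then lower Ch2) is only one of the user-defined options the description allows.

```python
def rank_by_difference(frequencies):
    """Rank words by P = Ch1 - Ch2 in descending order.

    `frequencies` maps a word to (Ch1, Ch2): its frequency in the selected
    subject group and in the rest of the archive, both in percent.
    Ties on P are broken by higher Ch1, then lower Ch2 (an assumed choice).
    """
    scored = [(w, ch1 - ch2, ch1, ch2) for w, (ch1, ch2) in frequencies.items()]
    scored.sort(key=lambda item: (-item[1], -item[2], item[3]))
    return [(word, p) for word, p, _, _ in scored]


ranking = rank_by_difference({"can": (80, 10), "investments": (80, 1)})
# ranking == [("investments", 79), ("can", 70)]
```

Taking the first entries of `ranking` gives the set number of words with the highest P; filtering on the second element of each pair gives the range-based variant.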

It should be noted that the number of logics is not limited to the three described above, since the user can use other combinations of the functions to create new logics.

After steps 23 or 24 have been performed, or if it is determined at step 21 that the entire set of unique words will be used for the search, the method proceeds to step 25, at which the final list of unique words for the search query is formed. The method then proceeds to step 26, at which the final list of documents is formed for the set of words obtained at step 25. The list is formed using the algorithm for forming a list of documents (or their fragments) for a set of words combined in the search query by the Boolean operator 'OR'.
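Combining words with the Boolean operator 'OR' amounts to taking the union of the fragments that contain each word. The sketch below assumes an inverted index (word to set of fragment identifiers) built at indexing time; the index layout and the sample data are assumptions, not the algorithm of the description.

```python
def or_query(words, index):
    """Return the fragments that contain at least one of the query words."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result


# Hypothetical inverted index: word -> identifiers of fragments containing it.
index = {
    "information": {"Fragment1", "Fragment2"},
    "search": {"Fragment2", "Fragment3"},
}
found = or_query(["information", "search", "missing"], index)
# found == {"Fragment1", "Fragment2", "Fragment3"}
```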

Fragments matching the query parameters are searched for on both local and remote data storage devices. The invention allows document fragments matching the generated query to be searched for simultaneously on local and remote devices. Any information resource or system designed for data retrieval and operating in a computer network can be used as a remote device.

When searching for fragments on remote devices, an intermediate list of fragments obtained from the remote devices specified by the user is formed for each element of the correspondence group. The list of fragments obtained from remote devices can be either a list of full copies of documents (or links to them) or a list of their fragments. Full copies of documents obtained from remote devices are split into fragments according to formal rules, which can be set, for example, for documents obtained from specific devices. An example of splitting documents that are patent applications was described above. Documents of other types, for example newsletter e-mails, can be split into fragments in the same way. If documents obtained from remote devices cannot be split into fragments, the full copies of these documents are treated as fragments.

After the final list of fragments similar in text and/or semantic content to the selected one has been formed, it is ordered and displayed at step 26.

At step 26, the list is ordered according to the number of correspondence groups present in the fragments obtained. A particular correspondence group includes all the words obtained by preprocessing a unique word that occurs in the selected fragment and is included in the query. For example, let the unique word be 'информация' ('information'). A morphological RO and two DOs (a dictionary of synonyms and a Russian-English translation dictionary) are used for preprocessing. The correspondence group formed for the word 'информация' will then include, for example, the elements 'информация', 'информации' (forms of 'information'), 'сведения', 'сведений' (forms of 'data'), 'сообщение', 'сообщений' (forms of 'message'), 'information', etc. If a fragment contains any number of occurrences of one or more of these elements, one correspondence group is considered to be present in it. The list is thus sorted in descending order of the number of correspondence groups occurring in the fragments. The list includes four fragments:

Fragment ID | Words in the fragment | Correspondence groups in the fragment | Occurrences of the group in the topic
Fragment1 | 'necessary', 'needed', 'information', 'data', 'search', 'to search' | Group1 /necessary, needed/ | 7
          |  | Group2 /information, data/ | 10
          |  | Group3 /search, to search/ | 5
Fragment2 | 'information', 'data', 'search', 'to search' | Group2 /information, data/ | 10
          |  | Group3 /search, to search/ | 5
Fragment3 | 'search', 'to search', 'necessary', 'needed' | Group1 /necessary, needed/ | 7
          |  | Group2 /information, data/ | 10
Fragment4 | 'information' | Group2 /information, data/ | 10

Fragment1 is first in the list because it contains words from three correspondence groups. It is followed by two fragments with the same number of correspondence groups, Fragment2 and Fragment3 (two groups each). Fragment4, with one correspondence group, closes the list.

Next, a refining sort is applied to those fragments in the final list that have the same number of correspondence groups: Fragment2 (Group2, Group3) and Fragment3 (Group1, Group2). The refining sort determines, for each such fragment, the total number of occurrences of its correspondence groups in the text of the fragments of the selected topic. Priority is given to the fragment with the larger total. Since the total for Fragment2 is 15 (10 + 5) and for Fragment3 is 17 (7 + 10), the final list is: Fragment1 first, Fragment3 second, then Fragment2, and Fragment4 last.

Fragment ID | Number of groups in the fragment | Occurrences of the groups in the topic
Fragment1 | 3 | 22
Fragment3 | 2 | 17
Fragment2 | 2 | 15
Fragment4 | 1 | 10

After the refining sort, Fragment2 and Fragment3 have swapped places in the final list. If the totals of occurrences in the selected topic are equal, an additional sort can use, for example, the weights assigned to the correspondence groups. If all indicators are equal, priority may be assigned by the date the fragment was last accessed, and so on.
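The two-level ordering of this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `GROUP_TOPIC_COUNTS`, `FRAGMENT_GROUPS`, and `rank_fragments` are hypothetical, and the data are copied from the tables of the example.

```python
# Occurrences of each correspondence group in the selected topic (from the example).
GROUP_TOPIC_COUNTS = {"Group1": 7, "Group2": 10, "Group3": 5}

# Correspondence groups listed for each fragment in the example table.
FRAGMENT_GROUPS = {
    "Fragment1": {"Group1", "Group2", "Group3"},
    "Fragment2": {"Group2", "Group3"},
    "Fragment3": {"Group1", "Group2"},
    "Fragment4": {"Group2"},
}


def rank_fragments(fragment_groups, topic_counts):
    """Sort by number of groups, then by total topic occurrences, both descending."""
    def sort_key(item):
        _, groups = item
        total = sum(topic_counts[g] for g in groups)
        return (-len(groups), -total)

    return [frag_id for frag_id, _ in sorted(fragment_groups.items(), key=sort_key)]


ranking = rank_fragments(FRAGMENT_GROUPS, GROUP_TOPIC_COUNTS)
print(ranking)
```

Further tie-breakers mentioned in the text (group weights, date of last access) could be appended to the sort key in the same way.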

Then, at step 26, the final list of fragments is displayed to the user according to the display parameters that have been set. One such parameter is, for example, the minimum degree of similarity at which a fragment in the final list is displayed. The display parameters may also limit the number of fragments displayed; in that case the displayed list contains a fixed number of fragments with the highest degree of similarity to the selected one. The two can also be combined, so that both the degree of similarity and the limit on the number of displayed fragments apply at once.

The degree of similarity is determined as follows. The correspondence groups in the set of unique words defined as the search parameters are counted. Then, for each fragment in the final list, the ratio of the number of correspondence groups in the fragment to the number of correspondence groups in that set is computed. The resulting value can be expressed, for example, as a percentage. If the set of unique words defined as the search parameters contains ten correspondence groups and a particular fragment in the final list contains eight, the similarity coefficient of that fragment is 8/10 = 0.8, or 80%. If the display parameters specify a minimum of, say, 70%, fragments in the final list whose degree of similarity to the selected fragment is below 70% are not displayed.
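The ratio and the display parameters can be illustrated with a short Python sketch. The function names `similarity` and `select_for_display` and the group labels `G1` to `G10` are assumptions made for illustration; only the arithmetic (8/10 = 0.8) and the 70% floor come from the text.

```python
def similarity(fragment_groups, query_groups):
    """Share of the query's correspondence groups that occur in the fragment."""
    if not query_groups:
        return 0.0
    return len(fragment_groups & query_groups) / len(query_groups)


def select_for_display(scored, min_similarity=None, max_items=None):
    """Apply the display parameters: a similarity floor, a count limit, or both.

    `scored` is a list of (fragment_id, similarity) pairs.
    """
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_similarity is not None:
        ordered = [pair for pair in ordered if pair[1] >= min_similarity]
    if max_items is not None:
        ordered = ordered[:max_items]
    return ordered


query = {f"G{i}" for i in range(1, 11)}    # ten groups in the search parameters
fragment = {f"G{i}" for i in range(1, 9)}  # eight of them occur in the fragment
score = similarity(fragment, query)
shown = select_for_display([("a", score), ("b", 0.6), ("c", 0.9)], min_similarity=0.7)
```

Passing both `min_similarity` and `max_items` corresponds to the combined method described above.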

If the search for fragments similar to the selected one was carried out on remote devices, then, once the degree of similarity has been determined, the documents (fragments) that satisfy the display parameters are downloaded from the remote devices. The list of fragments received from the remote devices is then given a refining sort, based on a recount of the correspondence groups contained in each fragment. This operation matters because fragments received from remote devices may not match the search parameters if the remote devices processed the search query poorly. A fragment received from a remote device that does not match the search parameters at all is excluded from the final list. The remaining fragments are re-sorted by the actual number of correspondence groups they contain.
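A rough Python sketch of this local re-check follows. It assumes whitespace tokenization and a hypothetical `query_groups` mapping from a group identifier to the words of that group; neither detail is specified in the text.

```python
def recheck_remote(fragments, query_groups):
    """Recount groups locally, drop fragments with no match, and re-sort.

    `fragments` maps a fragment id to its text; `query_groups` maps a group id
    to the set of words belonging to that correspondence group.
    """
    recounted = {}
    for frag_id, text in fragments.items():
        words = set(text.lower().split())
        present = {gid for gid, members in query_groups.items() if words & members}
        if present:  # a fragment with no matching group is excluded
            recounted[frag_id] = present
    # Descending by the actual number of correspondence groups found.
    return sorted(recounted, key=lambda fid: len(recounted[fid]), reverse=True)


query_groups = {"info": {"information", "data"}, "search": {"search", "find"}}
remote = {"r1": "find the data", "r2": "weather report", "r3": "information desk"}
rechecked = recheck_remote(remote, query_groups)
```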

Next, at step 27, the fragments being entered into the archive are processed. Processing is manual or automatic, according to the parameters that have been set. Setting parameters for automatic processing means specifying the degrees of similarity at which the operations of deletion, saving to the archive, and automatic classification are applied to these fragments. For example, the user specifies intervals of the degree of similarity according to which fragments are processed.

Suppose, for example, that a search on remote devices finds documents with a 100% degree of similarity to a selected document stored in the archive. After the found documents have been downloaded from the remote devices, they are checked again for 100% similarity to the selected document. If the check confirms that the found and selected documents are indeed 100% identical, the received document is treated as a duplicate and deleted. If the similarity is incomplete (less than 100%), the received document is displayed to the user with all differences between its text and that of the selected document made visible, for example by highlighting, coloring, or underlining. When the received document is entered into the archive, it can be saved as a new version of the selected document. The new version is saved either as a full copy of the received document or as the differences between its text and the text of the selected (previously archived) document. Documents (fragments) with a lower degree of similarity to the selected one, for example between 70% and 90%, can simply be saved in the archive.
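The interval-based handling might look like the sketch below. The thresholds 0.9 and 0.7, the action names, and the `ignore` branch are assumptions chosen for illustration (the text gives 70 to 90% only as an example), and Python's standard `difflib` stands in for whatever difference format an implementation would use.

```python
import difflib


def handle_received(selected_text, received_text, similarity):
    """Decide what to do with a document downloaded from a remote device."""
    # The extra check: 100% similarity by groups must also be an exact text match.
    if similarity >= 1.0 and received_text == selected_text:
        return ("delete_duplicate", None)
    if similarity >= 0.9:  # assumed boundary for "new version"
        diff = list(difflib.unified_diff(
            selected_text.splitlines(), received_text.splitlines(),
            fromfile="selected", tofile="received", lineterm=""))
        return ("save_as_new_version", diff)
    if similarity >= 0.7:  # assumed boundary for "simply save"
        return ("store_in_archive", None)
    return ("ignore", None)  # assumed behavior below the lowest interval


action, diff = handle_received("line one\nline two", "line one\nline 2", 0.95)
```

The returned `diff` lines can either be shown to the user or stored in place of a full copy of the new version.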

At step 27 the invention provides for automatic classification of the fragments being entered into the document archive, implemented as follows. A set of unique words is formed for the received fragment. Each section of the classifier (topic) also has an associated set of unique words, which is formed and updated as new fragments are bound to the topic. To determine the topic to which the received fragment should be bound, the set of unique words of the fragment is compared with the sets of unique words of each topic. The analysis yields one or more topics to which the fragment can be bound. Binding is done either manually or automatically, according to the specified parameters. In automatic mode a degree of similarity is specified that determines whether binding the fragment to a given topic is appropriate. The degree of similarity in automatic classification need not be computed over all unique words of a topic; it may use, for example, only the rarest words or, conversely, only the words most frequently used in the topic's text.
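A minimal Python sketch of automatic classification is given below. The text does not define the comparison measure, so the share of the fragment's unique words found in a topic's word set is used here purely as an assumption; `classify`, `attach`, and the 0.5 threshold are likewise hypothetical.

```python
def classify(fragment_words, topic_words, threshold=0.5):
    """Return (topic, share) pairs for topics that overlap the fragment enough."""
    if not fragment_words:
        return []
    matches = []
    for topic, words in topic_words.items():
        share = len(fragment_words & words) / len(fragment_words)
        if share >= threshold:
            matches.append((topic, share))
    return sorted(matches, key=lambda match: match[1], reverse=True)


def attach(fragment_words, topic, topic_words):
    """Bind a fragment to a topic and update the topic's set of unique words."""
    topic_words.setdefault(topic, set()).update(fragment_words)


topics = {"business": {"company", "president", "office"}, "sport": {"match", "goal"}}
fragment = {"company", "president", "press"}
found = classify(fragment, topics)
attach(fragment, found[0][0], topics)
```

Restricting each topic's set to its rarest or most frequent words, as the text suggests, would only change how `topic_words` is built.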

The invention is illustrated by a simple implementation example.

As an example, consider the processing of information (news) that arrives by mailing list and is entered into the archive. If the user is subscribed to several mailing lists, the same news items often duplicate one another fully or partially. In addition, the mailing lists differ in how promptly they deliver news, so identical news items may arrive several days apart. As a result, the user has to review the same or identical information repeatedly.

In this example the invention makes it possible to avoid reviewing news repeatedly and to organize the process of filling the archive with information arriving by mailing list. All incoming information is entered into the archive; incoming documents are split into fragments and indexed. Documents from a particular mailing list are split into fragments automatically, according to rules set by the user. Setting the rules is not complicated, because in each mailing list a fragment (news item) is delimited by a specific sequence of characters, for example a particular heading. When these characters are detected in a document, the region of the document that forms the fragment containing the news item is marked automatically. In this example a single method is used to split documents into fragments (news items) and to index them. This means that by default the search for fragments similar in textual and/or semantic content to the selected one is carried out among all fragments stored in the archive.
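A delimiter-based splitting rule of this kind can be sketched in Python. The `NEWS:` marker and the function name `split_into_fragments` are hypothetical; a real rule would use whatever character sequence the particular mailing list places before each news item.

```python
import re


def split_into_fragments(document, delimiter_pattern):
    """Split a mailing into news items; each item starts at a delimiter match.

    Text before the first delimiter (for example a mailing header) is discarded.
    """
    starts = [m.start() for m in re.finditer(delimiter_pattern, document, flags=re.MULTILINE)]
    if not starts:
        return [document.strip()] if document.strip() else []
    bounds = starts + [len(document)]
    return [document[a:b].strip() for a, b in zip(bounds, bounds[1:])]


mailing = (
    "Weekly digest\n"
    "NEWS: Atlant opens office\nDetails here.\n"
    "NEWS: Markets rise\nMore text.\n"
)
fragments = split_into_fragments(mailing, r"^NEWS:")
```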

For each new fragment entered into the archive, the archive is searched for fragments similar to it in textual and/or semantic content. To this end each fragment is split into source words, from which a set of unique words is formed for the fragment. All stop words are filtered out, after which the list of unique words for the selected fragment is formed. For example, a news item contains the text:

'Московское представительство компании Атлант организовало для журналистов прессконференцию с участием президента компании' ("The Moscow representative office of the Atlant company organized a press conference for journalists with the participation of the company's president"). The list of unique words for the selected fragment will include the words 'Московское' (Moscow), 'представительство' (representative office), 'компании' (company), 'Атлант' (Atlant), 'организовало' (organized), 'журналистов' (journalists), 'прессконференцию' (press conference), 'участием' (participation), and 'президента' (president). The stop words 'для' (for) and 'с' (with) are excluded from the list of unique words.
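The construction of the unique-word list can be reproduced with a few lines of Python. The tokenization by `\w+` and the two-entry stop list are assumptions; the stop list contains only the two stop words named in the example, whereas a practical list would be much longer.

```python
import re

STOP_WORDS = {"для", "с"}  # only the two stop words named in the example


def unique_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` in first-seen order, without stop words or repeats."""
    seen, result = set(), []
    for word in re.findall(r"\w+", text):
        key = word.lower()
        if key in stop_words or key in seen:
            continue
        seen.add(key)
        result.append(word)
    return result


news = ("Московское представительство компании Атлант организовало "
        "для журналистов прессконференцию с участием президента компании")
words = unique_words(news)
```

The second occurrence of 'компании' is dropped as a repeat, which leaves the nine unique words listed above.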

Each of the unique words obtained is preprocessed with a morphological handler. This makes it possible to find news items that contain not only these unique words but also their word forms. That is, fragments containing text such as the following will be found: 'В Москве открылось представительство компании Атлант. В организованной пресс-конференции принял участие президент компании, ответивший на вопросы журналистов' ("A representative office of the Atlant company has opened in Moscow. The company's president took part in the press conference that was organized and answered journalists' questions"). Underlining marks the words that are word forms of the corresponding unique words of the selected fragment.

It is thus determined that the archive already contains a news item identical to the selected one. The degree of similarity is 100%, since the selected and found fragments share nine unique words (correspondence groups): 9/9 = 1, or 100%. According to the parameters set for this example, the selected fragment (news item) is entered into the archive and classified automatically. Automatic classification binds the selected fragment to the classifier section (topic) to which the similar fragment found in the archive is bound. All incoming news items are processed in the same way.
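The 9/9 result can be checked with a toy Python sketch. The hand-written table `FORM_TO_GROUP` is a hypothetical stand-in for the morphological handler (a real handler would derive word forms itself), and the group keys are arbitrary labels chosen for illustration.

```python
import re

# Hypothetical stand-in for a morphological handler: word form -> group key.
FORM_TO_GROUP = {
    "московское": "москва", "москве": "москва",
    "представительство": "представительство",
    "компании": "компания",
    "атлант": "атлант",
    "организовало": "организовать", "организованной": "организовать",
    "журналистов": "журналист",
    "прессконференцию": "пресс-конференция", "пресс-конференции": "пресс-конференция",
    "участием": "участие", "участие": "участие",
    "президента": "президент", "президент": "президент",
}


def groups_in(text):
    """Return the correspondence groups whose word forms occur in `text`."""
    tokens = re.findall(r"[\w-]+", text.lower())
    return {FORM_TO_GROUP[t] for t in tokens if t in FORM_TO_GROUP}


selected = groups_in("Московское представительство компании Атлант организовало "
                     "для журналистов прессконференцию с участием президента компании")
found = groups_in("В Москве открылось представительство компании Атлант. "
                  "В организованной пресс-конференции принял участие президент "
                  "компании, ответивший на вопросы журналистов")
degree = len(selected & found) / len(selected)
```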

Method for searching electronic documents stored on data storage devices

Summary of the invention

The method consists in indexing each document saved on a data storage device, defining search parameters, searching for documents on the data storage devices, and ranking the results (lists of documents) according to specified parameters. The search parameters comprise two or more information blocks, the interval between them, and the order in which they must follow one another within that interval in the document sought.

Search queries are extended by preprocessing one or more information blocks of the original query, and the search is carried out using the set of information blocks produced by the preprocessing. Preprocessing is an operation that derives, from at least one original information block, one or more information blocks connected with it by a specified relation. The search may use any specified number of information blocks from among those defined as search parameters when the original query was formed.

An information block is a character string that occurs in a document and is delimited by certain characters. A document processing software module is used to split documents into information blocks.

To extend the original query, at least one preprocessing operation is selected; users specify the order in which the preprocessing operations are performed. The invention also allows the logical identity operation alone to be used as query preprocessing. After a preprocessing operation has been performed, the original information block is either removed from the resulting query or kept in it.
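Query extension by a user-ordered chain of preprocessing operations can be sketched as follows. The toy dictionaries `SYNONYMS` and `FORMS` and all function names are assumptions for illustration; each operation is modeled as a function from one word to a collection of related words.

```python
SYNONYMS = {"search": ["find"]}
FORMS = {"search": ["searches", "searching"], "find": ["finds"]}


def synonyms(word):
    return SYNONYMS.get(word, [])


def word_forms(word):
    return FORMS.get(word, [])


def identity(word):
    return [word]


def expand_query(words, operations, keep_original=True):
    """Apply the operations in the given order to every query word.

    Returns a dict: original word -> set of words to search for.
    """
    expanded = {}
    for word in words:
        current = {word}
        for operation in operations:
            # Each step adds the words related to everything collected so far.
            current = current | {new for w in current for new in operation(w)}
        if not keep_original:
            current.discard(word)
        expanded[word] = current
    return expanded


full = expand_query(["search"], [synonyms, word_forms])
without_original = expand_query(["search"], [synonyms, word_forms], keep_original=False)
```

Because the operations are applied in sequence, running `synonyms` before `word_forms` also yields the word forms of the synonyms, which is why the order chosen by the user matters.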

The search for documents matching the search parameters is carried out on local and/or remote data storage devices. A remote device may be any information resource or data search system that operates in a computer network and returns, in response to a search query, a list of documents satisfying the search parameters.

The search results (lists of documents) are ordered by the number of occurrences in the documents of information block sequences that satisfy the search parameters, and by the length of the intervals containing those sequences. The weights assigned to the information blocks are also taken into account when ordering the list.

The found document fragments are displayed with the information block sequences that match the search parameters and the preprocessed query made visible. The invention allows navigation through the text of the documents using these sequences across the entire list of documents returned by the search.

Description of the invention

The method provides for forming, on a data storage device, a document archive in which all saved documents are indexed. A software module for processing documents of a particular type is used to split documents into source information blocks (source words). Using such type-specific modules to split documents into source words allows full-text search over documents of various formats that have text content. The invention provides for such type-specific document processing modules to be created by users.

The method is implemented as follows. At step 28, as shown in FIG. 9, an original query consisting of at least two source words is formed. Also at step 28, the search parameters are defined. These comprise:

- the maximum interval within which the source words must lie in the phrase sought in the document,

- the order in which the source words follow one another within the given interval. The order may be arbitrary or strictly fixed, depending on the user's preference,

- the minimum number of the query's source words for which the search operation will be performed. For example, the initial query consists of four words, but the search is for documents containing phrases with at least three of any of the words in the initial query. Accordingly, documents will be found that contain phrases with all four words as well as phrases with any combination of three source words,

- the words that must be present in the sought phrase in a document (required words). For example, for the query 'necessary information', the word 'information' is required. Thus, phrases that include either both words 'necessary' and 'information' or only the word 'information' will be treated as matches. Phrases that include only the word 'necessary' (an optional word) will not be treated as matches in this example,

- weighting coefficients for the words in the initial query. The weighting coefficients define the significance of the words and are later used to determine the relevance of the found documents to the initial query when the final list of found documents is formed and ordered.

At step 29, the source words that are subject to preprocessing are identified and preprocessed.

Preprocessing extends the search query with various analogues, which helps to find documents containing phrases that are relevant to the initial query not only formally but also semantically. Thus, at step 29 a correspondence group is formed for each of the source words. For example, preprocessing the word 'information' ('информация') with a synonym dictionary yields a correspondence group containing the words 'information' ('информация'), 'details' ('сведения') and 'message' ('сообщение'). A special case of a preprocessing operation is a single logical identity operation. In that case the group formed for the word 'information' consists of one identical word, i.e. the word 'information' itself. The identity operation is performed for those source words that were not identified at step 29 as subject to preprocessing.
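The formation of correspondence groups at step 29 can be illustrated with a minimal sketch. The `SYNONYMS` dictionary, the function names and the dictionary-based lookup are assumptions made for illustration only; the description does not specify how the synonym dictionary is stored or queried.

```python
# Hypothetical synonym dictionary (not specified in the description).
SYNONYMS = {"information": ["details", "message"]}


def correspondence_group(word, preprocess, synonyms=SYNONYMS):
    """Return the correspondence group for one source word.

    If the word is not subject to preprocessing, only the identity
    operation is applied and the group holds the word itself.
    """
    if not preprocess:
        return [word]
    return [word] + [s for s in synonyms.get(word, []) if s != word]


def build_groups(query_words, words_to_preprocess):
    """Form one correspondence group per source word of the query."""
    return {
        w: correspondence_group(w, w in words_to_preprocess)
        for w in query_words
    }


groups = build_groups(["necessary", "information"], {"information"})
print(groups)
```

Here 'necessary' is left to the identity operation, while 'information' is expanded through the assumed synonym dictionary.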

Next, at step 30, document lists are generated for the correspondence groups associated with the words in the search query, after which the method proceeds to step 31. At step 31, it is determined whether the query includes required words.

If the query includes required words, the method proceeds to step 32, at which the document list generated for the correspondence group associated with the first required word in the search query is entered into the final list of documents for required words (ISO). Next, at step 33, it is determined whether all required words in the search query have been processed. If not, then at step 34 the correspondence group associated with the next required word in the search query is accessed and made the current group. After that, at step 35, the final list of documents is formed using the rule: ISO = 'ISO' AND 'SD of the current correspondence group', where SD denotes the document list of a group.

That is, at step 35 documents from the list obtained for the particular correspondence group associated with a required word in the search query are entered into the ISO. Only those documents that are present in both the ISO and the SD (document list) of the current correspondence group are included in the resulting list.

Steps 34 and 35 are repeated until it is determined at step 33 that all required words in the search query have been processed. The final list of documents for required words is then complete, and for each document in the ISO the field that counts correspondence groups is set to the number of processed required words. The ISO formed at step 35 contains the documents in which all required words are present.
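Steps 32 to 35 amount to intersecting the document lists of the required words. The sketch below assumes that each document list (SD) is available as a set of document identifiers; the function name `build_iso` and the returned mapping from document to group count are illustrative choices, not part of the description.

```python
def build_iso(required_group_lists):
    """Form the ISO from one document list (SD) per required word.

    Returns a mapping: document id -> value of the field counting
    correspondence groups (the number of processed required words).
    """
    if not required_group_lists:
        return {}
    iso = set(required_group_lists[0])        # step 32: first required word
    for sd in required_group_lists[1:]:       # steps 33-34: next required word
        iso &= set(sd)                        # step 35: ISO = ISO AND SD
    count = len(required_group_lists)
    return {doc: count for doc in iso}


iso = build_iso([{"d1", "d2", "d3"}, {"d1", "d3"}])
print(iso)
```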

The method then proceeds to step 36, at which it is determined whether the query includes optional words. If there are no optional words in the query, then at step 37 the final list of documents (IS) is formed, consisting of the list of documents obtained for the required words (IS = ISO).

If the query contains optional words, then at step 38 a list of documents for optional words (ISN) is formed using the algorithm for generating a document list for words combined in the query by the 'OR' operator. After that, at step 39, the final list of documents (IS) is formed from the ISO and the ISN. The final list includes all documents from the ISO. For documents of the final list that are present in both the ISO and the ISN, the number of correspondence groups is counted. That is, the groups are counted for those documents that contain both required and optional words. Documents that contain only required words appear in the final list with the correspondence-group count they received when the ISO was formed.

When the number of correspondence groups is counted, the group-count field of the document in the ISO is added to the group-count field of the same document in the ISN. For example, the query consists of the phrase 'search for necessary information', and the two words 'search' and 'information' are required. Document1 contains both required words, so when the ISO is formed its correspondence-group count is two (the number of required words).

Besides the two required words, Document1 also contains the optional word 'necessary' and will therefore be included in the ISN as well. The correspondence-group count of Document1 in the ISN is one (one optional word).

Accordingly, at step 39, after the IS is formed, the correspondence-group count of Document1 is three (2 + 1). The correspondence-group counts of the documents in the IS are taken into account when the final list is ordered.
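The combination performed at step 39 can be sketched as follows. The data layout (mappings from document id to correspondence-group count) and the way the ISN counts are produced are assumptions for illustration; the 'OR' list-building algorithm itself is only referenced, not reproduced, in this part of the description.

```python
from collections import Counter


def build_isn(optional_group_lists):
    """Assumed OR-style list: count, per document, how many optional
    correspondence groups it matches."""
    counts = Counter()
    for sd in optional_group_lists:
        counts.update(set(sd))
    return dict(counts)


def merge_iso_isn(iso, isn):
    """Step 39: keep every ISO document and add its ISN group count, if any."""
    return {doc: count + isn.get(doc, 0) for doc, count in iso.items()}


# Document1 contains 'search', 'information' (required) and 'necessary' (optional).
iso = {"Document1": 2, "Document3": 2}
isn = build_isn([{"Document1", "Document2"}])
final_list = merge_iso_isn(iso, isn)
print(final_list)
```

Document2 contains only the optional word and so does not enter the final list, while Document3 keeps the count it received when the ISO was formed.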

If it is determined at step 31 that the query contains no required words, the method proceeds to step 40, at which the final list of documents for optional words (ISN) is formed for the optional words in the query (that is, all words in the query) using the algorithm for generating a document list for words combined in the query by the 'OR' operator. The final list is then formed, consisting of the list of documents obtained for the optional words (IS = ISN).

Once the final list of documents for the words in the initial query has been formed at step 37, 39 or 40, the method proceeds to step 41, at which the final list of documents is ordered, processed and displayed.

If the documents were searched for in the stored archive, the final list contains documents that match the search parameters and needs no additional processing, because the formation of a final list of documents obtained from the archive is accompanied by its optimization using additional indexes (checking of intervals, word order, etc.). If the documents were searched for on remote devices, the documents in the final list are downloaded from the remote devices and checked for compliance with the search parameters. Full copies of the documents found on remote devices are downloaded only once, after the final list of documents has been formed. Checking the documents received from remote devices is important, because such documents may fail to match the search parameters if the remote devices processed the search query poorly. If a document received from a remote device does not match the search parameters at all, it is excluded from the final list. That is, documents that do not contain a single phrase satisfying the search parameters are excluded from the final list.

For example, the search condition restricts the interval between the sought words of the initial query 'data management system' to no more than two (NEAR 2). Preprocessing of the query is taken into account, as a result of which the sought documents contain either the words 'data', 'management' and 'system' or the words 'data', 'control' and 'system'. Thus, the following documents are initially included in the final list: Document1, which contains the phrase '...communications system, thermal control board and data control', and Document2, which contains the phrase '... undisputed leader of data management system market'. Although the texts of both documents contain words from the query, including those added by preprocessing, the first document (Document1) does not satisfy the query parameters (an interval between information blocks of no more than two). Since the phrase '...communications system, thermal control board and data control' does not satisfy the conditions of the search query, the document containing it (Document1) is automatically removed from the final list. Compliance with the search parameters is checked in the same way when a strict word order is specified for the sought document. For example, if 'data, management, system' or 'data, control, system' is specified as the strict order, Document1 is excluded from the final list on this criterion as well, since the specified order of the sought words is violated in it.
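The interval check in this example can be sketched as below. The sketch assumes that the interval is the distance, in word positions, between the first and the last word of the narrowest sequence containing one word from each correspondence group, which is consistent with the intervals of five and two reported for the two documents. The tokenizer, the function names and the brute-force search over positions are illustrative choices, and the correspondence groups are assumed not to overlap.

```python
import re
from itertools import product


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_interval(text, groups, ordered=False):
    """Smallest interval covering one word from each group, or None.

    With ordered=True the words must appear in the order of the groups.
    """
    tokens = tokenize(text)
    positions = []
    for group in groups:
        pos = [i for i, t in enumerate(tokens) if t in group]
        if not pos:
            return None
        positions.append(pos)
    spans = [
        max(c) - min(c)
        for c in product(*positions)
        if not ordered or all(a < b for a, b in zip(c, c[1:]))
    ]
    return min(spans) if spans else None


def matches_near(text, groups, max_interval, ordered=False):
    span = min_interval(text, groups, ordered)
    return span is not None and span <= max_interval


groups = [{"data"}, {"management", "control"}, {"system"}]
doc1 = "communications system, thermal control board and data control"
doc2 = "undisputed leader of data management system market"
print(min_interval(doc1, groups), min_interval(doc2, groups))
print(matches_near(doc1, groups, 2), matches_near(doc2, groups, 2))
```

Under these assumptions only the second document passes NEAR 2, and the first one also fails when the word order is required to follow the query.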

At step 41, the final list of documents obtained by the phrase search is ordered by the relevance of the documents to the search query as extended by preprocessing. The ordering takes into account the number of sought sequences of processed words and the length of the intervals containing those sequences. It also takes into account the order of the words in the query and any additional parameters set by the user for finding documents that contain phrases with N or fewer words. The number of words is given by the correspondence-group count of each document. When the lists are ordered, priority goes to documents containing phrases with all N words, followed by the others in descending order (N-1, N-2, etc.). If the phrases in several documents contain the same number of words (for example, N), priority goes to the document in which the words occur within the smallest interval. For example, if the final list is formed with an interval of ten between words and an arbitrary word order within that interval, both documents from the example above (Document1 and Document2) satisfy the specified parameters. However, the interval between the words is five for Document1 and two for Document2. Document2 is therefore placed first in the final list. The interval in this case was calculated without excluding so-called stop words (prepositions, etc.). Nevertheless, the present invention allows stop words to be excluded from the sequences of sought words when the interval between them is determined.
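The two leading ordering criteria described here (more query words first, then the smaller interval) can be sketched as a sort key. The tuple layout and the third, lower-ranked document are assumptions made for the example.

```python
def rank(results):
    """Order (doc_id, word_count, min_interval) tuples:
    more matched words first, then the smaller interval."""
    return sorted(results, key=lambda r: (-r[1], r[2]))


results = [
    ("Document1", 3, 5),   # all three words, interval five
    ("Document2", 3, 2),   # all three words, interval two
    ("Document3", 2, 1),   # hypothetical document with only two words
]
ranked = [doc for doc, _, _ in rank(results)]
print(ranked)
```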

If the minimum interval between the sought words is also the same for several documents in the final list, the sequences of sought words in those documents are counted. The sequences with the minimum matching interval are counted first; if this count is equal, the sequences with the interval 'minimum matching interval + 1' are counted. If necessary, this operation is repeated until the interval reaches the maximum value set by the search parameters. That is, if the documents contain the same number of word sequences with an interval of two, the sequences with an interval of three are counted, and so on.
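This tie-breaking rule behaves like a lexicographic comparison of per-interval sequence counts. The sketch below assumes that, for each document, the number of matching sequences at every interval is already known; the mapping layout and the function name are hypothetical.

```python
def sequence_key(interval_counts, max_interval):
    """Sort key: more sequences at the smallest interval first,
    then at the next interval, up to the configured maximum."""
    return tuple(-interval_counts.get(i, 0) for i in range(1, max_interval + 1))


# interval -> number of sequences found with that interval (illustrative data)
docs = {
    "A": {2: 3, 3: 1},
    "B": {2: 3, 3: 4},
    "C": {3: 5},
}
order = sorted(docs, key=lambda d: sequence_key(docs[d], 4))
print(order)
```

Documents A and B tie at interval two, so the count at interval three decides between them; C has no sequence at interval two and follows both.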

The weighting coefficients assigned to the sought words are used as additional parameters for ordering the final list. In this case priority goes to the phrase with the highest total weight, calculated, for example, as the sum of the weighting coefficients of all words in the sought phrase. If all indicators are equal, the documents in the final list can be ordered according to the configured settings. For example, priority can be assigned by the date the documents were last accessed, and so on.

The final list of documents may include documents that do not contain all the words defined as search parameters. This happens when the initial query includes both required and optional words and the search parameters specify a minimum number of words that must be present in the sought phrases. In that case the final list may include documents that do not match the search parameters. For example, the initial query consists of seven words, three of which are defined as required in the sought phrase. The search parameters set the minimum number of words that must be present in the sought phrases to, for example, five. However, the final list of documents includes documents containing phrases with any number of words from three to seven (three required words and any number of optional words). That is, the final list includes documents containing phrases with three and four words, which does not match the query parameters. Such documents are excluded at step 41 from the final list of documents obtained by the phrase search.
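The exclusion described in this example reduces to a threshold on the number of query words found in each document. A minimal sketch, assuming the final list maps each document to that number (the document names are made up):

```python
def filter_by_min_words(final_list, min_words):
    """Drop documents whose phrases contain fewer query words than required."""
    return {doc: n for doc, n in final_list.items() if n >= min_words}


# Seven-word query, three required words, minimum of five words per phrase.
final_list = {"DocA": 3, "DocB": 4, "DocC": 5, "DocD": 7}
kept = filter_by_min_words(final_list, 5)
print(kept)
```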

The initial query may include several phrases joined by the Boolean operators 'OR', 'AND', etc. In this case the phrase-search algorithm described above is executed for each of the phrases in the initial query. The final document lists obtained for the individual phrases are then combined using the corresponding Boolean operators.
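Combining the per-phrase results can be sketched with set operations. The sketch assumes each phrase search yields a set of document identifiers and that the operators are applied left to right; operator precedence and grouping are not specified in this passage.

```python
def combine(left, right, op):
    """Combine two per-phrase document lists with a Boolean operator."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    raise ValueError(f"unsupported operator: {op}")


def combine_all(phrase_results, operators):
    """Apply the operators left to right over the per-phrase results."""
    result = set(phrase_results[0])
    for docs, op in zip(phrase_results[1:], operators):
        result = combine(result, set(docs), op)
    return result


combined = combine_all([{"d1", "d2"}, {"d2", "d3"}, {"d4"}], ["AND", "OR"])
print(sorted(combined))
```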

Next, at step 42, the documents from the final list that interest the user are displayed, with all sequences of sought words occurring in them visually marked. The marking takes the query parameters into account (the interval between words and their order). The search results may be marked by backlighting, underlining, color highlighting, etc. Marking the search results takes the user directly to the relevant place in the document. The invention also allows navigation through a document (from one marked sequence of sought words to the next), and this navigation extends across the entire final list of documents.

Consider an archive of documents obtained from heterogeneous information sources. The archive stores documents of various formats: web pages retrieved from the Internet via links returned by Internet search engines in response to search queries, Microsoft Office documents (Word, Excel) obtained from the hard drives of computers on the local network, and other documents containing textual information. All documents stored in the archive are indexed.

The user wants a list of the documents stored in the archive that contain data on information technology, and more specifically on conferences devoted to this subject. To obtain this list, the query 'конференция по информационным технологиям' ('conference on information technologies') is formed, taking into account the interval between words and the order of the words in the query (phrase search). The search parameters set the interval between words to four and allow an arbitrary word order in the sought documents. This interval is chosen because the sought phrase may read 'conference devoted to modern information technologies', and with a smaller interval a document containing such a phrase would not be found. All words in the initial query are defined as required. The query is converted into a query of the form:

'(конференция, информационным, технологии)/NEAR(4)',

i.e., the stop word 'по' ('on') is excluded from it. To make the search more effective, the formed query is preprocessed using two handlers: a synonym dictionary and a Russian-English translation dictionary. All words in the query are preprocessed. The sequence of preprocessing operations is set by the user: the semantic thesaurus is applied first, followed by the Russian-English translation dictionary.

As a result of the preprocessing operations, a correspondence group is formed for each word in the original query. In this case three correspondence groups are formed, for example: [конференция, семинар, conference, seminar]; [информационные, information]; [технологии, technologies]. Within each group the words are combined with the Boolean operator 'OR'. The final query has the following form:

((конференция OR семинар OR conference OR seminar), (информационные OR information), (технологии OR technologies))/NEAR(4)
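The expansion step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary contents, the names `STOP_WORDS`, `SYNONYMS`, `RU_EN`, `expand_query` and `format_query` are all hypothetical, and word forms are matched literally (no morphological normalization), so the output keeps the inflected forms of the input query.

```python
# Hypothetical handler dictionaries; the patent does not list their contents.
STOP_WORDS = {"по"}
SYNONYMS = {"конференция": ["семинар"]}
RU_EN = {
    "конференция": ["conference"],
    "семинар": ["seminar"],
    "информационным": ["information"],
    "технологиям": ["technologies"],
}


def expand_query(query, handlers):
    """Return one correspondence group (list of words) per non-stop query word."""
    groups = []
    for word in query.split():
        if word in STOP_WORDS:
            continue  # stop words are dropped from the query
        group = [word]
        # Handlers run in the user-defined order; each handler also sees
        # the words added by the handlers that ran before it.
        for handler in handlers:
            for term in list(group):
                for extra in handler.get(term, []):
                    if extra not in group:
                        group.append(extra)
        groups.append(group)
    return groups


def format_query(groups, interval):
    """Join each group with OR and attach the NEAR(interval) constraint."""
    body = ", ".join("(" + " OR ".join(group) + ")" for group in groups)
    return f"({body})/NEAR({interval})"


groups = expand_query("конференция по информационным технологиям", [SYNONYMS, RU_EN])
print(format_query(groups, 4))
```

Running the synonym handler before the translation handler is what lets 'семинар' pick up its English equivalent as well; with the order reversed, the first group in this sketch would lack 'seminar'.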

The final list includes documents containing at least one of the following phrases: 'конференция, посвященная современным информационным технологиям' ('conference devoted to modern information technologies', Document1), 'семинар, посвященный информационным технологиям' ('seminar devoted to information technologies', Document2), 'information technologies conference' (Document3), or 'information technologies seminar' (Document4), etc. The interval between the words of the phrase must not exceed four.
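A proximity check of this kind could look like the sketch below. It assumes that the "interval" is the position distance between the first and last matched word, which is inferred from the example's figures (4 for Document1, 2 for Document4) rather than stated in the text; the names `GROUPS`, `DOCS`, `tokenize`, `min_span` and `matches` are hypothetical, and word forms are matched literally.

```python
import re
from itertools import product

# Correspondence groups, using the word forms that occur in the sample phrases.
GROUPS = [
    {"конференция", "семинар", "conference", "seminar"},
    {"информационным", "information"},
    {"технологиям", "technologies"},
]

DOCS = {
    "Document1": "конференция, посвященная современным информационным технологиям",
    "Document2": "семинар, посвященный информационным технологиям",
    "Document3": "information technologies conference",
    "Document4": "information technologies seminar",
}


def tokenize(text):
    """Split text into lowercase word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def min_span(tokens, groups):
    """Smallest distance between the first and last matched position such that
    one word from every group is present, in any order; None if no match."""
    positions = []
    for group in groups:
        hits = [i for i, token in enumerate(tokens) if token in group]
        if not hits:
            return None  # every group is required
        positions.append(hits)
    best = None
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # one token cannot satisfy two groups
        span = max(combo) - min(combo)
        if best is None or span < best:
            best = span
    return best


def matches(text, groups, interval):
    """True if the text contains the groups within the given interval."""
    span = min_span(tokenize(text), groups)
    return span is not None and span <= interval


for name, text in DOCS.items():
    print(name, min_span(tokenize(text), GROUPS), matches(text, GROUPS, 4))
```

Under this assumption Document1 matches with an interval of four but not with three, which is consistent with the reason given for choosing four.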

The resulting list is ordered according to the search parameters and displayed to the user. A document containing the phrase 'information technologies seminar' (Document4) receives a higher rank (position in the list) than a document containing the phrase 'конференция, посвященная современным информационным технологиям' (Document1), because the interval between words in Document4 is smaller than in Document1 (2 and 4, respectively). For documents whose phrases have the same interval between words, Document3 and Document4, the number of phrases that match the search query and have an interval of two is counted. Document3 is found to contain one more such phrase and is therefore assigned a higher rank. The final displayed list is as follows: Document3 comes first, followed in order by Document4, Document2 and Document1.
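The ordering rule in this example (smaller interval first, then more matching phrases) can be expressed as a two-part sort key. The sketch below is illustrative only: the `Hit` type and `rank` function are hypothetical, the phrase counts are assumed values chosen so that Document3 has one more phrase than Document4, and the interval of 3 for Document2 is inferred from its sample phrase rather than stated in the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    interval: int      # smallest interval of a matching phrase in the document
    phrase_count: int  # number of matching phrases at that interval (assumed)


def rank(hits):
    """Order hits by ascending interval, breaking ties by descending phrase count."""
    return sorted(hits, key=lambda hit: (hit.interval, -hit.phrase_count))


hits = [
    Hit("Document1", 4, 1),
    Hit("Document2", 3, 1),  # interval inferred from the sample phrase
    Hit("Document3", 2, 2),  # assumed: one more matching phrase than Document4
    Hit("Document4", 2, 1),
]

ranked_names = [hit.name for hit in rank(hits)]
print(ranked_names)
```

With these inputs the sort reproduces the order given in the example: Document3, Document4, Document2, Document1.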

Next, the user opens a document of interest in order to view it. When the document is displayed, the search results corresponding to the query obtained after preprocessing are visualized. Thus, when Document2 is opened, for example, the entire sequence of words that satisfies the search parameters and the preprocessed query is highlighted:

'семинар посвященный информационным технологиям' ('seminar devoted to information technologies'), although the word 'семинар' ('seminar') was not part of the original query '(конференция, информационным, технологии)/NEAR(4)'.

The examples considered are only some of many possible embodiments of the inventions. They do not exclude other embodiments, and persons skilled in the art may modify them.

Other embodiments or variations of embodiments fall within the scope of the inventions, which is defined by the claims.

Industrial applicability

The invention relates to methods of searching for fragments similar in text and/or semantic content in electronic documents stored on data storage devices. The invention is also intended for searching for documents by queries that consist of two or more words and are expanded with semantic equivalents, taking into account the interval between said words and their order in the query.

The invention can operate in various communication and computer networks, for example local computer networks, the global Internet, etc. The invention can also be used in distributed computing systems where tasks are performed by remote computing devices connected by a communication network, including in application programs running on various data storage devices.

The industrial applicability of the invention follows from the optimization of the search for documents (fragments) on related subjects, including those that are similar not only formally but also semantically, and from the elimination of duplicate information stored in the archive. The effectiveness of search operations is determined by the ability to choose the method of forming the set of words for the search query using certain rules, including rules set by the user. The industrial applicability of the invention also follows from the optimization of information processing through automatic classification of the documents entered into the archive.

The invention is applicable to systems concerned with obtaining, searching, processing and storing information in computer systems, and it improves the efficiency of working with information stored on data storage devices.

Advantages of the invention

Compared with existing analogues, the present invention has a number of advantages that significantly reduce the time required to obtain and process the documents and document fragments of interest to the user. The invention makes it possible to search for documents (fragments) similar to the selected one in text and/or semantic content, using either the whole set of words occurring in the selected document (fragment) or a specified number of them. The efficiency of search operations is increased by the ability to choose the method of forming the set of words for the search query using certain rules, including rules set by the user.

The invention determines the degree of similarity between the selected document and the corresponding documents found in the archive or on remote devices, and it automatically classifies documents (fragments) when they are entered into the archive stored on the data storage device.

The invention also optimizes the search for documents by queries consisting of two or more words, taking into account the interval between the words and their order in the query, by expanding such queries through preprocessing of the words they contain. This makes it possible to form a list of documents that are relevant to the query not only formally but also semantically. When documents from the resulting list are displayed, the results corresponding to the search parameters and to the query obtained after preprocessing are visualized in them.

Claims

1. A method of searching for fragments similar in text and/or semantic content in electronic documents stored on data storage devices, which consists in indexing each document stored in the archive, dividing said documents into fragments and forming subject groups of one or more fragments, determining search parameters, conducting a search, and ranking the list of document fragments obtained as a result of the search, characterized in that a set of unique information blocks occurring in the selected document fragment is determined as the search parameters and is expanded by preprocessing each of said unique information blocks, where a unique information block is an information block that occurs one or more times in the selected document fragment, and where the preprocessing is an operation of obtaining, from at least one unique information block, one or more information blocks associated with said unique information block by a specified relation.

2. The method according to claim 1, characterized in that a sequence of characters occurring in the document and delimited by certain characters is used as the information block.

3. The method according to claim 1, characterized in that any selected sequence of information blocks occurring in the document is used as a document fragment.

4. The method according to claim 3, characterized in that a full copy of the document is used as a document fragment.

5. The method according to claim 3, characterized in that rules for dividing documents into fragments are established.

6. The method according to claim 5, characterized in that the document is divided into fragments using at least one rule.

7. The method according to claim 5, characterized in that the division of documents into fragments is carried out automatically.

8. The method according to claim 1, characterized in that the set of document fragments searched for fragments similar in text and/or semantic content to the selected one is limited by indicating at least one rule by which the documents were divided into fragments and/or by indicating a subject group.

9. The method according to claim 1, characterized in that the subject groups are formed by users.

10. The method according to claim 1, characterized in that a software module for processing documents is used to divide documents and document fragments into a set of information blocks.

11. The method according to claim 1, characterized in that at least one preprocessing operation is selected, wherein the sequence of preprocessing operations is set by users.

12. The method according to claim 11, characterized in that the preprocessing consists of a single logical identity operation.

13. The method according to claim 1, characterized in that after the preprocessing, at least one source information block is removed from the query or left in the query.

14. The method according to claim 1, characterized in that the set of unique information blocks defined as search parameters includes all unique information blocks occurring in the selected document fragment.

15. The method according to claim 1, characterized in that the set of unique information blocks defined as search parameters is formed using one or more functions.

16. The method according to claim 15, characterized in that a function of forming a list of unique information blocks for the selected fragment is used.

17. The method according to claim 15, characterized in that a function is used that determines the number of occurrences, in the text content of fragments of the selected subject group, of the information blocks obtained in the process of preprocessing a unique information block.

18. The method according to claim 15, characterized in that a function is used that determines the frequency of occurrence of unique information blocks in the text content of fragments of the selected subject group, where the frequency of occurrence is calculated as a percentage of the number of occurrences of the unique information block most often used in the text content of the selected subject group.

19. The method according to claim 1, characterized in that the set of unique information blocks defined as search parameters includes a specified number of unique information blocks from the set formed according to user-defined rules.

20. The method according to claim 1, characterized in that the search for fragments of documents is carried out on local and / or remote storage devices.

21. The method according to claim 20, characterized in that any information resource or system designed for data search, operating in a computer network and providing, in response to a search query, a list of document fragments that satisfy the search parameters, is used as the remote devices.

22. The method according to claim 1, characterized in that the ranking of the resulting list of document fragments is carried out according to the number of groups of information blocks present in the found document fragments, each group combining the information blocks obtained in the process of preprocessing a unique information block.

23. The method according to claim 22, characterized in that, when ranking, for those document fragments in which the number of groups of information blocks coincides, the number of occurrences of said groups of information blocks in the text content of fragments of the selected subject group is additionally determined.

24. The method according to claim 1, characterized in that the fragments included in the list of document fragments obtained as a result of the search are displayed with visualization of the differences between their text content and the text content of the selected fragment.

25. The method according to claim 1, characterized in that the archived document is saved as a new version of a document previously stored in the archive that has a given degree of similarity in text and / or semantic content with the archived document, and wherein said new version is saved in the form of a full copy of the document entered into the archive or in the form of differences between the text content of the archived document and the text content of the aforementioned document stored earlier in the archive.

26. The method according to claim 1, characterized in that the fragments of documents stored on the data storage device are classified automatically.

27. A method of searching for electronic documents stored on data storage devices, which consists in indexing each document stored in the archive, determining search parameters for electronic documents, including forming an initial query from two or more information blocks, indicating the maximum number of information blocks that may lie between said two or more information blocks in a searched document (the interval), as well as the order of said two or more information blocks in the searched document within the specified interval, and ranking the resulting list of documents, characterized in that the initial query formed from two or more information blocks is expanded by preprocessing one or more information blocks included in the initial query, where the preprocessing of the initial query is an operation of obtaining, from at least one information block included in the initial query, one or more information blocks associated with the source information block by a specified relation, and the search for documents is carried out using any specified number of information blocks defined as search parameters when forming the initial query.

28. The method according to claim 27, characterized in that a sequence of characters occurring in the document and delimited by certain characters is used as an information block.

29. The method according to claim 27, characterized in that a software module for processing documents is used to divide the documents into a set of information blocks.

30. The method according to claim 27, characterized in that at least one operation for preprocessing the initial query is selected, and the sequence of preprocessing operations is set by users.

31. The method according to claim 30, characterized in that the preprocessing of the initial query consists of a single logical identity operation.

32. The method according to claim 27, characterized in that after the preprocessing operation, at least one information block included in the initial query is removed from the query or left in the query.

33. The method according to claim 27, characterized in that the search for documents is carried out on local and/or remote data storage devices.

34. The method according to claim 33, characterized in that any information resource or system designed for data search, operating in a computer network and providing, in response to a search query, a list of documents that satisfy the search parameters, is used as the remote devices.

35. The method according to claim 27, characterized in that the resulting list of documents is sorted according to the number of sequences of information blocks occurring in a document that satisfy the search parameters and the length of the intervals in which said sequences are contained, taking into account the weighting factors assigned to the information blocks.

36. The method according to claim 35, characterized in that the documents are displayed with visualization of the sequences of information blocks that correspond to the search parameters and to the query obtained after preprocessing, and navigation through the text content of the documents is performed using the sequences of information blocks contained in them, within the limits of the entire list of documents obtained as a result of the search.

FIG. 1

FIG. 2