RU2305314C1 - Method for finding and selecting information in various databases

Info

Publication number: RU2305314C1
Application number: RU2006122998/09A
Authority: RU
Inventors: Алексей Владимирович Баранов (RU); Игорь Николаевич Брянцев (RU); Игорь Михайлович Жевлаков (RU); Олег Петрович Ковалев (RU); Надежда Ивановна Рязанкина (RU)
Original assignee: Общество с ограниченной ответственностью "Центр Компьютерного моделирования"
Priority date: 2006-06-28
Filing date: 2006-06-28
Publication date: 2007-08-27

Abstract

FIELD: technology for finding and identifying documents by their descriptions, where the documents reside in various databases and information resources with different document-creation standards.

SUBSTANCE: search requests formed by the user are sent to the search system of a server, which processes them by selecting documents from various databases. The search system combines all selected documents into a single list, sorts them by topic, and creates folders, each containing the documents of one topic; the sorted documents are then sorted again according to their final rating. Next, based on the user request, the search system determines the sections of the future report and the text markers for the beginning and end of each section; the text of the documents selected for the highest final rating is marked up; text segments are selected within each section and sorted by publication date; and the final report is prepared, in which the text segments, sorted by the publication date of the original document, are combined into a single text array. The final report is then sent to the user terminal over telecommunication means.

EFFECT: increased precision of searching and of analyzing the retrieved information.

4 dwg, 1 tbl

Description

The invention relates to means for searching for and identifying documents by their descriptions, where the documents reside in various databases and information resources with different document-formation standards. It can also be used to populate a database with fragments of initially unstructured texts.

Methods are known for identifying documents by their descriptions. They consist of converting natural-language texts in given fields of knowledge into signals suitable for machine processing, forming a query as a selection of keywords, and comparing the query's keywords with thesauri of the texts stored in a database (see RF patents No. 2107942 and No. 2167450, US patent No. 6460034, and the Yandex search database).

A disadvantage of the known methods is that they are limited to a single database with a known formation standard.

The closest analogue, adopted as the prototype, is the method for searching and retrieving information from databases described in patent RU 2236699. In that method: the user forms at least one search query at a workplace, which may be any personal computer with access to various databases; the query is transmitted to a search system; the search system processes the query by selecting documents from the databases; the search system sorts the selected documents by topic and forms folders, each containing the documents sorted under one topic; for each sorted document, features characterizing the document are extracted; within each folder, the search system determines the rating of each feature contained in each sorted document; the search system then determines the number of matches between the features of individual sorted documents in one folder and the features of documents contained in other folders; it determines the final rating of each sorted document, taking into account the number of feature matches and the weight coefficient of the database; and finally it sorts the sorted documents again by final rating and sends them to the user's workplace.

A disadvantage of the prototype is the absence of structural processing and of analysis of the retrieved documents by their significance to a given element of the query. Treating all selected objects and documents as equally valuable increases the volume of selected information and the information noise, which ultimately increases the intellectual effort the user must spend processing the selected information.

In addition, when working with many document repositories that use different document-formation standards, identifying objects becomes difficult.

The technical result of the claimed invention is expanded functionality, achieved by increasing search accuracy and by analyzing the retrieved information.

The technical result is achieved as follows. The method for searching and retrieving information from various databases includes the user forming at least one search query on a user computer, transmitting the query over telecommunication means to the search system of a server, and the search system processing the query by selecting documents from various databases. In addition, the search system: combines all selected documents into a single list; sorts the selected documents by topic; forms folders, each containing the documents sorted under one topic; determines relevance assessment parameters for each folder; extracts, for each sorted document in each folder, the features characterizing that document; determines, within each folder, the rating of each feature contained in each sorted document; determines the final rating of each selected document, taking into account the rating of each of the document's features and the folder's relevance assessment parameter; sorts the sorted documents again by final rating; stores the documents sorted by final rating in the memory of the search engine; selects the number of documents specified by the user in the query that have the highest final ratings; performs structural processing of the selected documents to prepare a final report; generates a graph showing how the number of documents sorted by final rating depends on the current time; and transmits the final report and the generated graph to the user terminal over telecommunication means.

When analyzing trends and monitoring a subject area on the basis of unstructured information (books, articles, reviews, etc.), the user encounters the following main problems:

1. Large volume of information. In traditional subject areas, the volume of publications runs to hundreds of thousands or even millions of items. A user cannot physically read that much material and therefore has to draw conclusions about the state of the subject area from a sample. No user can be fully certain that there are no significant materials refuting his point of view that he simply did not find.

2. Relevance of materials. Almost all search engines (Google, Yahoo, Yandex, etc.) return search results as a single list, in which the relevance (value) of the materials is determined by formal criteria (number of links, keyword density, traffic, etc.). In everyday activity, however, a user never applies such criteria to compare dissimilar things: a book is not compared with a person, a company with an event, or a patent with news. In some cases one kind of information matters to the user, for example an article; in other cases it may be another kind, such as conference proceedings or a patent.

3. Incompleteness of information. As a rule, the user deals with a set of heterogeneous information (texts): articles, books, conference proceedings, news, etc. The materials selected for work often relate to different periods in the development of the subject area and are written by different authors representing different organizations, with different interests, goals, objectives, levels of development, and competence. These texts also differ in structure, depth, terminology, and volume of data. The differences can be fundamental, and points of view can contradict one another even for a single author, whose position often changes from work to work over time.

The claimed method is aimed at solving the above problems.

The essence of the claimed invention is that, after the user forms a search query on the user computer, the query is transmitted over telecommunication means to the search system of the server, and the search system processes the query by selecting documents from various databases, the following stages are carried out.

Stage 1. Selection of information.

Initially, the user selects an arbitrary number of documents in various search engines (for example, Yandex, Yahoo, Google), in the order each search engine presents them according to its own rules. As noted above, this list of links contains materials that are heterogeneous in nature: articles, patents, companies, news, etc. Clearly, in real life we do not compare a person with a patent, or a book with a company. The value of a given material depends on various subjective factors and on the user's interests in the particular case.

The method proposes to:

Combine, by means of the search system, all the materials found into a single list, disregarding the order in which they appeared in the lists of the various search engines;

Sort the documents by topic by means of the search system, that is, collect materials of the same nature into separate folders (groups): all books in the "Books" folder, all patents in the "Patents" folder, information about specialists in the "Personalities" folder, and so on.

The number of folders is arbitrary, but each folder must contain homogeneous documents that exist as objects in real life.

An example of a set of folders:

Popular materials (introduction to the topic)

News

News sources

Events

Organizations

Personalities

Portals

Periodicals

Books

Reviews
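The merging and folder-sorting steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the document fields `url` and `kind`, the function names, and the removal of duplicates by URL are all assumptions made for the example.

```python
from collections import defaultdict


def merge_results(*result_lists):
    """Combine result lists from several search engines into one list.

    The position of a document in any engine's list is deliberately ignored.
    Dropping duplicates by URL is an assumption of this sketch.
    """
    merged = {}
    for results in result_lists:
        for doc in results:
            merged.setdefault(doc["url"], doc)
    return list(merged.values())


def sort_into_folders(documents):
    """Group documents of the same nature into folders keyed by the hypothetical 'kind' field."""
    folders = defaultdict(list)
    for doc in documents:
        folders[doc["kind"]].append(doc)
    return dict(folders)
```

How the nature of a document ("Books", "Patents", and so on) is recognized is not specified in the text, so the sketch takes it as given in the input.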

Define, for each folder, the parameters for assessing the relevance (importance) of a given material.

For example, for the "Books" folder the assessment parameters are: citation index, reader ratings in online stores, year of publication, the author's renown, language of the work, etc.

In the "Companies" folder, the parameters may be: the company's renown, specialization in the subject area, capitalization, region, country, etc.

The number of parameters a user may enter is unlimited. In practice, however, the number of important parameters rarely exceeds 7. It has been established empirically that in most practically significant cases the optimal number of parameters is 4.

Each parameter in each folder is assigned a degree of importance, called the weight of the parameter, which takes values from 0 to 1.

For each parameter in each folder, integer level values are set depending on the parameter's actual value.

For example, for the "Books" folder:

Parameter "Citation index":

10 or more - 5 points;

more than 2 but fewer than 10 - 4 points;

1-2 - 3 points;

0 - 2 points.

Parameter "Language of publication":

"Russian" - 5 points;

"English" - 4 points;

"German" - 3 points;

"Japanese" - 2 points;

"Arabic" - 1 point.

Parameter "Year of publication":

2004-2006 - 5 points;

2000-2003 - 4 points;

1995-1999 - 3 points;

1980-1995 - 2 points;

before 1980 - 1 point.

Each user can configure the system of parameters and assign weights according to his own preferences. The parameter level for each document in a folder depends directly on the actual values of the parameter for that document.
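The example level tables for the "Books" folder can be written as small lookup functions. This is a sketch under stated assumptions: the function names are hypothetical, the year 1995 (which falls in two of the listed ranges) is assigned to the higher band, years after 2006 are treated like 2004-2006, and languages not listed in the text are simply not covered.

```python
def citation_level(citations: int) -> int:
    """Integer level for the "Citation index" parameter."""
    if citations >= 10:
        return 5
    if citations > 2:
        return 4
    if citations >= 1:
        return 3
    return 2


# Levels for the "Language of publication" parameter, as listed in the text.
LANGUAGE_LEVELS = {"Russian": 5, "English": 4, "German": 3, "Japanese": 2, "Arabic": 1}


def year_level(year: int) -> int:
    """Integer level for the "Year of publication" parameter.

    Assumption: 1995 goes to the 1995-1999 band; years after 2006 get 5 points.
    """
    if year >= 2004:
        return 5
    if year >= 2000:
        return 4
    if year >= 1995:
        return 3
    if year >= 1980:
        return 2
    return 1
```

Because each user may reconfigure parameters and weights, a real system would presumably load such tables from user settings rather than hard-code them.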

Determine the rating of the i-th document in the j-th folder according to the formula, where:

m_j is the number of parameters in the j-th folder;

a_k^j is the weight of the k-th parameter in the j-th folder;

p_k^{ij} is the actually determined level of the k-th parameter for the i-th element in the j-th folder;

c_i^j is the number of groups in which the i-th document of the j-th folder is mentioned.

Next, in each folder only the first few documents with the highest relevance scores in that folder are selected.

The documents sorted by the search system according to the final rating are stored in the memory of the search engine.

The initial selection of documents, their sorting into folders (groups), the calculation of the relevance of each element, and the selection of a finite number of documents with the highest relevance scores complete stage 1.
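The rating and selection steps can be illustrated as follows. The rating formula itself does not appear in this text, so the form used here, the group count multiplied by the weighted sum of parameter levels, is purely an assumption chosen to use all four defined quantities; the function names and the `rating` field are likewise hypothetical.

```python
def rating(weights, levels, group_count):
    """Rating of one document in one folder.

    ASSUMED form (not confirmed by the text): c * sum(a_k * p_k) over the
    m parameters of the folder, where weights are a_k (0..1), levels are the
    integer p_k values, and group_count is c.
    """
    if len(weights) != len(levels):
        raise ValueError("one weight is required per parameter level")
    return group_count * sum(a * p for a, p in zip(weights, levels))


def top_documents(folder, n):
    """Keep only the n documents with the highest rating in a folder."""
    return sorted(folder, key=lambda doc: doc["rating"], reverse=True)[:n]
```

For instance, with weights 1.0 and 0.5, levels 5 and 4, and a document mentioned in 2 groups, this assumed form gives 2 * (5 + 2) = 14.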

Stage 2. Structural processing.

The documents sorted by the search system according to the final rating are stored in the memory of the search engine. To carry out stage 2, it is necessary to:

1. Define the sections of the future final report, for example Goals, Objectives, Forecasts, Current Results, and others.

2. Define the keywords characteristic of each section.

3. Define the text markers for the beginning and end of a section.

4. Mark up the text of the documents selected at stage 1 according to the sections. Extract text segments and transfer them to the database under the corresponding section.

5. Sort the segments within each section. Segments are sorted by the publication date of the original document. Segments that quote one another are placed consecutively, starting with the segment with the earlier date.

6. Replenish the database with new documents (carry out monitoring).

7. Generate a report on the problem.

At any time the user can generate a final report on the problem. When the report is generated, all segments of a section are combined into a single text array with references to the original sources.

The above method of selecting information and its structural processing can also be used to populate a database with fragments of initially unstructured texts.
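A simplified sketch of stage 2 is given below. It assigns already-extracted segments to report sections by keyword, sorts each section by publication date, and joins the segments into one text with source references. The `Segment` type, the sample keywords, and the function names are assumptions for illustration; detection of section start and end markers and the grouping of segments that quote one another are omitted.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Segment:
    text: str
    source: str      # reference to the original document
    published: date  # publication date of the original document


# Hypothetical keywords per report section (step 2 of the list above).
SECTION_KEYWORDS = {
    "Goals": ["goal", "aim"],
    "Forecasts": ["forecast", "expected"],
}


def assign_section(text):
    """Return the first section whose keyword occurs in the text, or None."""
    lowered = text.lower()
    for section, keywords in SECTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return section
    return None


def build_report(segments):
    """Group segments by section, sort each group by date, and join into one text."""
    sections = {name: [] for name in SECTION_KEYWORDS}
    for segment in segments:
        section = assign_section(segment.text)
        if section is not None:
            sections[section].append(segment)
    lines = []
    for name, group in sections.items():
        lines.append(name)
        for segment in sorted(group, key=lambda s: s.published):
            lines.append(f"{segment.text} [{segment.source}]")
    return "\n".join(lines)
```

Substring keyword matching is the simplest possible stand-in for text markup; a practical system would need the section boundary markers described in step 3.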

Stage 3. Trend analysis.

Most subject areas develop in accordance with the G-curve (Fig. 1). This curve was first published by the analytical corporation Gartner Group in 1995.

The G-curve shows how the number of information messages (documents) about a technology depends on time and on the technology's level of development. Information messages are any mentions of the technology in the mass media, literature, the Internet, and other information sources. The text of each mention is considered a document.
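The dependence of the number of documents on time can be tabulated with a simple count per period. The sketch below assumes that each document carries a publication year and that yearly granularity is sufficient; both are illustrative choices, as is the function name.

```python
from collections import Counter


def documents_per_year(publication_years):
    """Return (year, document count) pairs in chronological order."""
    counts = Counter(publication_years)
    return sorted(counts.items())
```

Plotting these pairs gives a raw view of publication activity over time, against which the sections of the curve can be compared.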

The curve is conventionally divided into five sections:

Section 1 - technology start (emergence of a new concept).

In this section the number of documents grows rapidly. This is explained by a significant increase in investment in the technology, growth in advertising, and growth in the number of participants in developing and promoting the technology.

Section 2 - peak of inflated expectations.

In this section the growth in the number of publications stops, the volume of advertising decreases, and the growth in the number of participants stops.

Section 3 - disillusionment.

The number of published documents falls, the volume of advertising decreases, and many participants leave the business. Investment leaves the subject area. Spending is directed to the search for qualitative changes in the technology.

Section 4 - slope of enlightenment.

Qualitative solutions to the problems have been found. Investment begins to grow. The volume of documents increases. The volume of advertising grows.

Section 5 - plateau of productivity.

A technology with optimal parameters has been developed. The number of documents grows steadily. Advertising budgets are optimized. The volume of investment grows.

Приступая к анализу новой для себя предметной области, пользователь сталкивается с тем, что не может изначально определить, какому участку G-кривой соответствует данный информационный материал (документ) (фиг.2).Starting to analyze a new subject area, the user is faced with the fact that he cannot initially determine which section of the G-curve corresponds to this information material (document) (figure 2).

Для решения этой задачи было бы хорошо сравнить информационные материалы, относящиеся к разным интервалам времени (фиг.3). Однако на практике реализовать эту задачу сложно, так как большинство информационных материалов имеют различную структуру и полноту описания. Например, в одних отчетах обсуждаются цели, задачи и перспективы развития, в других - проблемы и технические характеристики, в третьих рассматривается инвестиционная привлекательность проекта. Определить тренд развития по отдельно взятым материалам в финансовой, технической, научной областях, относящихся к различным временным промежуткам, очень сложно. Кроме того, в отобранных материалах численные значения каких-либо характеристик изучаемой технологии могут отсутствовать вообще.To solve this problem, it would be good to compare information materials related to different time intervals (figure 3). However, in practice it is difficult to realize this task, since most information materials have a different structure and completeness of description. For example, in some reports the goals, objectives and development prospects are discussed, in others - problems and technical characteristics, in the third they consider the investment attractiveness of the project. It is very difficult to determine the development trend using individual materials in the financial, technical, scientific fields related to different time periods. In addition, in the selected materials, the numerical values of any characteristics of the technology under study may be absent altogether.

Способ позволяет построить G-кривую, выявить тренд при наличии обрывочных данных вербального описания характеристик состояния развития технологии.The method allows you to build a G-curve, identify a trend in the presence of fragmentary data of a verbal description of the characteristics of the state of technology development.

To construct the G-curve, the following operations are performed:

1. In each section created at stage 2 of the work, the relevant characteristics are assessed. As noted above, all collected segments of the original documents are placed within a section in sequence, according to the publication date of the original document. To formalize the verbal characteristics of the technology, adjacent segments are compared pairwise.

If the qualitative characteristics of the subject area strengthen, +1 is entered in the table of processing results; if they weaken, -1 is entered. Once the table is filled in, the G-curve is constructed from its data (Fig. 4).

2. To simplify the formalization process, a table of rules is used:

No.   Formalization rule                                                              Value
1.    The number of items in the subsequent segment of the document has increased     +1
2.    The text of the documents is repeated                                           -1
3.    Event dates have appeared                                                       +1
4.    Numerical parameters are specified                                              +1
5.    The words "are postponed", "do not satisfy", "do not correspond" are present    -1
...   ...                                                                             ...
n     If A..., then B...

If the condition of a rule from the table holds with the opposite meaning, the numerical value assigned to the segment is reversed as well.
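The pairwise comparison and the rule table described above can be illustrated with a minimal sketch. It is not the implementation of the invention. The following are assumptions made only for illustration: the names Segment, score_pair and g_curve; the order in which the rules are checked; the value 0 when no rule fires; the use of English trigger words in place of the original wording of rule 5; the test "the text contains a digit" as a crude stand-in for rules 3 and 4; and the construction of the G-curve as the running sum of the table entries, which the description does not state explicitly.

```python
from dataclasses import dataclass
from datetime import date
from itertools import accumulate


@dataclass
class Segment:
    published: date   # publication date of the original document
    text: str         # text of the segment inside one report section
    item_count: int   # number of listed items (assumed to be counted beforehand)


# Hypothetical English equivalents of the trigger words in rule 5.
NEGATIVE_WORDS = ("postponed", "do not satisfy", "do not correspond")


def score_pair(prev: Segment, curr: Segment) -> int:
    """Score one pair of adjacent segments using a few rules from the table."""
    if curr.text == prev.text:                      # rule 2: text is repeated
        return -1
    lowered = curr.text.lower()
    if any(word in lowered for word in NEGATIVE_WORDS):  # rule 5
        return -1
    if curr.item_count > prev.item_count:           # rule 1
        return +1
    if curr.item_count < prev.item_count:           # rule 1 with opposite meaning
        return -1
    if any(ch.isdigit() for ch in curr.text):       # crude stand-in for rules 3 and 4
        return +1
    return 0                                        # assumption: no rule fired


def g_curve(segments: list[Segment]) -> tuple[list[int], list[int]]:
    """Return the table entries and their running sum (assumed G-curve points)."""
    ordered = sorted(segments, key=lambda s: s.published)
    scores = [score_pair(a, b) for a, b in zip(ordered, ordered[1:])]
    return scores, list(accumulate(scores))


if __name__ == "__main__":
    demo = [
        Segment(date(2001, 1, 1), "Goals are outlined.", 1),
        Segment(date(2002, 1, 1), "Goals are outlined.", 1),
        Segment(date(2003, 1, 1), "Goals, tasks and prospects are listed.", 3),
        Segment(date(2004, 1, 1), "Capacity reached 40 units.", 3),
        Segment(date(2005, 1, 1), "Deadlines are postponed.", 3),
    ]
    scores, curve = g_curve(demo)
    print(scores)  # table entries for each adjacent pair
    print(curve)   # running sum of the entries
```

In this sketch the segments are first ordered by publication date, as the description requires, and only then compared pairwise, so the resulting sequence follows the time axis of the section.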

The user computer is any personal computer, for example one made by IBM, consisting of a system unit to which a monitor, a keyboard, and a mouse-type pointing device are connected.

The user computer must have access to the databases, which may be either remote or local. Access to the databases can be provided by connecting the user computer, via telecommunication means, to the search system of a server on the global Internet or on a local network, for example an intranet.

The databases may be homogeneous, each containing documents on a single subject only (for example, a patent database), or heterogeneous, containing documents on various subjects (for example, Yandex).

The databases are stored in the memory of a computer or server, for example on a hard disk.

The search machine is an ordinary 32-bit machine (for example, Linux, Solaris, FreeBSD, Win32).

The search system used is, for example, the Fast search system, which implements well-known direct search logic. The Fast search system is developed and marketed by the Norwegian company Fast Search & Transfer ASA.

Use of the method also makes it possible to reduce machine search time, to increase the relevance of the retrieved documents to the query, and to reduce the intellectual effort spent on analyzing the retrieved documents.

Claims

A method for searching and retrieving information from various databases, including generating at least one search query by a user on a user computer, transmitting the user-generated query via telecommunication means to the search system of a server, and processing of the user-generated search queries by the search system by selecting documents from various databases, wherein the search system combines all the selected documents into a single list, sorts the selected documents by topic, forms folders, each of which contains the documents sorted under one topic, determines relevance assessment parameters for each folder, identifies, for each sorted document of each folder, the features characterizing that document, determines within each folder the rating of each feature contained in each sorted document, determines the final rating of each selected document taking into account the rating of each feature of the document and the relevance assessment parameter of the folder, sorts the sorted documents again taking into account the final rating, stores the documents sorted according to the final rating in the memory of the search system, and selects the number of documents specified by the user in the query that have the highest final rating, characterized in that structural processing of the selected number of documents is performed, comprising steps in which the user specifies in the search query the sections of the future final report and the keywords specific to each section, the search system is used to determine the textual markers of the beginning and end of the sections, the text of the documents selected as having the highest final rating is marked up, text segments are selected within each section, the segments are sorted by publication date, and a final report is prepared in which the text segments, sorted by the publication date of the original document, are combined into a single text array, after which the final report is transmitted to the user terminal via the telecommunication means.