RU2007141666A - METHOD FOR COLLECTING, PROCESSING, AND CATALOGIZING TARGET INFORMATION FROM UNSTRUCTURED SOURCES

Info

Publication number: RU2007141666A
Application number: RU2007141666/09A
Authority: RU
Inventors: Николай Игоревич Докучаев (RU); Антон Валентинович Новиков (RU); Сергей Николаевич Ряжских (RU)
Original assignee: Николай Игоревич Докучаев (RU); Антон Валентинович Новиков (RU); Сергей Николаевич Ряжских (RU)
Priority date: 2007-11-13
Filing date: 2007-11-13
Publication date: 2009-05-20

Abstract

1. A method for collecting, processing and cataloging target information from unstructured sources, in which: clients formulate a task of searching information networks for, and selecting, information that matches their request; the client is identified by registering on the website of the company that collects and analyzes such information; the client is offered a topic or a list of topics that are defined and configured in advance by experts; a database of control information features to be detected in the information flow is formed in advance; an information flow is received, i.e. electronic documents selected from information resources; the electronic documents from the information flow are processed sequentially; a list of elements and a list of words are extracted from each electronic document received for processing, using lexical analysis of the text, which provides preliminary normalization of the processed documents; information features are extracted according to established rules and compared with the control information features from a database containing all the reference information, including all morphological and semantic characteristics of phrases as well as synonyms and thematically related words; based on the comparison, the presence or absence of the features to be detected is recorded for each electronic document received for processing; on the basis of this analysis, a decision is made on further processing of the electronic documents; and these documents are processed using detailed m…
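The pre-screening step described above (lexical normalization, then comparison against a base of control features) can be illustrated with a minimal sketch. The feature names, word lists, and function names below are hypothetical; the record does not specify a data format, and real morphological analysis is replaced here by simple lowercasing and tokenization.

```python
import re

# Hypothetical control features: each name maps to a set of normalized word
# forms (synonyms and thematically related words). Illustrative data only.
CONTROL_FEATURES = {
    "banking": {"bank", "banks", "loan", "loans", "credit"},
    "telecom": {"operator", "operators", "tariff", "tariffs"},
}


def tokenize(text):
    """Stand-in for lexical analysis: lowercase the text and extract words."""
    return re.findall(r"\w+", text.lower())


def detect_features(text, control=CONTROL_FEATURES):
    """Return the names of control features with at least one word in the text."""
    words = set(tokenize(text))
    return {name for name, forms in control.items() if words & forms}


def needs_detailed_analysis(text):
    """A document proceeds to detailed analysis only if some feature is present."""
    return bool(detect_features(text))
```

For example, `detect_features("The Bank raised loan rates.")` yields `{"banking"}`, so that document would be passed on to the detailed analysis stage, while a document with no matching words would be dropped.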

Claims

1. A method for collecting, processing and cataloging target information from unstructured sources, in which: clients formulate a task of searching information networks for, and selecting, information that matches their request; the client is identified by registering on the website of the company that collects and analyzes such information; the client is offered a topic or a list of topics that are defined and configured in advance by experts; a database of control information features to be detected in the information flow is formed in advance; an information flow is received, i.e. electronic documents selected from information resources; the electronic documents from the information flow are processed sequentially; a list of elements and a list of words are extracted from each electronic document received for processing, using lexical analysis of the text, which provides preliminary normalization of the processed documents; information features are extracted according to established rules and compared with the control information features from a database containing all the reference information, including all morphological and semantic characteristics of phrases as well as synonyms and thematically related words; based on the comparison, the presence or absence of the features to be detected is recorded for each electronic document received for processing; on the basis of this analysis, a decision is made on further processing of the electronic documents; these documents are processed using detailed morphological, syntactic and semantic analysis; based on the data obtained in processing these documents, it is determined whether the information in them belongs to one or another predefined topic; and statistical and analytical reports are produced from the statistical information obtained while processing the electronic documents,

characterized in that, to determine whether the electronic document under study, or only parts of it, belongs to certain topics: a hierarchically built class tree is used, in which the detection of a lower-level class implies the detection of the upper-level classes located above the found class; the order in which classes are calculated is determined by priorities, the choice and assignment of which depend on the entities the classes use to describe the topic; class intersections are determined, an intersection being the simultaneous presence of two or more base classes in one linguistic zone; when each class is calculated, a nesting depth is set, determined by the kinship setting for that class; the phrases that define the classes, the weight coefficients and the zones of influence of the classes are established; classes that can be used to define several topics are defined; killer phrases are defined which, for the classes in whose settings they are specified, remove the statistical information on the area occupied by those phrases from further processing of the classes' zones of influence and from the corresponding calculation of their areas; killer classes are defined which, when located in the zone of influence of classes with priority "0", "1" and "2", remove from further processing the statistical information on the areas occupied by those classes; based on the information about occupied areas that remains after checking the classes obtained during processing of the document, a decision is made whether to assign one or another part of the processed document to a topic, and in what volume; to determine the volume, the total area and/or the relative area that the classes occupy in the processed document is determined; if the area, or the percentage relative area, of the part of the document is greater than or equal to the value set for a particular topic in its settings, the document is assigned to that topic; otherwise the topic is considered to have been mentioned in the document by chance or too little, and the document is not assigned to the topic; when calculating class areas, the fact that a set of characters constitutes an element in the zone of influence of the class is also taken into account; and when calculating the areas occupied by phrases defined by the name of a participant or by its trademarks, the "Brand Index" indicator is used.
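The area-based topic decision in claim 1 can be sketched as follows. This is only an illustration under several assumptions that the claim does not state: the zone of influence is taken to be a sentence, area is measured in characters multiplied by the phrase weight, and a killer phrase voids the whole sentence it occurs in. Priorities, class intersections, nesting depth and killer classes are left out. All names (`TopicClass`, `class_area`, `assign_topic`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TopicClass:
    name: str
    phrases: dict                      # phrase -> weight coefficient
    killer_phrases: tuple = ()         # phrases that void a zone of influence
    parent: "TopicClass | None" = None


def class_area(cls, sentences):
    """Weighted character area of cls, skipping sentences with a killer phrase."""
    area = 0.0
    for sentence in sentences:
        s = sentence.lower()
        if any(k in s for k in cls.killer_phrases):
            continue  # assumed rule: a killer phrase voids this sentence
        for phrase, weight in cls.phrases.items():
            area += s.count(phrase) * len(phrase) * weight
    return area


def detected_with_ancestors(cls):
    """Finding a lower-level class implies every class above it in the tree."""
    found = []
    while cls is not None:
        found.append(cls.name)
        cls = cls.parent
    return found


def assign_topic(cls, text, min_relative_area_pct):
    """Return the detected classes if the relative area meets the threshold."""
    sentences = [s for s in text.split(".") if s.strip()]
    total = sum(len(s) for s in sentences)
    if total == 0:
        return []
    share = 100.0 * class_area(cls, sentences) / total
    return detected_with_ancestors(cls) if share >= min_relative_area_pct else []
```

With a `banking` class whose parent is `finance`, whose phrase is `"bank loan"` and whose killer phrase is `"river bank"`, a sentence mentioning a river bank contributes no area, and the document is assigned to both `banking` and `finance` only when the remaining share of the text reaches the configured threshold.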

2. The method for collecting, processing and cataloging target information from unstructured sources according to claim 1, characterized in that, to determine the visibility of a market participant (an individual, a legal entity or a brand) relative to other market participants in information networks and in print publications over given periods of time, the "Index of Visibility" indicator is used, calculated according to the following formula:

where $i$ is the serial number of the market participant;

$IF_i$ is the percentage of materials found for the $i$-th market participant over the selected period of time:

$$IF_i = \frac{N_i}{T} \cdot 100\%$$

where $N_i$ is the number of materials in which the $i$-th market participant occurs over the selected period of time, and

$T$ is the total number of materials in which at least one market participant occurs;

$Ar_i$ is the percentage of the total area given to the $i$-th participant in the publications selected for the chosen period of time:

$$Ar_i = \frac{\sum_k S_k}{\sum_j S_j} \cdot 100\%$$

where $S_k$ is the area given to the $i$-th market participant in the $k$-th publication, and

$S_j$ is the area given to all market participants encountered in the $j$-th publication;

$NE_i$ is the percentage of publications in which the $i$-th market participant was found over the selected period of time:

$$NE_i = \frac{LN_i}{TT} \cdot 100\%$$

where $TT$ is the total number of publications in which at least one market participant occurs, and

$LN_i$ is the number of publications in which the $i$-th market participant occurs over the selected period of time.
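The three component percentages defined above can be computed as in the sketch below. It rests on assumptions: each percentage is read as a plain ratio of the two quantities named in its definition, a "material" is modeled as one article, and a "publication" as the outlet it appeared in. How the components combine into the overall Index of Visibility is not given in this text, so the sketch stops at the components. The function names and the input layout are hypothetical.

```python
def pct(part, whole):
    """Percentage of part in whole; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def visibility_components(materials, participant):
    """Compute (IF_i, Ar_i, NE_i) for one market participant.

    materials: list of (publication_name, {participant: area}) pairs,
    one pair per material. Materials mentioning no participant are ignored.
    """
    relevant = [(pub, areas) for pub, areas in materials if areas]
    mine = [(pub, areas) for pub, areas in relevant if participant in areas]

    # IF_i: share of materials that mention the participant (N_i / T).
    if_i = pct(len(mine), len(relevant))
    # Ar_i: share of area given to the participant (sum S_k / sum S_j).
    ar_i = pct(sum(areas[participant] for _, areas in mine),
               sum(sum(areas.values()) for _, areas in relevant))
    # NE_i: share of publications that mention the participant (LN_i / TT).
    ne_i = pct(len({pub for pub, _ in mine}),
               len({pub for pub, _ in relevant}))
    return if_i, ar_i, ne_i
```

Given four materials, for instance two in one outlet and two in another with one of them mentioning no participant, the function ignores the empty material when counting $T$ and $TT$, which matches the "at least one market participant" wording in the definitions.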