RU2546555C1 - Method of automated classification of formalised documents in electronic document circulation system

Info

Publication number: RU2546555C1
Application number: RU2013155168/08A
Authority: RU
Inventors: Сергей Владимирович Носенко; Игорь Дмитриевич Королев; Максим Игоревич Поддубный
Priority date: 2013-12-11
Filing date: 2013-12-11
Publication date: 2015-04-10

Abstract

FIELD: information technologies.

SUBSTANCE: in the method of automated classification of formalised documents in an electronic document circulation system, the characteristics of identical text sections (details) of a formalised document are identified and the identified details are analysed. The informative part of the document is converted into natural-language text, the words of the document are reduced to basic word forms, insignificant words are deleted, and word weights are calculated according to their frequency of occurrence, forming predicates for identifying text features. From a presented set of manually classified texts, a system of predicates for identifying text features is generated and saved in a database. The weights of significant word forms are substituted into the system of predicates. If a priori information on the dependences between information areas has to be used, the algebra of finite predicates is applied, which makes it possible to perform operations on the logical expressions that describe the information areas.

EFFECT: reduced system operating time, achieved by classifying documents by form and identified metadata and by analysing only the informative part of the document.

1 dwg

Description

The claimed invention relates to document classification systems and can be used in electronic document management systems, databases, and electronic repositories (electronic archives) where formalized documents arriving from external automated systems need to be classified by thematic features and by document type (structure). It allows the information areas to which an electronic document belongs to be specified a priori, taking into account the possible interrelations between those areas.

A known analogue is a method for automatic classification of documents (Li Y., Jain A. "Classification of text documents", The Computer Journal 41, 8, pp.537-546, 1998). In this method a document is converted from its storage format into natural-language text, the words of the document are reduced to basic word forms, insignificant words are discarded, and word weights are calculated according to their frequencies of occurrence. At the training stage, a set of classification features is formed from a presented set of manually classified documents. When a document is classified, it is preprocessed in the same way, and its membership in an information area is determined from a simple (naive) Bayesian classification criterion and the classification features. Note that this method is intended for processing machine-readable natural-language texts. The simple Bayesian method assumes that the words of a document are independent of each other. Both the document and the information areas are treated as probabilistic systems for which the probabilities of occurrence of word forms are calculated as independent events. To determine the probability that a document belongs to a category, a measure of proximity between these two probabilistic systems is calculated.

The simple Bayesian classification method can be used both for binary classification (deciding whether or not a document belongs to a category) and for multiple classification (finding, in a list of categories, the one to which the document belongs). In the latter case the document can belong to only one information area from the list. In tasks where a document can belong to several information areas at once, several binary classifiers of this type are used simultaneously, each deciding whether the current document belongs to a given information area. The information areas are then assumed to be independent of each other.
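The simple Bayesian scheme described above can be illustrated with a minimal sketch. It assumes a multinomial word model with add-one smoothing, which the cited paper may or may not use; the function names `train_nb` and `classify_nb` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect per-area word counts and document counts from (words, area) pairs."""
    word_counts = defaultdict(Counter)   # area -> word form -> count
    doc_counts = Counter()               # area -> number of training documents
    for words, area in labeled_docs:
        word_counts[area].update(words)
        doc_counts[area] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify_nb(words, model):
    """Return the single most probable area, treating words as independent events."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_area, best_score = None, -math.inf
    for area in doc_counts:
        total_words = sum(word_counts[area].values())
        # Log prior plus summed log likelihoods; add-one smoothing is an assumption.
        score = math.log(doc_counts[area] / total_docs)
        for w in words:
            if w in vocab:
                score += math.log((word_counts[area][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_area, best_score = area, score
    return best_area


# Invented training set of already preprocessed documents.
training = [
    (["invoice", "payment", "bank"], "finance"),
    (["payment", "tax", "invoice"], "finance"),
    (["server", "network", "backup"], "it"),
    (["network", "password", "server"], "it"),
]
model = train_nb(training)
```

Because `classify_nb` returns exactly one area, it corresponds to the multiple-classification case; the multi-area case would need one binary classifier per area.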

However, this method has the following disadvantages:

it cannot classify documents when the information areas are thematically dependent on each other, for example, when they are hierarchically subordinate to one another;

it cannot classify documents by degree of confidentiality;

it analyzes the entire contents of the document rather than only its informative part.

Another known analogue is a method for automatic classification of documents (Pat. 6327581 United States of America, IPC G06F 015/18. Methods and apparatus for building a support vector machine classifier [Text] / Carlton J.; applicant and patentee Microsoft Corporation. - No. 09/055477; filed 06.04.98; publ. 04.12.01). In this method a document is converted from its storage format into natural-language text, the words of the document are reduced to basic word forms, insignificant words are discarded, and word weights are calculated according to their frequencies of occurrence. At the training stage, a set of classification features is formed from a presented set of manually classified documents. When a document is classified, it is preprocessed in the same way, and its membership in an information area is determined from the SVM (Support Vector Machines) classification criterion and the classification features. Like the previous method, this one is intended for processing machine-readable natural-language texts. The method described in [2] is based on SVM classification, which constructs in the multidimensional feature space a hyperplane separating the features of documents that belong to an information area from the features of documents that do not. This method can also be used when a document may belong to several information areas at once.

This method has disadvantages:

Another known analogue is a multiclass classification method (Schapire R.E., Singer Y. "BoosTexter: A boosting-based system for text categorization". Machine Learning 39, 2/3, 2000, pp.135-168). In this method a document is converted from its storage format into natural-language text, the words of the document are reduced to basic word forms, insignificant words are discarded, and word weights are calculated according to their frequencies of occurrence, thereby forming the feature vector of the document. At the training stage, a set of classification features is formed from a presented set of manually classified documents and saved in a database. When a document is classified, it is preprocessed in the same way and its feature vector is formed, after which a decision is made on whether or not the document belongs to each of the information areas. Here too, natural-language texts are understood as machine-readable texts. For classification the method uses weak hypotheses about the membership of a document in a set of information areas to iteratively refine the distribution function of information areas over the set of documents. The weak hypotheses are obtained with binary document classification methods, and during classification the constructed distribution is used to determine the list of information areas to which the document belongs. The method performs well because it applies simple classification methods repeatedly, which improves classification accuracy. In addition, the categories are not considered independent in this method: the dependence between them is set at the training stage by presenting an appropriate training sample of documents.

The disadvantage of this method is:

the impossibility of using a priori information about the dependences of information areas on each other during classification.

The closest to the proposed method in technical essence is a method for automatic classification of documents (Pat. 2254610 Russian Federation, IPC G06F 17/30. Method for automatic classification of documents [Text] / Agranovsky A.V., Arutyunyan R.E., Khadi R.A., Telesnin B.A.; applicant and patentee State Scientific Institution Research Institute "SPETSVUZAVTOMATIKA". - No. 2003126907/09; filed 04.09.03; publ. 20.06.05), taken as the prototype. In it, a document is converted from its storage format into natural-language text, the words of the converted document are reduced to basic word forms, insignificant words are discarded, and word weights are calculated according to their frequencies of occurrence, thereby forming the feature vector of the document. At the training stage, a set of classification features is formed from a presented set of manually classified documents and saved in a database. When a document is classified, it is preprocessed in the same way and its feature vector is formed, after which a decision is made on whether or not the document belongs to each of the categories. The prototype is characterized in that, at the stage of determining the membership of the document in each category, a priori information about the dependences of categories on each other is used, specified by a category tree. Binary classifiers determine the membership of the document in the categories, after which the membership of each category of the document in higher-level categories is analyzed. If the number of tree vertices to which the document belongs exceeds the number of vertices to which it does not belong, a decision is made that the document corresponds to the current vertex; the decisions of the classifier are then corrected along the entire path from the current vertex to the root of the tree, and the document is classified under all intermediate vertices of the tree.
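The tree-based correction step of the prototype can be read in more than one way. The sketch below shows one possible reading, assuming the vote is taken over the vertices on the path from the current vertex to the root; the tree, the raw classifier outputs, and the function names are all hypothetical.

```python
def path_to_root(node, parent):
    """Return the vertices from `node` up to the root, inclusive."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def correct_decisions(node, parent, decisions):
    """Majority vote over the path; if it succeeds, accept every vertex on the path."""
    path = path_to_root(node, parent)
    yes = sum(1 for n in path if decisions[n])
    no = len(path) - yes
    corrected = dict(decisions)          # leave the raw decisions untouched
    if yes > no:
        for n in path:
            corrected[n] = True
    return corrected


# Hypothetical category tree: root -> finance -> taxes
parent = {"root": None, "finance": "root", "taxes": "finance"}
# Raw outputs of the per-category binary classifiers (invented for illustration)
decisions = {"root": True, "finance": False, "taxes": True}
corrected = correct_decisions("taxes", parent, decisions)
```

Here two of the three vertices on the path accept the document, so the negative decision for the intermediate vertex `finance` is overridden.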

The disadvantage of the prototype is:

The technical result consists in extracting specified metadata and classifying formalized documents according to them (including by degree of confidentiality), and in analyzing only the informative part of the document text, rather than its entire contents, when determining whether the document belongs to an information area, which reduces the operating time of the system (increases its responsiveness).

This technical result is achieved by extracting the characteristics of identical text sections Z = {z₁, z₂, …, z_n} (requisites, also called details) of a formalized document. Each requisite is expressed by a finite predicate P(Z, T, L), where T is the set of text characteristics t, L = {l₁, l₂, …, l_q} is the set of finite predicates recognizing the keywords l of a requisite, and q is the number of all keywords used.

The rule for constructing the predicate that recognizes a requisite of a formalized document is expressed by the following formula [5]:

where $t_{h}^{a}$ is the predicate recognizing the value a of the h-th text variable; m is the number of text variables; n is the size of the alphabet of the h-th text variable; $l_{i}^{b}$ is the predicate recognizing the value b of the keyword corresponding to the i-th zone.

Because a formalized document has only a small number of distinct requisites (GOST R 6.30-2003 lists 30 document requisites), some of them do not determine the individuality of a document form: for example, those common to all forms (the text) or those not applicable at all under the given conditions of use (the State Emblem of the Russian Federation in a private organization).

The form of a document is expressed by a finite predicate P(V, Z, L), where V = {ν₁, ν₂, …, ν_m} is the set of document forms, j = {1, 2, …, m}; m is the number of all document forms used; Z = {z₁, z₂, …, z_n} is the set of finite predicates of document requisites; n is the number of all document requisites; L = {l₁, l₂, …, l_q} is the set of keywords; and q is the number of all keywords used.

The rule for constructing the predicate that recognizes the form of a document is expressed by the following formula [5]:

where $i = \overline{1, n}$; z₁ is the predicate recognizing a requisite for the j-th document form; $l_{j}^{c}$ is the predicate recognizing the unique keyword value of the c-th document form.

Using rules (1) and (2), systems of predicates for identifying document requisites and document forms are built.

The form of a document uniquely determines where its requisites are located, which makes it possible to:

classify documents by document form and, using the corresponding requisite and a list of its possible values, by degree of confidentiality;

analyze only the informative part of the document contents, for example, only the text.
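The two capabilities listed above can be sketched as follows. Since formulas (1) and (2) are not reproduced in the text, the sketch assumes that a form predicate is true when all of its requisite predicates are true and its unique keyword is present; the zone names, keywords, form names, and list of confidentiality values are invented for illustration.

```python
def make_requisite_predicate(zone, keywords):
    """Predicate that is 1 when `zone` exists and contains any of `keywords`."""
    def predicate(document):
        text = document.get(zone, "").lower()
        return int(any(k in text for k in keywords))
    return predicate


def make_form_predicate(requisite_predicates, unique_keyword):
    """Predicate that is 1 when every requisite is recognized and the
    unique keyword of the form occurs somewhere in the document."""
    def predicate(document):
        all_text = " ".join(document.values()).lower()
        return int(all(p(document) for p in requisite_predicates)
                   and unique_keyword in all_text)
    return predicate


# Hypothetical forms; nothing here comes from GOST R 6.30-2003.
FORMS = {
    "order": make_form_predicate(
        [make_requisite_predicate("header", ["order"]),
         make_requisite_predicate("marking", ["confidential", "public"])],
        "order"),
    "report": make_form_predicate(
        [make_requisite_predicate("header", ["report"])],
        "report"),
}
CONFIDENTIALITY_LEVELS = ["confidential", "public"]  # assumed list of possible values


def classify_by_form(document):
    """Return (form name, confidentiality level, informative text)."""
    form = next((name for name, p in FORMS.items() if p(document)), None)
    marking = document.get("marking", "").lower()
    level = next((v for v in CONFIDENTIALITY_LEVELS if v in marking), None)
    return form, level, document.get("body", "")


doc = {"header": "ORDER No. 12", "marking": "Confidential",
       "body": "Allocate funds for server maintenance."}
result = classify_by_form(doc)
```

Only the third element of `result`, the body text, would then be passed on to thematic analysis.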

The informative part of the document (hereinafter, the text) is converted from its storage format into natural-language text, the words are reduced to basic word forms, insignificant words are discarded, and word weights are calculated according to their frequencies of occurrence, thereby forming predicates for identifying text features. At the training stage, a system of predicates for identifying text features is formed from a presented set of manually classified texts and saved in the database. The number of predicates in the system equals the number of information areas into which documents must be classified (the number of executors in the automated system).
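A minimal sketch of the preprocessing steps in this paragraph is given below. It assumes English input, a tiny hand-written stop list, and toy suffix-stripping rules standing in for a real base-form algorithm; `STOP_WORDS`, `SUFFIXES`, `base_form`, and `preprocess` are hypothetical names.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "for"}  # assumed stop list
SUFFIXES = ("ing", "ed", "es", "s")                         # toy rules, not a real stemmer


def base_form(word):
    """Strip the first matching suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Split into words, reduce to base forms, then discard insignificant words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    forms = (base_form(t) for t in tokens)
    return [f for f in forms if f not in STOP_WORDS]


forms = preprocess("Payments for the servers and the payment of taxes")
```

The resulting list of base word forms is the input from which word weights are then computed.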

The rule for constructing the system of predicates P(U, W) that recognizes an information area u_j ∈ U = [u₁, u₂, …, u_s] is expressed by the following formula:

where ${w_{i}}_{g}^{f}$ is the predicate recognizing the weight value f of the significant word w_i ∈ W = {w₁, w₂, …, w_p} (W being the set of significant words of the texts) in the text of document d of the u_j-th information area, for the g-th value of the word weight; p is the number of significant words in the texts.

At the operating stage, when a text is classified, it is converted from its storage format into natural-language text, the words are reduced to basic word forms, insignificant words are discarded, the word weights are calculated, and the resulting values are substituted into the system of predicates (3) stored in the database. The predicates of the system that take the truth value "1" determine membership in the corresponding information area or areas. If a priori information about the dependences of information areas on each other has to be used, for example to define a tree of information areas, the algebra of finite predicates [6] is applied; it supports the full range of operations on logical expressions and, accordingly, on the information areas described by finite predicates (adding, excluding, and combining information areas, and so on). With this taken into account, the classification method determines from the input document which nodes of the tree of information areas it belongs to and which it does not. Note that this method is intended for processing machine-readable natural-language texts.
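The classification step can be sketched as evaluating a stored set of predicates on the computed word weights. Formula (3) is not reproduced in the text, so the sketch assumes an elementary predicate of the form "the weight of a word lies in a given interval" and combines such predicates with conjunction, disjunction, and negation; the areas, words, and thresholds are invented rather than learned.

```python
def weight_in(word, low, high):
    """Elementary predicate: 1 if the weight of `word` lies in [low, high]."""
    return lambda weights: int(low <= weights.get(word, 0.0) <= high)


def conj(*preds):
    """Conjunction of predicates."""
    return lambda weights: int(all(p(weights) for p in preds))


def disj(*preds):
    """Disjunction of predicates."""
    return lambda weights: int(any(p(weights) for p in preds))


def neg(pred):
    """Negation of a predicate."""
    return lambda weights: 1 - pred(weights)


# Hypothetical predicate system; one predicate per information area.
finance = disj(weight_in("payment", 0.2, 1.0), weight_in("tax", 0.2, 1.0))
taxes = conj(finance, weight_in("tax", 0.2, 1.0))   # child area requires its parent
it = weight_in("server", 0.3, 1.0)
SYSTEM = {"finance": finance, "finance/taxes": taxes, "it": it}


def classify(weights):
    """Return every area whose predicate takes the truth value 1."""
    return [area for area, p in SYSTEM.items() if p(weights) == 1]


areas = classify({"payment": 0.5, "server": 0.25, "tax": 0.25})
```

Expressing the child area as a conjunction with its parent is one way to encode a tree of information areas a priori; excluding an area could be written in the same style, for example `conj(finance, neg(it))`.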

The weight f of the word form w_i in the text of document d_j is calculated by the formula:

Here $c_{w_{i} d_{j}}$ is the number of times the w_i-th word form occurs in the d_j-th document text, and $N_{d_{j}}$ is the total number of word forms in the d_j-th document text.
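The definitions above read naturally as a relative term frequency, $f = c_{w_{i} d_{j}} / N_{d_{j}}$. This is an assumption, since the formula itself is not reproduced in the text. Under that assumption, a minimal sketch (with the hypothetical function name `word_weights`) is:

```python
from collections import Counter


def word_weights(word_forms):
    """Weight of each word form: its count divided by the total number of word forms."""
    total = len(word_forms)
    if total == 0:
        return {}
    return {w: c / total for w, c in Counter(word_forms).items()}


# Four word forms in total; "payment" occurs twice.
weights = word_weights(["payment", "server", "payment", "tax"])
```

With this reading, the weights of all word forms in one text sum to 1.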

Documents for classification can be presented in various formats from which text content can be extracted: text files of various formats, image files containing a graphical representation of text, audio files with recorded speech, and other files for which there is a mechanism for extracting text that reflects their content. Every document (whether a training document or one being classified) first passes through a preprocessing stage, where its format is determined and it is established whether text can be extracted from a document of that format. If so, the text is extracted. After the text is split into words, the basic word form of each word is determined by one of the methods in [7-10]. The most common choice for this task is the Porter algorithm [10], which applies special rules for stripping and replacing word endings.

According to the proposed method, each document d_i is represented as the Cartesian product of variables from the sets T × L × W. The classifier training stage serves to initialize the classifier and to construct the classification features. A set of training documents, classified manually in advance, must be provided for this stage. After their text content is extracted, a dictionary of significant words is built. The dictionary contains the basic word forms of all words occurring in the training documents.

When a document is classified, not all word forms from the document dictionary are taken into account, but only those included in the working dictionary of the classifier for the given information area (the given executor), which is what (3) determines. The classifier's working dictionary includes the word forms that are most informative for deciding whether a document belongs to the given category and that are not in the stop dictionary. The informativeness of the word form w_i for the classifier of the information area u_j is determined by the following formula [11]:

An informativeness threshold ε is set; the classifier's working dictionary includes all word forms that are not in the stop dictionary and whose informativeness exceeds this threshold. The stop dictionary consists of word forms whose frequency of occurrence in the set of training documents exceeds a predefined threshold δ. This cuts off words that carry no semantic load, such as prepositions, conjunctions, parenthetical and general words, and so on. According to this method, the value of the coefficient δ is set between 0.05 and 0.7, depending on the specifics of how the method is used. The value of the informativeness threshold ε may differ under different conditions of use.
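The two thresholds can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: "frequency of occurrence" is taken here to mean the share of training documents containing the word form, and the informativeness measure is passed in as a caller-supplied function because its formula is not reproduced in this text. All names and sample scores are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable


def build_stop_dictionary(docs: list[list[str]], delta: float) -> set[str]:
    """Word forms whose document frequency exceeds delta.

    Assumption: frequency = share of training documents containing the form.
    """
    if not docs:
        return set()
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word form once per document
    return {w for w, count in doc_freq.items() if count / len(docs) > delta}


def build_working_dictionary(
    vocabulary: Iterable[str],
    stop_dictionary: set[str],
    informativeness: Callable[[str], float],
    epsilon: float,
) -> set[str]:
    """Keep non-stop word forms whose informativeness exceeds epsilon."""
    return {
        w for w in vocabulary
        if w not in stop_dictionary and informativeness(w) > epsilon
    }


training_docs = [
    ["the", "contract", "price"],
    ["the", "order", "price"],
    ["the", "report"],
]
stop = build_stop_dictionary(training_docs, delta=0.7)

# Hypothetical informativeness scores standing in for the formula from [11].
scores = {"the": 0.0, "contract": 0.9, "price": 0.2, "order": 0.8, "report": 0.6}
vocabulary = {w for doc in training_docs for w in doc}
working = build_working_dictionary(vocabulary, stop, scores.get, epsilon=0.5)
```

Here "the" occurs in every training document and is cut off by δ = 0.7, while "price" survives the stop dictionary but falls below ε = 0.5.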

The texts (informative parts) of documents are classified by computing the values of the system of predicates that describes the information areas. The predicate system is constructed according to rule (3).
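As a rough illustration of classification by evaluating a predicate system, the sketch below models each information area as a Boolean function of the word-form weights and combines elementary predicates with conjunction, disjunction, and negation. The form of the elementary predicate (a weight threshold), the area names, and all identifiers are assumptions; rule (3) itself is not reproduced here.

```python
from typing import Callable, Mapping

Weights = Mapping[str, float]
Predicate = Callable[[Weights], bool]


def weight_at_least(word_form: str, threshold: float) -> Predicate:
    """Elementary predicate (assumed form): the weight reaches a threshold."""
    return lambda w: w.get(word_form, 0.0) >= threshold


def all_of(*preds: Predicate) -> Predicate:
    """Conjunction of predicates."""
    return lambda w: all(p(w) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Disjunction of predicates."""
    return lambda w: any(p(w) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Negation of a predicate."""
    return lambda w: not pred(w)


def recognize_areas(weights: Weights, system: Mapping[str, Predicate]) -> list[str]:
    """Return every information area whose predicate evaluates to true."""
    return [area for area, pred in system.items() if pred(weights)]


# Hypothetical predicate system with two information areas.
system = {
    "finance": any_of(weight_at_least("invoice", 0.05),
                      weight_at_least("payment", 0.05)),
    # A priori dependency (assumed): "personnel" excludes invoice documents.
    "personnel": all_of(weight_at_least("employee", 0.05),
                        negate(weight_at_least("invoice", 0.05))),
}

areas = recognize_areas({"invoice": 0.10, "employee": 0.08}, system)
```

Expressing areas as composable logical expressions is what lets a known dependency between areas be written directly into the predicate for one of them.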

The invention is illustrated by the drawing.

The drawing shows a block diagram of a computing device for implementing the method.

The device for implementing the method (see the drawing) consists of the following blocks:

1 document source;

2 text characteristics analyzer;

3 document details recognition;

4 document form recognition;

5 metadata extraction;

6 basic word form determination;

7 working dictionary creation;

8 determination of the word-form weights of the document text;

9 information area recognition;

10 registration of the document by its metadata;

11 training;

12 document addressing;

13 dispatch in accordance with the resulting classification.

According to the method, the device operates as follows:

1. In classification mode.

When a new document appears in the document source 1, it passes to block 2, which determines the values of the text characteristics t of the document sections and the keywords l in them. The values of t and l for the document sections pass to block 3, where the document details are recognized by the system of predicates constructed according to rule (1). Information about the recognized details passes to block 4, where the system of predicates constructed according to rule (2) recognizes the form of the document.

In block 5, the required values of the details, which constitute the document's metadata, are extracted from the document received from block 2, using the document form determined in block 4, which uniquely specifies where the values of the details are located. The document and its metadata pass to block 10, where the document is registered by its metadata and a reference copy of it is stored. The informative part of the document, likewise uniquely determined in block 5, passes to block 6, where its words are converted into word forms. The word forms obtained in block 6 pass to block 7, where a working dictionary of significant words is built as the system operates.

The word forms obtained in block 6 also pass to block 8, which calculates the weights f of those word forms of the informative part of the document that are in the working dictionary. From block 8, the weights of these word forms pass to block 9, where the information area u_i is recognized by computing the values of the predicates of the system constructed according to rule (3).
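The weight calculation in block 8 can be sketched as below, assuming that the weight f of a word form is its number of occurrences divided by the total number of word forms in the text, and that only word forms in the working dictionary receive a weight. The function name and sample data are hypothetical.

```python
from collections import Counter


def word_form_weights(word_forms: list[str], working_dictionary: set[str]) -> dict[str, float]:
    """Relative frequency of each working-dictionary word form in the text.

    Assumption: weight = occurrences / total number of word forms in the text.
    """
    total = len(word_forms)
    if total == 0:
        return {}
    counts = Counter(w for w in word_forms if w in working_dictionary)
    return {w: count / total for w, count in counts.items()}


weights = word_form_weights(
    ["contract", "price", "contract", "the"],
    working_dictionary={"contract", "order"},
)
```

Word forms outside the working dictionary still count toward the total, so the weights reflect how much of the whole text a significant word form occupies.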

In block 12, the document and metadata received from block 10 are assigned addresses corresponding to the information area, using the values obtained from block 9. Block 13 then sends the document to those addresses (classification in accordance with the information area).

2. In training mode.

The system uses training mode in three cases:

when the predicate system in block 3 cannot recognize the document details from the values of the document variables t and l. In this case, the system operator either modifies the predicate system of block 3 through block 11 or determines the document detail manually;

when the predicate system in block 4 cannot recognize the document form from the values of the predicates of the predicate system of block 3. In this case, the system operator either modifies the predicate system of block 4 through block 11 or determines the document form manually;

when the predicate system in block 9 cannot recognize the information area from the weights of the significant words of the working dictionary extracted from the informative part of the document. In this case, the system operator either modifies the predicate system of block 9 through block 11 or determines the information area of the document manually.
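The third case above can be sketched as a fallback from classification to training. The operator is modeled as a callback that returns an information area and, optionally, a new predicate to add to the system; this interface, the predicate form, and all names are assumptions made for illustration.

```python
from typing import Callable, Mapping, Optional

Weights = Mapping[str, float]
Predicate = Callable[[Weights], bool]
# Hypothetical operator interface: returns (area, new predicate or None).
Operator = Callable[[Weights], tuple[str, Optional[Predicate]]]


def classify_or_train(
    weights: Weights,
    system: dict[str, Predicate],
    ask_operator: Operator,
) -> tuple[list[str], bool]:
    """Return (recognized areas, whether training mode was entered)."""
    matches = [area for area, pred in system.items() if pred(weights)]
    if matches:
        return matches, False
    # No predicate fired: switch to training mode and ask the operator.
    area, new_predicate = ask_operator(weights)
    if new_predicate is not None:
        system[area] = new_predicate  # operator amends the predicate system
    return [area], True  # otherwise the area was simply set manually


system = {"finance": lambda w: w.get("invoice", 0.0) >= 0.05}


def operator(weights: Weights) -> tuple[str, Optional[Predicate]]:
    # Stand-in for a human decision made through the training block.
    return "personnel", lambda w: w.get("employee", 0.0) >= 0.05


first = classify_or_train({"employee": 0.1}, system, operator)
second = classify_or_train({"employee": 0.1}, system, operator)
```

After the operator has amended the predicate system once, the same kind of document is recognized automatically on the next pass.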

Thus, the method makes it possible to classify documents by their confidentiality level and any other attributes reflected in the metadata, and to analyze only the informative part of the document while taking into account a priori dependencies between information areas, which achieves the technical result stated above.

Information sources

1. Li Y., Jain A. "Classification of text documents". The Computer Journal 41, 8, pp. 537-546, 1998.

2. Pat. 6327581 United States of America, IPC G06F 015/18. Methods and apparatus for building a support vector machine classifier / Carlton J.; applicant and patent holder Microsoft Corporation. - No. 09/055477; filed 6 Apr. 1998; published 4 Dec. 2001.

3. Schapire R.E., Singer Y. "BoosTexter: A boosting-based system for text categorization". Machine Learning 39, 2/3, 2000, pp. 135-168.

4. Pat. 2254610 Russian Federation, IPC G06F 17/30. Method for automatic classification of documents / Agranovsky A.V., Arutyunyan R.E., Khadi R.A., Telesnin B.A.; applicant and patent holder State Scientific Institution Research Institute "Spetsvuzavtomatika". - No. 2003126907/09; filed 4 Sep. 2003; published 20 Jun. 2005 - prototype.

5. Korolev I.D., Nosenko S.V. Approaches to the rapid identification of formalized electronic documents in automated records management // Polythematic Online Scientific Journal of Kuban State Agrarian University (Scientific Journal of KubSAU) [Electronic resource]. - Krasnodar: KubSAU, 2013. - No. 08(092). - IDA [article ID]: 0921308074. - Available at: http://ej.kubagro.ru/2013/08/pdf/74.pdf, 0.875 conventional printed sheets.

6. Bondarenko M.F., Shabanov-Kushnarenko Yu.P. On the algebra of finite predicates // Scientific and technical journal "Bionics of Intelligence". KNURE, Kharkov, Ukraine, 2011, No. 3(77).

7. Porter M.F. "An algorithm for suffix stripping", Program, Vol. 14, No. 3, 1980, pp. 130-137.

8. Pat. 2096825 Russian Federation, IPC G06F 17/00, G06F 17/30. Information processing device for information retrieval / Kovalev M.V., Virgunov I.V., Naimushin I.A., Chetverov V.V.; applicant and patent holder Informburo Limited Liability Company. - No. 96119820/09; filed 14 Oct. 1996; published 20 Nov. 1997, Bull. No. 14.

9. Pat. 6308149 United States of America, IPC G06F 17/27. Grouping words with equivalent substrings by automatic clustering based on suffix relationships / Gaussier E., Grefenstette G., Chanod J.-P.; applicant and patent holder Xerox Corporation. - No. 09/213309; filed 16 Dec. 1998; published 23 Oct. 2001.

10. Pat. 6430557 United States of America, IPC G06F 017/30; G06F 017/27; G06F 017/21. Identifying a group of words using modified query words obtained from successive suffix relationships / Gaussier E., Grefenstette G., Chanod J.-P.; applicant and patent holder Xerox Corporation. - No. 09/212662; filed 16 Dec. 1998; published 6 Aug. 2002.

11. Craven M., DiPasquo D., Freitag D. et al. "Learning to construct knowledge bases from the World Wide Web", Artificial Intelligence, Vol. 118(1-2), 2000, pp. 69-113.

Claims

A method for automatically classifying formalized documents in an electronic document management system, which consists in converting a document from its storage format into natural-language text, converting the words of the converted document into basic word forms, discarding insignificant words, and calculating the weights of the words in the document in accordance with their frequencies of occurrence, thereby forming the characteristics of the document; at the training stage, a set of classification features is formed from a set of manually classified documents and saved in a database; when a document is classified, a decision on whether the document belongs to each of the information areas is made on the basis of the classification features obtained from the database; at the stage of determining whether the document belongs to each information area, a priori information about the dependencies of the categories on one another is used; characterized in that, before the document is converted from its storage format into natural-language text, the areas of the document are identified for extracting the metadata and the informative part; at the training stage, a system of predicates for identifying the features of the document's information is created from the classification features (the weights of significant words), and the predicate system is saved in the database; at the stage of system operation, the obtained values of the weights of significant word forms are substituted into the predicate system in the database; and, where a priori information about the dependencies between information areas must be used, the algebra of finite predicates is applied, which allows operations on the logical expressions that describe the information areas.