RU2759887C1

RU2759887C1 - Method for automatic classification of formalized electronic graphic and text documents in the electronic document circulation system with automatic formation of electronic cases

Info

Publication number: RU2759887C1
Application number: RU2020144075A
Authority: RU
Inventors: Игорь Дмитриевич Королев; Максим Юрьевич Филиппов; Вадим Сергеевич Назинцев
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-11-18

Abstract

FIELD: computing.

SUBSTANCE: invention relates to computing. In the method, the type of the document is determined from information about the characteristics of the document and its recognized requisites; from the determined areas of information responsibility (IRA), a priori information about the structure of the organization and the type of document, a data tuple is obtained and assigned to the "resolution" requisite of the document; from the determined IRA and the identified type of document, the article of the departmental List of documents with their storage periods (hereinafter referred to as the List) to which the executed document can be assigned is determined; from the determined requisites and unique keywords, the contour of the electronic document management system (EDMS) is determined; from the determined article of the List and the EDMS contour, the storage period of the executed document is determined; from the determined article of the List, the storage period and the executor of the document, the case into which the executed document will be distributed is determined; from the determined confidentiality label of the document and the known confidentiality label of the case into which the document will be distributed, the correspondence between the two confidentiality labels is checked.

EFFECT: automation of the classification of formalized electronic text and graphic documents in the electronic document management system by the areas of information responsibility of officials, for reporting to the decision-maker, and of their distribution into electronic cases.

1 cl, 1 dwg

Description

Technical field to which the invention relates

The invention relates to systems for classifying and annotating graphic and text documents. It can be used in electronic document management systems, databases and automated systems where formalized electronic graphic and text documents need to be classified by the degree of confidentiality of the information they contain and by the areas of information responsibility of officials, taking into account their level of access to that information, and where electronic cases need to be formed automatically from the results of annotating the informative part of each document.

State of the art

a) Description of analogs

A known analog is a multiclass classification method (Schapire R.E., Singer Y. "BoosTexter: A boosting-based system for text categorization". Machine Learning 39, 2/3, 2000, pp. 135-168), in which a document is converted from its storage format into natural-language text, the words of the document are converted into base word forms, insignificant words are discarded, and word weights are computed from their frequencies of occurrence, thereby forming the feature vector of the document; at the training stage, a set of classification features is formed from a presented set of manually classified documents and stored in a database; when a document is classified, it is converted from its storage format into natural-language text, its words are converted into base word forms, insignificant words are discarded, word weights are computed and the feature vector of the document is formed, after which a decision is made on whether or not the document belongs to each information area.

The disadvantage of this method is that it does not allow formalized electronic graphic documents to be classified by the areas of information responsibility of officials.
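The word-weighting steps described for this analog (base word forms, removal of insignificant words, frequency-based weights) can be illustrated with a minimal sketch. It is not the BoosTexter implementation: the stop-word list, the crude suffix stripping in `base_form` and the use of relative frequency as the weight are all assumptions made for illustration only.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical stop-word list (assumption); a real system would use a
# complete, language-specific list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def base_form(word: str) -> str:
    """Crude stand-in for reduction to a base word form (assumption)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def feature_vector(text: str) -> dict[str, float]:
    """Map each significant base form to its relative frequency."""
    tokens = [base_form(w) for w in re.findall(r"[A-Za-z]+", text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}
```

A trained classifier would then compare such vectors against the classification features stored in the database; that step is outside the scope of this sketch.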

Another known analog is a method of training a classifier intended to determine the category of a document (RF Patent No. 2672395, IPC G06K 9/00, G06F 17/30, 2018). It consists in receiving documents that belong to a category; determining, for each received document, the objects it contains that are graphic elements; forming, for each received document, a set of features from the determined objects, where the features characterize the presence of objects, the location of objects, the number of objects, the position of one object relative to another, the size of an object and the tilt angle of an object; and building the classifier from the values of the features formed for the received documents.

The disadvantage of this method is that it does not allow formalized electronic graphic documents to be classified by the areas of information responsibility of officials.

b) Description of the closest analog (prototype)

The closest to the proposed method in technical essence is a method for automatic classification of electronic documents in an electronic document management system with automatic formation of electronic cases (RF Patent No. 2692972, IPC G06F 17/30, G06F 17/21, 2019). It consists in extracting and analyzing the formal part of the received document (its requisites); converting the informative part of the document into natural-language text; converting the words of the converted document (except for individual words and phrases corresponding to the time intervals for performing the activity defined by the document) into base word forms; discarding insignificant words; computing word weights from their frequencies of occurrence and forming the features of the document; forming the "resolution" requisite from the classification features obtained; determining the storage period of the executed document; and deciding on the case into which the executed document is to be distributed.

The disadvantage of this method is the inability to classify formalized electronic graphic documents by the areas of information responsibility of officials; eliminating this limitation is what achieves the claimed technical result.

Disclosure of the essence of the invention

a) Technical result at which the invention is aimed

The technical result of the present invention is the automation of the classification of formalized electronic text and graphic documents in the electronic document management system by the areas of information responsibility of officials, for reporting to the decision-maker, and of their distribution into electronic cases.

b) Set of essential features

A formalized text document is a typical (standard) document with a standard composition and layout of requisites, as defined by regulatory legal acts, whose content (informative) part is represented by the requisite "document text" (note: the document may contain graphic elements). A formalized graphic document is a typical (standard) document with a standard composition and layout of requisites, as defined by regulatory legal acts, whose content (informative) part is represented by the requisite "graphic image", consisting of individual graphic elements constructed from lines, strokes, chiaroscuro, dots and color (restriction: the "document text" requisite is absent). A graphic element is a conventional formalized designation, that is, a sign (symbol); the totality of such elements makes up the graphic image.

To achieve the specified technical result, a method is proposed for automatic classification of formalized electronic graphic and text documents in an electronic document management system with automatic formation of electronic cases. In this method, the areas of the formalized document from which the metadata and the informative part are to be extracted are determined; for text documents, the document is converted from its storage format into natural-language text, the words of the processed document are converted into base word forms, insignificant words are discarded, and word weights are computed from their frequencies of occurrence; the specific type of electronic document is determined from the recognized requisites and the keyword values of those requisites; the area of information responsibility (hereinafter IRA) is determined from the word weights in the document; when the words of the document are converted into base word forms, individual words and phrases corresponding to the time intervals for performing the activity defined by the document are singled out and left unchanged, thereby forming a data vector of the execution deadlines of the document;

a first set of classification features is formed from the determined IRA and a priori information about the structure of the organization (institution), including the subordination relationships between its officials and their levels of access to the various degrees of document confidentiality; a second set of classification features is formed from the determined document type and the IRA to which the document belongs, using predicates for recognizing keywords and individual requisites of the formal part; a third set of classification features is formed from the determined IRA and document type, and the article of the List of documents with their storage periods (hereinafter the List), stored in the database, to which the executed document can be assigned is determined; a fourth set of classification features is formed from the previously determined requisites and the unique keywords relating to them, and the contour of the electronic document management system (hereinafter EDMS) in which the document was developed is determined; a fifth set of classification features is formed from the determined article of the List and the EDMS contour, and the storage period of the executed document is determined; a sixth set of classification features is formed from the determined article of the List, the determined storage period of the executed document and the executor of the document, and the case into which the executed document will be distributed is determined; the correspondence between the determined confidentiality label of the document and the known confidentiality label of the case into which it will be distributed is checked;

at the training stage, a system of predicates for determining the IRA and a system of predicates for identifying the confidentiality label of a document are formed from a set of manually classified documents and stored in the database; from a set of documents for which the "resolution" requisite has been filled in manually, a system of predicates for identifying the executor of the assignment for the received document and a system of predicates for identifying the assignment are formed and stored in the database; from a set of manually classified documents, a system of predicates for determining the article of the List, a system of predicates for determining the EDMS contour in which the document was developed and a system of predicates for determining the storage period of the document are formed and stored in the database; a system of predicates for determining the case into which the executed document will be distributed and a system of predicates for checking the correspondence between the confidentiality labels of the document and of that case are formed and stored in the database;

when documents are classified, the database is used to decide whether the document belongs to each of the information areas and each of the confidentiality labels; the first set of classification features is substituted into the system of predicates for identifying the executor of the assignment and, from the predicates that evaluate to "true", a decision is made to assign the document to the competence of specific employees subordinate to the head; the second set of classification features is substituted into the system of predicates for identifying the assignment and, from the predicates that evaluate to "true", a decision is made to give the executors specific assignments for executing the received document; the resulting data on the executor, the assignment and the execution deadline, together with data on the date of consideration of the document obtained by any means, are combined into a data tuple and assigned to the "resolution" requisite of the document; the database is used to decide whether the document belongs to each of the articles of the List; the third set of classification features is substituted into the system of predicates for determining the article of the List and, from the predicates that evaluate to "true", a decision is made to assign the document to a specific article of the List; the fourth set of classification features is substituted into the system of predicates for determining the EDMS contour and, from the predicates that evaluate to "true", a decision is made on the EDMS contour in which the document was developed; the fifth set of classification features is substituted into the system of predicates for recognizing the storage period of the document and, from the predicates that evaluate to "true", a decision is made to assign a storage period to the executed document; the sixth set of classification features is substituted into the system of predicates for determining the case into which the executed document will be distributed and, from the predicates that evaluate to "true", a decision is made on the case into which the executed document is to be distributed.

The method is characterized in that the type of the document, graphic or text, is determined from information about the characteristics of the document and the requisites recognized in it; if the document is determined to be graphic, image recognition of graphic elements (the Viola-Jones algorithm) is used to determine whether the document contains graphic elements classified by the IRA of officials and, if such elements are detected, the resulting values are substituted into the system of predicates by which the IRA of officials is determined for the graphic document; at the training stage, sets of graphic elements relating to specific IRA of officials are formed from a set of manually classified graphic documents and, on their basis, a classifier of graphic elements is trained using image recognition of graphic elements (the Viola-Jones algorithm); a system of predicates for determining the IRA of officials is formed from the values of the classifier and stored in the database; at the stage of classifying a graphic electronic document, the set of values obtained for the detected graphic elements is substituted into the system of predicates for determining the IRA of officials and, from the predicates that evaluate to "true", a decision is made to assign the graphic document to the identified IRA of officials.
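The recurring pattern in the method (substitute a set of classification features into a stored system of predicates, keep the predicates that evaluate to "true", then assemble the "resolution" tuple and check the confidentiality labels) can be sketched as follows. All names, feature strings and label levels here are hypothetical. In particular, the text does not define what "correspondence" of labels means; the sketch assumes that the document's level must not exceed the case's level.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple

Features = FrozenSet[str]
Predicate = Callable[[Features], bool]


def true_labels(system: Dict[str, Predicate], features: Features) -> List[str]:
    """Substitute the features into every predicate; keep those that hold."""
    return [label for label, predicate in system.items() if predicate(features)]


# Hypothetical predicate system for identifying the executor (assumption).
EXECUTOR_SYSTEM: Dict[str, Predicate] = {
    "deputy_for_personnel": lambda f: "ira:personnel" in f,
    "deputy_for_logistics": lambda f: "ira:logistics" in f
    and "clearance:restricted" in f,
}


class Resolution(NamedTuple):
    """Data tuple assigned to the "resolution" requisite."""
    executor: str
    assignment: str
    deadline: str
    review_date: str


# Hypothetical ordered confidentiality levels (assumption).
LEVELS = {"unclassified": 0, "restricted": 1, "secret": 2}


def labels_correspond(document_label: str, case_label: str) -> bool:
    """Assumed rule: a document may not be more sensitive than its case."""
    return LEVELS[document_label] <= LEVELS[case_label]
```

The same `true_labels` helper would serve each of the six predicate systems, since only the feature set and the stored predicates change from stage to stage.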

Brief Description of Drawings

The figure shows a block diagram of a computing device for implementing the method. The device consists of the following blocks: input of formalized documents 1, analysis of document characteristics 2, recognition of document requisites 3, recognition of document type 4, extraction of metadata 5, determination of the type of the informative part of the document 6, determination of base word forms 7, creation of a working dictionary 8, determination of the weights of the word forms of the document text 9, determination of graphic elements 10, recognition of the area of information responsibility 11, registration of the document by metadata 12, training 13, recognition of the confidentiality label of the document 14, addressing of the document 15, formation of the draft resolution of the head 16, input-output to (from) the system 17, formation of cases 18.

Implementation of the invention

Upon receipt of an electronic document (hereinafter ED):

1. The characteristics of homogeneous sections of text Z, the requisites, are extracted. It is known a priori that the number of requisites of a formalized ED is limited. Each requisite is represented by a finite predicate P_Z(T, L), where T is a finite set of text characteristics t and L = {l_q} is the set of keywords l of the requisite, where […] and q' is the number of all keywords used. The rules for constructing predicates are written using the mathematical apparatus of predicate logic.

The rule for constructing the predicate that recognizes a requisite of a formalized document is expressed by the following formula:

[…]

where {b} is the set of significant words in the requisites of formalized documents;

{h} is the set of text characteristics, […] being the possible text characteristics;

{α} is the set of variable text characteristics, […] being the possible variable text characteristics;

[…] is the predicate recognizing the α-th variable of the h-th text characteristic;

l_b is the predicate recognizing significant words in the requisites.

2. The document type is determined using the finite predicate P_V(Z, L), where V = {v_j} is the set of document types, j is the document type index, j' is the number of all document types used, Z is the set of document requisites, and n is the number of all document requisites. The rule for constructing the predicate that recognizes the document type is expressed by formula (2) (formula not reproduced here):

where {v_j} is the set of document types, with j ranging over all types of formalized documents used;

Z = {z_j} is the predicate recognizing the i-th requisite for the j-th document type;

n is the number of all possible requisites in a document;

the predicate recognizing the unique value ξ of keyword q of the i-th requisite of the j-th document type (symbol not reproduced here).

Rules (1) and (2) are used to build systems of predicates that identify document forms (the positions and values of requisites) and the types of incoming documents. The document form uniquely specifies where the requisites are located, which makes it possible to classify documents by type and by degree of access restriction.

3. The type of the informative part of the document (text or graphic) is determined in order to choose how that part is extracted for determining the officials' areas of information responsibility (IRA). To this end, rule (1) is used to check for the presence of the "document text" requisite (Z₁₈, the document text requisite under GOST 7.0.97 of 2016).

3.1. If the requisite recognition predicate (1) for requisite Z₁₈ evaluates to "true", the informative text part of the document (hereinafter, the text) is converted from its storage format into natural-language text, the words are reduced to base word forms, insignificant words are discarded, and word weights are computed from their frequencies of occurrence in the text. This forms the predicates that identify the text features.

The weight ƒ of word form w_p in the text of document d_x is calculated by a formula (not reproduced here) whose terms are:

the number of times the w_p-th word form occurs in the d_x-th document text;

the total number of word forms in the d_x-th document text.
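The weight calculation can be sketched as follows. The formula itself is missing from this text, so the sketch assumes the weight is the relative frequency (occurrences of a word form divided by the total number of word forms), which is what the two listed terms suggest. The function name `word_form_weights` and the sample words are hypothetical.

```python
from collections import Counter


def word_form_weights(word_forms):
    """Map each word form to (occurrences in the text) / (total word forms).

    Assumption: the weight is the plain relative frequency; the source
    formula is not available to confirm this.
    """
    total = len(word_forms)
    if total == 0:
        return {}
    counts = Counter(word_forms)
    return {form: count / total for form, count in counts.items()}


# Hypothetical document text already reduced to base word forms.
weights = word_form_weights(["order", "report", "order", "deadline"])
```

Under this assumption the weights of one document always sum to 1, so documents of different lengths remain comparable.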

After the text is split into words, the base word form of each word is determined by one of the following methods. For documents in natural languages of the Slavic group, lemmatization algorithms (reduction of a word to its normal form, the lemma) are preferred, although suffix-stripping, stochastic, and statistical algorithms are also admissible. For documents in natural languages of the West Germanic group, suffix-stripping algorithms are preferred, for example the Porter stemmer, which applies special rules for cutting off and replacing word endings.
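The suffix-stripping idea can be illustrated with a deliberately tiny sketch. It is not the Porter stemmer, whose rule set is considerably larger; the suffix list and the `min_stem` parameter are assumptions invented for this example.

```python
# Hypothetical suffix list, checked in order; the real Porter stemmer uses
# several ordered rule phases with conditions on the remaining stem.
SUFFIXES = ("ing", "ed", "es", "s")


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = [strip_suffix(w) for w in ["documents", "signed", "classes", "is"]]
```

The `min_stem` guard keeps very short words such as "is" intact, which mirrors the kind of stem-length condition that rule-based stemmers rely on.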

3.2. If the requisite recognition predicate (1) for requisite Z₁₈ evaluates to "false", the document is checked for graphic elements in its informative part. The Viola-Jones method, based on n-dimensional Haar functions, is used for this. A trained classifier consisting of cascades of Haar features determines whether graphic elements are present in the image. To do so, a covariance operation is applied: two images are compared, one being the image in the graphic document and the other the graphic element to be recognized in it.

A rectangular area (a Haar feature) is moved across the image, boundary points with minimal differences from the sample are detected, and the classifier is then applied at each position. A Haar feature is a set of adjacent rectangular regions of a graphic element divided into two groups. To compute a Haar feature, the pixel brightnesses are summed over the regions of the first group and over the regions of the second group, and the second sum is then subtracted from the first, using the following formulas (not reproduced here):

where v_jk is the brightness of the pixel with coordinates (j; k);

a_i is the sum of pixel brightnesses in the i-th region of the first group;

b_i is the sum of pixel brightnesses in the i-th region of the second group;

X is the value of the Haar feature;

h_ai, w_ai, h_bi, w_bi are the heights and widths of the i-th regions of the first and second groups, respectively;

y_ai, x_ai, y_bi, x_bi are the offsets along the y and x axes of the i-th regions of the first and second groups;

N_a and N_b are the numbers of regions in the first and second groups.
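A minimal sketch of this computation follows, assuming the feature value is the sum of the first group's region sums minus the sum of the second group's region sums, as the description states. The region tuple layout `(y, x, h, w)`, the reading of pixel coordinates as (row, column), and the function names are assumptions made for illustration.

```python
def region_sum(image, y, x, h, w):
    """Sum of pixel brightnesses in the h-by-w rectangle whose top-left pixel is (y, x)."""
    return sum(image[j][k] for j in range(y, y + h) for k in range(x, x + w))


def haar_feature(image, first_group, second_group):
    """Haar feature value: first-group brightness total minus second-group total."""
    first_total = sum(region_sum(image, *region) for region in first_group)
    second_total = sum(region_sum(image, *region) for region in second_group)
    return first_total - second_total


# Hypothetical 2x4 image: bright left half, dark right half.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
# One region per group, each given as (y, x, h, w).
value = haar_feature(image, [(0, 0, 2, 2)], [(0, 2, 2, 2)])
```

Practical Viola-Jones implementations obtain the region sums from an integral image, so each rectangle costs four lookups; the direct summation here is only meant to make the definition explicit.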

Detecting a graphic element with sufficient accuracy requires a large number of Haar features. For this purpose, the Haar features are organized into classifier cascades.

For the available types of formalized graphic documents, a database is created that constitutes the trained classifier: cascades of Haar features able to detect, in the images of graphic documents, the graphic elements by which a formalized graphic document is classified by the IRA of officials.

The mathematical model of the subprocess that determines whether graphic elements are present in the document is expressed in predicate logic by formula (3) (formula not reproduced here):

The interpretation of formula (3) is defined as follows:

X = {x₁, …, x'} is the set of Haar features computed on the image, x ∈ {x₁, …, x'}, where x₁ is the first and x' the last Haar feature value computed on the image;

K = {k₁, …, k'} is the set of Haar features in the classifier, k ∈ {k₁, …, k'}, where k₁ is the first and k' the last Haar feature value in the classifier;

P⁺ = {(x₁, …, x', k₁, …, k') : λ(P(x₁, …, x', k₁, …, k')) = 1} is the set of graphic elements detected in the document;

P_G(x, k) is the predicate determining the presence of graphic elements in the document: the value of a Haar feature x (x ∈ X) computed on the document image equals the value of a Haar feature k (k ∈ K) in the classifier.

The resulting values are used to form a system of predicates that identify the image features.
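The predicate P_G can be sketched as a simple membership test. Formula (3) is missing from this text, so the existential reading below (at least one computed value equals at least one classifier value) is an assumption, as is the function name.

```python
def graphic_element_present(computed, classifier):
    """P_G sketch: True if some computed Haar value x equals some classifier value k."""
    classifier_values = set(classifier)
    return any(x in classifier_values for x in computed)


# Hypothetical feature values.
found = graphic_element_present([32, 5], [7, 32])
missing = graphic_element_present([1, 2], [3])
```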

3.3. If the requisite recognition predicate (1) for requisite Z₁₈ evaluates to "false" and rule (3) detects no graphic elements in the informative part of the document, the system passes to the training block, where an operator manually decides on the IRA of officials for the document.

4. The IRA is identified in predicate logic by a formula consisting of two predicates (the first determines the IRA from graphic documents, the second from text documents) joined by the logical operation of disjunction.

The mathematical model of the subprocess that determines the IRA of the document is expressed in predicate logic by formula (4) (formula not reproduced here):

The interpretation of formula (4) is defined as follows:

E = {β₁, …, β'} is the set of IRA values, β ∈ {β₁, …, β'}, where β₁ is the first and β' the last IRA value in the document;

B_β = {b₁, …, b'} is the set of Haar feature values assigned to the β-th IRA, b ∈ {b₁, …, b'}, where b₁ is the first and b' the last classified Haar feature value for the β-th IRA;

G = {g₁, …, g'} is the set of classifier Haar features detected in the document, g ∈ {g₁, …, g'}, where g₁ is the first and g' the last classifier Haar feature value detected in the document;

W = {w₁, …, w'} is the set of significant words detected in the document, w ∈ {w₁, …, w'}, where w₁ is the first and w' the last significant word in the document;

the set of significant words of the texts assigned to the β-th IRA (symbol not reproduced here), with elements ω ∈ {ω₁, …, ω'}, where ω₁ is the first and ω' the last significant word of the texts for the β-th IRA;

F = {ƒ₁, …, ƒ'} is the set of computed weights of the significant words detected in the document, ƒ ∈ {ƒ₁, …, ƒ'}, where ƒ₁ is the first and ƒ' the last computed weight of a significant word in the document;

the set of weights of the significant words of the texts assigned to the β-th IRA, together with its first and last elements (symbols not reproduced here).

The predicate determining the area of information responsibility from graphic documents (symbol not reproduced here): there exists a Haar feature value b (b ∈ B_β) assigned to the β-th IRA (β ∈ E) for which there exists an equal classifier Haar feature value g (g ∈ G) detected in the document.

The predicate determining the area of information responsibility from text documents (symbol not reproduced here): for every significant word w (w ∈ W) detected in the document text there exists an identical significant word of the texts assigned to the β-th IRA (β ∈ E), and for every computed weight ƒ (ƒ ∈ F) of a significant word detected in the document text there exists an identical weight of a significant word of the texts assigned to the β-th IRA (β ∈ E).

The value β (β ∈ E) for which the predicate of formula (4) evaluates to true is the β-th IRA of the document.
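The disjunction of the two predicates can be sketched as below, following their verbal descriptions, since formula (4) itself is not available. The `reference` layout, the area names, and the rule that an empty word set must not satisfy the text predicate vacuously are assumptions added for the example.

```python
def identify_ira(reference, G, W, F):
    """Return the first IRA whose graphic OR text predicate holds, else None."""
    for beta, ref in reference.items():
        # Graphic predicate: some Haar value assigned to this IRA equals
        # a classifier Haar value detected in the document.
        graphic = any(b in G for b in ref["haar"])
        # Text predicate: every detected word and every computed weight has
        # an identical counterpart among this IRA's reference words and weights.
        text = (
            bool(W)  # assumption: an empty word set does not match vacuously
            and all(w in ref["words"] for w in W)
            and all(f in ref["weights"] for f in F)
        )
        if graphic or text:
            return beta
    return None


# Hypothetical reference data for two areas of information responsibility.
reference = {
    "personnel": {"haar": {32}, "words": {"leave", "order"}, "weights": {0.5, 0.25}},
    "finance": {"haar": {17}, "words": {"invoice", "payment"}, "weights": {0.75, 0.25}},
}
text_doc = identify_ira(reference, G=set(), W={"invoice", "payment"}, F={0.75, 0.25})
graphic_doc = identify_ira(reference, G={32, 4}, W=set(), F=set())
```

When no area matches, the function returns `None`, which corresponds to handing the document to the operator as in step 3.3.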

5. The rule for constructing the predicate P_M(U, Z) that recognizes the document's confidentiality label, M = {m_λ}, where λ is the label index and λ' is the number of confidentiality labels defined in the system, is expressed by the following formula (not reproduced here):

where the predicate recognizing the k-th value of the r-th requisite is used (symbol not reproduced here);

m_o is the confidentiality label of document d_y, with m_o ∈ M;

u_β is the predicate recognizing the β-th area, where β' is the number of information areas in the system.

6. Once the confidentiality label of the document has been determined, a draft resolution of the head of the organization is generated. By its definition, the "resolution" requisite is represented as a data tuple (expression not reproduced here):

where μ_ϕ is the position title, or the surname and initials, of the ϕ-th official of the organization (institution), and ϕ' is the number of officials who report directly to the head and execute the head's instructions on incoming electronic documents;

S_ϕχ is the χ-th instruction of the head to the ϕ-th official;

the deadline for the χ-th instruction of the head to the ϕ-th official, together with the corresponding atomic predicate recognizing dates and deadlines in the informative part of the document (symbols not reproduced here);

the signature of the head (symbol not reproduced here).

7. The rule for constructing the predicate P_μ(U, M) that recognizes the official of the organization (institution) who is competent in the u_β-th area of information responsibility, holds a clearance matching the degree of restriction λ, and executes the generated instruction of the head (hereinafter, the executor) for the received electronic document d is expressed as follows (formula not reproduced here):

where the predicate recognizing the value λ of the confidentiality label m_o of the received document d_y is used (symbol not reproduced here);

λ' is the total number of confidentiality labels in the system.
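Executor selection can be sketched as a filter over the officials who report to the head. The formula is missing from this text, so the sketch rests on two assumptions: confidentiality labels are ordered integers, and a clearance "matches" a label when it is at least as high. The `Official` type and all sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Official:
    name: str
    areas: frozenset  # areas of information responsibility (IRA)
    clearance: int    # assumption: a higher number means a higher clearance


def find_executor(officials, area, label_level):
    """Return the first official competent in `area` and cleared for `label_level`."""
    for official in officials:
        if area in official.areas and official.clearance >= label_level:
            return official
    return None


officials = [
    Official("Deputy A", frozenset({"finance"}), 1),
    Official("Deputy B", frozenset({"finance", "personnel"}), 2),
]
executor = find_executor(officials, "finance", 2)
```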

8. The rule for constructing the predicate that selects an instruction from the list of ready-made instructions has the following form (formula not reproduced here):

9. After all instructions of the head (the resolution) on the document have been executed, the document is distributed to a case.

To this end, the article of the List to which the document can be assigned is determined. The rule for constructing the predicate that recognizes the List article, which evaluates to "true" provided that the document belongs to a specific IRA (u_β) and has a specific type (v_j), is expressed by the following formula (not reproduced here):

where N^s is the article of the List;

u_β is the area of information responsibility to which the document belongs;

v_j is the type of the formalized document.
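Since the List article depends only on the IRA and the document type, its recognition can be sketched as a table lookup. The table contents and article numbers below are invented placeholders and do not come from any real departmental List.

```python
# Hypothetical mapping (IRA, document type) -> article of the List.
LIST_ARTICLES = {
    ("personnel", "order"): "article 19",
    ("finance", "report"): "article 268",
}


def list_article(ira, doc_type, table=LIST_ARTICLES):
    """Return the List article for this IRA and document type, or None if absent."""
    return table.get((ira, doc_type))


article = list_article("finance", "report")
```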

10. Before the storage period of the executed document can be determined, the contour (outline) γ of the electronic document management system (EDMS) in which the document was created and is stored must be determined.

The rule for writing the predicate P_γ(Z, L) that recognizes the contour in which the executed document was created is expressed by the following formula (not reproduced here), which evaluates to "true" provided that the requisites "place of compilation", "addressee", and "receipt mark" contain unique words corresponding to the contour sought:

where γ is the EDMS contour in which the ED was created;

z_i is the predicate recognizing the i-th requisite;

the predicate recognizing the unique value ξ' of keyword q of the i-th requisite of an electronic document belonging to contour γ (symbol not reproduced here);

i = {13, 16, 22} are the document requisites under GOST 7.0.97 of 2016;

z_i is the predicate recognizing the i-th requisite for an ED belonging to the contour.
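Contour recognition by unique keywords can be sketched as follows. The requisite numbers 13, 16 and 22 are taken from the description; the keyword sets, the contour names, and the choice that a match in any one of the three requisites is sufficient are assumptions, because the formula is not available.

```python
# Requisite numbers named in the description.
CONTOUR_REQUISITES = (13, 16, 22)

# Hypothetical unique keywords per EDMS contour.
CONTOUR_KEYWORDS = {
    "internal": {"headquarters"},
    "external": {"ministry"},
}


def recognize_contour(requisites, contour_keywords=CONTOUR_KEYWORDS):
    """Return the contour whose unique keyword occurs in requisite 13, 16 or 22."""
    for contour, keywords in contour_keywords.items():
        for number in CONTOUR_REQUISITES:
            words = set(requisites.get(number, "").lower().split())
            if words & keywords:
                return contour
    return None


# `requisites` maps a requisite number to its recognized text.
contour = recognize_contour({13: "Moscow headquarters", 16: "Head of department"})
```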

11. Next, the storage period (τ) of the executed document is determined using the predicate P_τ(U, V, K), which evaluates to "true" provided that the ED is assigned to a specific article of the List and was created in a specific EDMS contour. The rule for constructing it is expressed by the following formula (not reproduced here):

where P_τ(U, V, K) is the predicate recognizing the storage period (τ) of the document;

v_j is the type of the formalized document;

the predicate recognizing contour g of the EDMS, in which the executed document was created and is stored (symbol not reproduced here);

g' is an EDMS contour other than the one in which the executed document was created and is stored.

12. Based on the values obtained in (5, 7 and 8), the case (l) to which the executed document will be distributed is determined using the predicate P_l(N^s, T, Ω), which evaluates to "true" provided that the document is assigned to a specific article of the List, has a specific storage period, and belongs to the area of activity of a specific unit (official). The rule for constructing the predicate P_l(N^s, T, Ω) is expressed by the following formula (not reproduced here):

where l is the electronic case in which the executed ED will be stored;

N^s is the predicate recognizing the List article to which the ED is assigned;

T is the predicate recognizing the storage period of the document;

Ω is the predicate recognizing the official to whose area of activity the document belongs;

τ is the storage period of the document;

μ is the official (structural unit) to whose area of information responsibility the document is assigned;

the article of the List to which the document is assigned (symbol not reproduced here).
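Steps 11 and 12 can be sketched together: the storage period is looked up from the List article and the contour, and the case is then the one whose article, storage period, and official all match. The tables, case identifiers, and periods below are invented for illustration.

```python
from typing import NamedTuple


class Case(NamedTuple):
    case_id: str
    article: str
    storage_years: int
    official: str


# Hypothetical table (List article, EDMS contour) -> storage period in years.
STORAGE_YEARS = {
    ("article 12", "internal"): 5,
    ("article 12", "external"): 10,
}

# Hypothetical register of electronic cases.
CASES = [
    Case("01-05", "article 12", 5, "Deputy B"),
    Case("01-06", "article 12", 10, "Deputy B"),
]


def storage_period(article, contour):
    """Step 11: storage period for this article and contour, or None if unknown."""
    return STORAGE_YEARS.get((article, contour))


def find_case(article, contour, official, cases=CASES):
    """Step 12: the case whose article, storage period and official all match."""
    years = storage_period(article, contour)
    for case in cases:
        if (case.article, case.storage_years, case.official) == (article, years, official):
            return case
    return None


case = find_case("article 12", "internal", "Deputy B")
```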

13. Before the document is distributed to a case, the restrictive confidentiality labels of the document and of the case must be checked for consistency. Based on the data obtained in (4), the rule for writing the predicate that recognizes permission to distribute the document to the case is constructed (formula not reproduced here):

where P_θ(M^d, M^Δ) is the predicate recognizing whether the document may be distributed to the case;

m^d is the predicate recognizing the restrictive confidentiality label of the case;

m^Δ is the predicate recognizing the restrictive confidentiality label of the document.

The method for automatic classification of formalized electronic graphic and text documents in an electronic document management system with automatic formation of electronic cases is based on three classifiers. Training them requires three blocks of training document sets, each with its own specific features.

According to the proposed method, each text document d_x is represented by the Cartesian product of variables from the sets T × L × W, and each graphic document d_y by the Cartesian product of variables from the sets T × L × G.

A specific feature of the training documents for the classifier of text documents by areas of information responsibility is the creation of a dictionary of significant words containing the base word forms of all words that occur in the training documents.

When a text document is classified, not all word forms from the document dictionary are taken into account, but only those included in the working dictionary of the classifier. The working dictionary contains the word forms that are most informative for deciding whether a document belongs to a given area (label). The informativeness of word form w_p for the classifier of information area u_β is determined by the following formula (not reproduced here):

В рабочий словарь классификатора включаются все словоформы, не попавшие в стоп-словарь, информативность которых превышает заданный порог информативности ε. Стоп-словарь состоит из словоформ, частоты встречаемости которых во множестве обучающих документов превышают заранее установленный порог δ. При этом могут отсекаться слова, не несущие смысловой нагрузки, такие как предлоги, союзы, вводные и общие слова и т.д. Значения коэффициента δ, согласно данному способу, устанавливаются в пределах от 0.05 до 0.7 и могут быть различны в зависимости от специфики и условий его использования. Количество предикатов в системе предикатов определяются количеством ОИО должностных лиц, на которые необходимо классифицировать документы.The working dictionary of the classifier includes all word forms that are not included in the stop dictionary, the information content of which exceeds a given threshold of information content ε. The stop dictionary consists of word forms, the frequency of occurrence of which in the set of training documents exceeds a predetermined threshold δ. In this case, words that do not carry a semantic load, such as prepositions, conjunctions, introductory and general words, etc., can be cut off. The values of the coefficient δ, according to this method, are set in the range from 0.05 to 0.7 and can be different depending on the specifics and conditions of its use. The number of predicates in the predicate system is determined by the number of OIO officials into which documents must be classified.
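The selection of the stop dictionary and the working dictionary can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the description: "frequency of occurrence" is read as the share of training documents that contain a word form, the informativeness scores are taken as ready-made inputs rather than computed, and all function names and sample words are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def build_stop_dictionary(docs: List[List[str]], delta: float) -> Set[str]:
    """Word forms found in more than a share `delta` of the training documents.

    Assumption: frequency is measured as document frequency.
    """
    doc_freq = Counter()
    for word_forms in docs:
        doc_freq.update(set(word_forms))  # count each word form once per document
    total = len(docs)
    return {w for w, count in doc_freq.items() if count / total > delta}


def build_working_dictionary(informativeness: Dict[str, float],
                             stop_dictionary: Set[str],
                             epsilon: float) -> Set[str]:
    """Word forms outside the stop dictionary whose score exceeds `epsilon`."""
    return {w for w, score in informativeness.items()
            if w not in stop_dictionary and score > epsilon}


# Hypothetical training documents, already reduced to base word forms.
docs = [
    ["the", "contract", "payment"],
    ["the", "order", "staff"],
    ["the", "contract", "delivery"],
    ["the", "leave", "staff"],
]
stop = build_stop_dictionary(docs, delta=0.7)

# Hypothetical informativeness scores for one information area.
scores = {"the": 0.9, "contract": 0.6, "staff": 0.2, "payment": 0.4}
working = build_working_dictionary(scores, stop, epsilon=0.3)
```

In this toy data only "the" occurs in more than 70% of the documents, so it lands in the stop dictionary even though its score is high; "staff" is dropped because its score is below ε.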

The particular feature of the training documents for the classifier of graphic documents by areas of information responsibility is the use of documents whose informative part is a graphic image composed of graphic elements. Not all graphic elements present in the documents are used, but only those that relate to one specific area of information responsibility.

Graphic documents are classified with the Viola-Jones method, which is based on n-dimensional Haar functions and uses a specially trained classifier consisting of cascades of Haar features to identify graphic elements. Each cascade of Haar features can detect one kind of graphic element in a document.

Graphic elements for training the classifier are extracted from the presented set of manually classified graphic documents. Training requires a set of positive samples, which are images of the graphic element to be detected in a document, shown in different variations (at different angles relative to the document, on different backgrounds, in different color schemes and so on, depending on the type of document), and a set of negative samples (images that do not contain the graphic element to be detected).

The reliability of the classifier depends on the size of the training sample, the quality of the prepared samples of graphic elements and the choice of training settings (the number of levels in each cascade of Haar features, the training quality coefficient and the false alarm level). According to this method, the settings are chosen within the following limits: the number of levels in a cascade of Haar features is at least 18, the training quality coefficient is from 0.9 to 0.95 and the false alarm level is from 0.4 to 0.5; they may vary depending on the specifics and conditions of use. The number of predicates in the predicate system is determined by the number of areas of information responsibility into which documents must be classified.
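To show how these three settings interact, the sketch below checks them against the stated limits and computes the rates of the whole cascade. It assumes, which the description does not state, that the "training quality coefficient" is the minimum detection rate required of each level and that the "false alarm level" is the maximum false alarm rate allowed for each level. Under that reading, and following the original Viola-Jones analysis, the rates of the whole cascade are the products of the per-level rates. The function name is hypothetical.

```python
from typing import Tuple


def cascade_rates(num_levels: int,
                  level_hit_rate: float,
                  level_false_alarm: float) -> Tuple[float, float]:
    """Return (overall hit rate, overall false alarm rate) for a cascade.

    Assumption: every level just meets its per-level targets, so the
    overall rates are the per-level rates raised to the number of levels.
    """
    # Limits taken from the description of the method.
    if num_levels < 18:
        raise ValueError("the method calls for at least 18 levels")
    if not 0.9 <= level_hit_rate <= 0.95:
        raise ValueError("training quality coefficient outside 0.9..0.95")
    if not 0.4 <= level_false_alarm <= 0.5:
        raise ValueError("false alarm level outside 0.4..0.5")

    overall_hit = level_hit_rate ** num_levels
    overall_false_alarm = level_false_alarm ** num_levels
    return overall_hit, overall_false_alarm


hit, false_alarm = cascade_rates(18, 0.95, 0.5)
print(f"overall hit rate ~ {hit:.3f}, overall false alarm rate ~ {false_alarm:.2e}")
```

The point of the example is the trade-off: each additional level multiplies the false alarm rate down quickly, but it also lowers the overall detection rate, so the per-level quality coefficient has to be chosen together with the number of levels.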

The particular feature of the training documents for the classifier that forms electronic cases is that they have been manually classified into cases in advance, taking into account the correspondence between the confidentiality labels of the document and the case, and that they contain a non-empty attribute "mark of execution and filing in the case".

From the presented set of manually classified documents, a system of predicates for identifying text features is formed; the number of predicates in the system is determined by the number of articles of the List into which documents must be classified. The predicates are stored in the database.

From the presented set of manually classified documents, predicate systems are formed for determining the EDMS contour in which the executed document was developed. The number of predicates in the system is determined by the number of contours defined in the information system. The predicate systems are stored in the database.

From the presented set of classified documents, the storage period of the executed document is determined and predicate systems for determining the storage period are formed. The number of predicates in the system is determined by the number of possible storage periods defined by the List. The predicate systems are stored in the database.

From the presented set of classified documents, the case into which the executed document will be filed is determined and predicate systems for determining that case are formed. The number of predicates in the system is determined by the number of possible cases defined by the organization's nomenclature of cases. The predicate systems are stored in the database.

The automatic system for classifying formalized text and graphic documents functions as follows:

1. The document is converted from its storage format into natural-language text.

2. To determine the officials' area of information responsibility, the type of the informative part of the document (text or graphic) is determined. For this purpose, the presence of the attribute "text of the document" Z₁₈ is checked using rule (1).

2.1. If the predicate system (1) takes the value "true" for the attribute Z₁₈, the words of the text are converted into base word forms, insignificant words are discarded, the weights of the words in the text are calculated and the resulting values are substituted into the predicate system (4) stored in the database. The predicates that take the value "true" uniquely determine the officials' areas of information responsibility to which the document belongs.

2.2. If the predicate system (1) takes the value "false" for the attribute Z₁₈, the document is searched for graphic elements using the classifier according to rule (3), and the values of the detected graphic elements are substituted into the predicate system (4) stored in the database. The predicates that take the value "true" uniquely determine the areas of information responsibility to which the document belongs.

2.3. If the predicate system (1) takes the value "false" for the attribute Z₁₈ and the search according to rule (3) finds no graphic elements in the document, the system passes to the training block, where the operator manually determines the document's area of information responsibility.

3. Using the document metadata extracted by (1), the corresponding confidentiality label is determined: the metadata values are substituted into the system of predicates built according to (5), and the predicate that takes the value "true" determines the confidentiality label.

4. The draft of the "resolution" attribute is built in two steps. First, the values of the areas of information responsibility determined by (4) and the value of the document's confidentiality label determined by (5) are substituted into the system of predicates built according to (6), and the predicates that take the value "true" determine the executor. Second, the values of the document attributes determined by (1), the document type determined by (2) and the area of information responsibility determined by (4), together with the values of individual keywords, are substituted into the system of predicates built according to (7), and the predicates that take the value "true" determine the specific instructions. The obtained executor and instruction values are supplemented with the execution deadlines, determined by atomic recognition predicates, and with the date of receipt of the document, determined by any means. The result is a data tuple that is assigned to the "resolution" attribute of the received document.

5. The values of the area of information responsibility obtained by (4) and the document type obtained by (2) are substituted into the system of predicates built according to rule (8), and the predicates that take the value "true" determine the article of the List to which the executed document can be attributed. The article of the List determined by (8) and the values determined by (9) are substituted into the system of predicates built according to rule (11), and the predicates that take the value "true" determine the case into which the executed document will be filed.

6. The document's confidentiality label determined by (5) and the value of the recognition predicate for the confidentiality label of the selected case are substituted into the system of predicates built according to rule (12), and the predicates that take the value "true" determine whether the executed document can be filed in the identified case.
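The branching in steps 2.1 to 2.3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: a document is modelled as a plain dictionary, word weights are taken to be relative frequencies, the graphic-element detector is a stub standing in for the cascade classifier of rule (3), and every name and sample predicate is hypothetical. An empty result stands for handing the document over to the operator.

```python
from collections import Counter
from typing import Callable, Dict, List

Features = Dict[str, float]
Predicate = Callable[[Features], bool]


def text_weights(doc: dict) -> Features:
    """Assumed weighting: relative frequency of each word in the text."""
    words = doc["text"].lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def find_graphic_elements(doc: dict) -> Features:
    """Stub for the rule (3) detector: reads pre-detected element counts."""
    return dict(doc.get("elements", {}))


def determine_ira(doc: dict, ira_predicates: Dict[str, Predicate]) -> List[str]:
    """Return the areas whose predicate is true; [] means 'ask the operator'."""
    if doc.get("text"):                        # attribute Z18 present: step 2.1
        features = text_weights(doc)
    else:                                      # attribute Z18 absent: step 2.2
        features = find_graphic_elements(doc)
        if not features:                       # nothing detected: step 2.3
            return []
    # One predicate per area, standing in for the system built by rule (4).
    return [area for area, holds in ira_predicates.items() if holds(features)]


# Hypothetical predicate system with two areas.
ira_predicates = {
    "finance": lambda f: f.get("payment", 0) > 0.3,
    "personnel": lambda f: f.get("staff_chart", 0) >= 1,
}

print(determine_ira({"text": "payment payment order"}, ira_predicates))
print(determine_ira({"elements": {"staff_chart": 2}}, ira_predicates))
print(determine_ira({}, ira_predicates))
```

Keeping the predicates in a dictionary keyed by area mirrors the statement that the number of predicates equals the number of areas into which documents are classified.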

Note that this method is intended for processing machine-readable natural-language texts and formalized graphic documents whose image is composed of defined conventional graphic elements.

A comparative analysis of the claimed solution with the prototype shows that the proposed method differs in detecting graphic elements in the informative part of the document using rule (3) and in determining the officials' area of information responsibility for formalized graphic and text documents using the improved rule (4).

Owing to the new combination of essential features, the method makes it possible to automate the distribution of formalized electronic documents among the officials' areas of information responsibility in cases where the informative part of the document is represented by the "text" attribute or by a graphic image.

An analysis of the prior art established that there are no analogues characterized by a set of features identical to the features of the claimed technical solution, which indicates that the claimed method satisfies the "novelty" condition of patentability.

A search for known solutions in this and related fields of technology, carried out to identify features coinciding with the features that distinguish the claimed subject matter from the prototype, showed that these features do not follow explicitly from the prior art. Nor does the prior art disclose the distinctive essential features that produce the same technical result as the claimed method. Therefore, the claimed invention satisfies the "inventive step" condition of patentability.

Automatic classification of formalized electronic graphic and text documents in the electronic document management system with automatic formation of electronic cases is carried out as follows:

1. In classification mode.

1.1. When a new formalized electronic document (ED) d appears in input block 1, it passes to block 2, which identifies the values of the text characteristics t of the document sections and of the keywords l in them. The values of t and l pass to block 3, where the document attributes z are recognized by the system of predicates built according to rule (1). Information about the recognized attributes z passes to block 4, where the system of predicates built according to rule (2) recognizes the document type v.

1.2. In block 5, the required attribute values, which are used as document metadata, are extracted from the document d received from block 2. The extraction uses the document type v determined in block 4: since each type has a form prescribed by regulatory documents, the type specifies the locations and values of the document attributes. From block 5, the document d and its metadata z pass to block 12, where the document is registered by its metadata and a reference copy of it is stored.

1.3. Based on the values of the text characteristics t of the document sections and of the keywords l in them, block 6 uses the system of predicates built according to rule (1) to detect the presence of the attribute "text of the document" Z₁₈.

1.3.1. If the predicate system (1) takes the value "true" for the attribute Z₁₈, the informative text part of the document d' passes from block 6 to block 7, where the words are converted into word forms, and then to block 8, where the working dictionary of significant words is built up during operation of the system.

The word forms d'' obtained in block 7 also pass to block 9, which calculates the weights ƒ of those word forms of the informative text part that are included in the working dictionary.

1.3.2. If the predicate system (1) takes the value "false" for the attribute Z₁₈, the document d passes from block 6 to block 10, where the classifier detects the graphic elements g in the image according to rule (3).

1.3.3. If both the system of predicates built according to rule (1) for the attribute Z₁₈ and the system of predicates built according to rule (4) take the value "false", the system passes to training block 13 and the operator manually determines the officials' area of information responsibility for the document.

1.4. In block 11, the information area u_β is recognized by evaluating the predicates of the system built according to rule (4), using either the weights ƒ of the word forms received from block 9 or the graphic elements g detected in the document and received from block 10, depending on the type of the received document. The data on the information area u_β to which the document belongs are then passed to block 12 and attached to the document metadata.

1.5. In block 14, the confidentiality label m_o corresponding to the classified document is determined by the system of predicates built according to rule (5), using the document attributes z received from block 12 and the area of information responsibility u_β obtained in block 11. The data on the confidentiality label are then passed to block 12 and attached to the document metadata.

1.6. In block 16, the executor Zp of the received document is determined from the data received from blocks 12 and 14 by the system of predicates built according to rule (6). The executor is passed to block 12, where it is stored for further processing as part of the metadata set, and to block 15 for selecting the addressee. Also in block 16, the instruction to the executor is determined from the data received from blocks 9 and 12 by the system of predicates built according to rule (7), and the informative part of the document is processed with atomic predicates that recognize execution deadlines. All the data obtained are combined into a tuple and passed to block 12, where they are added to the metadata and assigned to the "resolution" attribute.

1.7. From block 12, the document d and the metadata MD pass to block 15. Based on the values received from block 12, block 15 forms the access restriction m_o on the classified document that corresponds to its confidentiality label and sends the document to the executor.

1.8. Through block 16, the document is loaded into the information system in accordance with the determined classes.

1.9. After the executor has executed the document (that is, carried out all the instructions defined in the resolution), the following are passed from block 17 to block 18: the document metadata MD held in the system (the attributes marked on the document, the document type, the officials' area of information responsibility to which the document is assigned and the confidentiality label) and information about the executor (the structural unit of the organization) that executed the document. In addition, the weights ƒ of the significant words contained in the document are passed from block 9 to block 18. The systems of predicates built according to rules (8), (9), (10) and (11) then determine the case into which the executed document will be filed. The system of predicates built according to rule (12) checks that the confidentiality label level of the executed document, m^d, corresponds to that of the case, m^Δ, into which it is filed. The data obtained are combined into a tuple and passed to block 12, where they are assigned to the attribute "mark of execution and filing in the case" Z₃₀.

1.10. From block 12, the document d and the metadata MD pass to block 14. Based on the values received from blocks 12 and 18, block 14 forms the access restriction on the classified document that corresponds to its confidentiality label, and the document is sent to the corresponding case subject to the established restrictions.
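The confidentiality check performed before a document is filed (steps 1.9 and 1.10) can be illustrated with a short Python sketch. The text only says that the label levels of the document and the case must correspond; the sketch assumes this means that the document's level must not exceed the level of the case. The label names, the ordering and the function names are all hypothetical.

```python
from enum import IntEnum
from typing import Dict, List


class Label(IntEnum):
    """Hypothetical ordered confidentiality labels (higher = more restricted)."""
    OPEN = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


def labels_correspond(doc_label: Label, case_label: Label) -> bool:
    """Assumed reading of rule (12): the case must be at least as restricted."""
    return doc_label <= case_label


def file_document(doc_id: str, doc_label: Label,
                  case_id: str, case_label: Label,
                  cases: Dict[str, List[str]]) -> bool:
    """File the document into the case only if the labels correspond."""
    if not labels_correspond(doc_label, case_label):
        return False  # left for the operator to resolve
    cases.setdefault(case_id, []).append(doc_id)
    return True


cases: Dict[str, List[str]] = {}
ok = file_document("doc-1", Label.INTERNAL, "case-07", Label.CONFIDENTIAL, cases)
rejected = file_document("doc-2", Label.CONFIDENTIAL, "case-03", Label.OPEN, cases)
```

If correspondence is instead meant as strict equality of levels, only the comparison in `labels_correspond` would change.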

2. In additional training mode.

The system uses the additional training mode in the following cases:

2.1. if the predicate system in block 3 cannot recognize the document requisites from the values of the document variables t and l (in this case the system operator either modifies the predicate system of block 3 through block 13 or determines the document requisite manually);

2.2. if the predicate system in block 4 cannot recognize the document type from the values of the predicates of the predicate system of block 3 (in this case the system operator either modifies the predicate system of block 4 through block 13 or determines the document type manually);

2.3. if the predicate system in block 11 cannot recognize the OIO of officials from the values of the graphic elements identified in the document or from the weights of the significant words from the working dictionary extracted from the informative part of the document (in this case the system operator either modifies the predicate system of block 11 through block 13 or determines the OIO of the document manually);

2.4. if the predicate system in block 14 cannot recognize the confidentiality label of the document from the values of the predicates of the predicate system of block 11 and the metadata of block 12 (in this case the system operator either modifies the predicate system of block 14 through block 13 or determines the confidentiality label manually);

2.5. if changes are made to the draft resolution in the part concerning assignments selected from the list of ready-made assignments, the corrected assignments are automatically added to that list through block 13;

2.6. if the predicate in block 17 cannot recognize the case into which the executed document should be distributed from the values received from blocks 12 and 17 (in this case the system operator either modifies the predicate systems of block 18 through block 12 or determines manually the case into which the executed document should be distributed).
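
The cases listed above share one pattern: when no predicate takes the value "true", the operator either supplies the answer manually or extends the predicate system. A minimal sketch of that fallback, assuming predicates are plain functions over a feature dictionary and that a hypothetical `ask_operator` callback stands in for the operator's input through block 13:

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]
Predicate = Callable[[Features], bool]
# The operator returns a label and, optionally, a new predicate for that label.
Operator = Callable[[Features], Tuple[str, Optional[Predicate]]]


def classify(features: Features, predicates: List[Tuple[str, Predicate]]) -> Optional[str]:
    """Return the label of the first predicate that evaluates to True."""
    for label, predicate in predicates:
        if predicate(features):
            return label
    return None


def classify_with_fallback(
    features: Features,
    predicates: List[Tuple[str, Predicate]],
    ask_operator: Operator,
) -> str:
    label = classify(features, predicates)
    if label is not None:
        return label
    # No predicate fired: defer to the operator (manual decision).
    label, new_predicate = ask_operator(features)
    if new_predicate is not None:
        # The operator chose to extend the predicate system.
        predicates.append((label, new_predicate))
    return label
```

Once the operator has added a predicate, later documents with similar features are classified without further manual input.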

Thus, the method makes it possible to automatically classify formalized electronic graphic and text documents by OIO in an electronic document management system with automatic formation of electronic cases, thereby achieving the claimed technical result.

Claims

A method for automatic classification of formalized electronic graphic and text documents in an electronic document management system with automatic formation of electronic cases, in which the areas of a formalized document are determined for extracting metadata and the informative part; for text documents, the document is converted from its storage format into natural-language text, the words of the processed document are converted into basic word forms, insignificant words are discarded, the weights of the words in the document are computed according to their frequencies of occurrence, the specific type of the electronic document is determined from the recognized requisites and the values of the keywords of these requisites, and the area of information responsibility (hereinafter OIO) is determined from the values of the word weights in the document; when the words of the document are converted into basic word forms, individual words and phrases corresponding to the time intervals for carrying out the activity specified by the document are selected and left unchanged, thereby forming a vector of data on the deadlines for executing the document;

on the basis of the determined OIO, as well as a priori information about the structure of the organization (institution), including the subordination relationships between its officials and their levels of clearance for the various degrees of document confidentiality, a first set of classification features is formed; on the basis of the determined document type and the OIO to which the document belongs, a second set of classification features is formed using predicates for recognizing keywords and individual requisites of the formal part; on the basis of the determined OIO and document type, a third set of classification features is formed and the article of the List of documents with their storage periods (hereinafter the List), stored in the database, to which the executed document can be attributed is determined; on the basis of the previously determined requisites and the unique keywords related to them, a fourth set of classification features is formed and the contour of the electronic document management system (hereinafter EDMS) in which the document was developed is determined; on the basis of the determined article of the List and the EDMS contour, a fifth set of classification features is formed and the storage period of the executed document is determined; on the basis of the determined article of the List, the determined storage period of the executed document and the executor of the document, a sixth set of classification features is formed and the case into which the executed document will be distributed is determined; on the basis of the determined confidentiality label of the document and the known confidentiality label of the case into which the document will be distributed, the correspondence between the confidentiality labels of the document and of the case is checked;

at the training stage, a system of predicates for determining the OIO and a system of predicates for identifying the confidentiality label of a document are formed from a set of manually classified documents, and these predicate systems are saved in the database; from a set of documents in which the "resolution" requisite has been filled in manually, a system of predicates for identifying the executor of the assignment for the received document and a system of predicates for identifying the assignment are formed and saved in the database; from a set of manually classified documents, a system of predicates for determining the article of the List, a system of predicates for determining the EDMS contour in which the document was developed and a system of predicates for determining the storage period of a document are formed and saved in the database; a system of predicates for determining the case into which the executed document will be distributed and a system of predicates for checking the correspondence between the confidentiality labels of a document and of the case into which it will be distributed are formed and saved in the database;

when documents are classified, a decision is made, using the database, on the relevance of the document to each of the information areas and each of the confidentiality labels; the first set of classification features is substituted into the system of predicates for identifying the executor of the assignment and, according to the predicates that take the value "true", a decision is made to assign the document to the competence of specific employees subordinate to the head; the second set of classification features is substituted into the system of predicates for identifying the assignment and, according to the predicates that take the value "true", a decision is made to give the executors specific assignments for executing the received document; the data obtained on the executor, the assignment and the deadline, together with data on the date of consideration of the document obtained by any means, are combined into a data tuple and assigned to the document requisite "resolution"; a decision is made, using the database, on the relevance of the document to each article of the List, the third set of classification features is substituted into the system of predicates for determining the article of the List and, according to the predicates that take the value "true", a decision is made to attribute the document to a specific article of the List; the fourth set of classification features is substituted into the system of predicates for determining the EDMS contour and, according to the predicates that take the value "true", a decision is made on the EDMS contour in which the document was developed; the fifth set of classification features is substituted into the system of predicates for determining the storage period of the document and, according to the predicates that take the value "true", a decision is made to assign a storage period to the executed document; the sixth set of classification features is substituted into the system of predicates for determining the case into which the executed document will be distributed and, according to the predicates that take the value "true", a decision is made on the case into which the executed document must be distributed;

on the basis of information about the characteristics of the document and the requisites recognized in it, the type of the document, graphic or text, is determined; if the document is determined to be graphic, the presence in it of graphic elements classified by the OIO of officials is determined using an image recognition technology for graphic elements, the Viola-Jones algorithm, and, if graphic elements are detected, the resulting values are substituted into the predicate system by which the OIO of officials is determined for the graphic document; at the training stage, sets of graphic elements related to specific OIO of officials are formed from a set of manually classified graphic documents, and a classifier of graphic elements is trained on these sets using the Viola-Jones algorithm; a system of predicates for determining the OIO of officials is formed from the values of the classifier and saved in the database; at the stage of classifying a graphic electronic document, the set of values obtained for the identified graphic elements is substituted into the system of predicates for determining the OIO of officials and, according to the predicates that take the value "true", a decision is made to attribute the graphic document to the identified OIO of officials.
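
As an illustration of the word-weighting step recited in the claim, the sketch below computes weights as relative frequencies after discarding insignificant words, and evaluates a simple threshold predicate for one OIO. It is a simplification under stated assumptions: conversion to basic word forms is skipped, and the stop-word set, the keyword list, and the threshold value are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set


def word_weights(text: str, stop_words: Set[str]) -> Dict[str, float]:
    """Relative frequency of each significant word (no lemmatization)."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stop_words]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def oio_predicate(weights: Dict[str, float], keywords: Iterable[str], threshold: float) -> bool:
    """True if the combined weight of the area's keywords reaches the threshold."""
    # Assumption: one predicate per OIO, defined by a keyword list and a threshold.
    return sum(weights.get(k, 0.0) for k in keywords) >= threshold
```

In the method itself the predicate systems are formed at the training stage from manually classified documents; the fixed threshold here only shows how a predicate could be evaluated once formed.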