CN111090986A

CN111090986A - Method for correcting errors of official document

Info

Publication number: CN111090986A
Application number: CN201911197178.9A
Authority: CN
Inventors: 李建华; 谢可; 庄莉; 梁懿; 苏江文; 王秋琳; 刘泽三; 邱镇
Original assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Current assignee: State Grid Information and Telecommunication Co Ltd; Fujian Yirong Information Technology Co Ltd; Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-05-01

Abstract

A method for correcting errors in official documents comprises the following steps. A document type detection step trains a document type recognition model by machine learning and classifies documents into the types notice, circular, reply, report, letter, meeting minutes, and request for instructions. An error detection step segments the text with a Chinese word segmenter, detects errors at both character granularity and word granularity, and merges the suspected errors found at the two granularities into a candidate set of suspected error positions. A bidirectional character-level N-gram LM deep learning model scores the characters in each sentence; low-scoring positions are treated as positions to be corrected and are combined with their context for dictionary lookup, and when none of the combinations can be found in the dictionary, the position is regarded as a wrong character and added to the error position candidate set. The scheme combines review requirements such as document-writing norms, incomplete content, unclear topic statement, grammar error correction, fluency detection, and context association with innovative modifications and combinations of existing technical schemes; tests show that it effectively improves error correction for enterprise electronic official documents.

Description

Method for correcting errors of official document

Technical Field

The invention relates to the field of text analysis, and in particular to a method for assisting error correction of official documents.

Background

With the continued advance of informatization and the rapid development of paperless office work, business departments at all levels generate large numbers of electronic documents that serve as information resources for enterprise production and operation. Document quality control and management bear directly on enterprise image and office efficiency, and ensuring the quality of enterprise official documents in particular is highly challenging and specialized work. Providing guidance, error correction, and assistance in real time wherever the drafter is working therefore gives the drafter the fullest possible help during drafting and strengthens quality management of official document content at the source.

Although quality problems in enterprise official documents are complicated and take many forms, they generally fall into two classes, form and content: formal problems, typified by errors in element layout and format, and content problems, typified by deviations in element content. An official document management system can guide and control document format, document-writing rules, and the like intelligently and in real time, and can build the company's document approval rules and operational control policies into the error correction and checking of electronic official documents through a clear and friendly human-computer interface. This can greatly improve the quality of enterprise official document management, promote standardized and informatized development, and support the growth of the enterprise.

The invention provides a method and a system for correcting errors in enterprise electronic official documents. It makes full use of the characteristics of enterprise official documents to design targeted algorithms and solutions, thereby effectively improving the accuracy, coverage, and effect of error correction.

Disclosure of Invention

Therefore, a method for correcting errors in official documents needs to be provided, to solve the problem that error correction for specific document types is not comprehensive enough.

To achieve the above object, the inventors provide a method for correcting errors in official documents, comprising: a document type detection step of training a document type recognition model by machine learning and classifying the document type as notice, circular, reply, report, letter, meeting minutes, or request for instructions;

an error detection step of segmenting the text with a Chinese word segmenter, detecting errors at both character granularity and word granularity, and merging the suspected errors found at the two granularities into a candidate set of suspected error positions;

scoring the characters in a sentence with a bidirectional character-level N-gram LM deep learning model, treating low-scoring positions as positions to be corrected, combining each position to be corrected with its context for dictionary lookup, and, when none of the combinations can be found in the dictionary, regarding the position as a wrong character and adding it to the error position candidate set;

judging with a traditional language model whether the input word sequence conforms to a given grammar, analyzing the syntactic structure of sentences that conform, scoring them, and adding syntactic structures scoring below a threshold to the specification error candidate set;

a knowledge computing step of correcting errors with local knowledge along two dimensions, text association and text understanding, wherein correction by associated knowledge comprises gathering accurate local knowledge related to the original title by searching a standard corpus for the original erroneous title or by context pattern matching, and using that local knowledge to assist error correction ranking; and correction by text understanding comprises performing semantic analysis on the text to obtain semantic features, and using an LSTM model to represent the semantic features and apply them to the error correction ranking model.

Further, the method also comprises candidate recall, which includes generating candidate recalls in combination with official document specifications and content detection, and generating error correction candidates based on an HMM and a graph-theoretic method.

Specifically, establishing the document type recognition model comprises the following steps:

based on a dictionary matching method, searching the text for vocabulary in the word bank of document type K;

and extracting the lexical expression pattern of each title from the text, screening out newly found patterns, adding them to the candidate pattern library of type K, calculating the score of each candidate pattern, and adding patterns whose score is greater than a threshold T1 to the pattern library T of type K.

Compared with the prior art, the method makes full use of the characteristics of enterprise electronic documents (a distinct political character, drafting and issuance by a statutory author, statutory authority and administrative binding force, strict timeliness, and a specific style), combines review requirements such as document-writing norms, incomplete content, unclear topic statement, grammar error correction, fluency detection, and context association, and innovatively modifies and combines existing technical schemes. Tests show that it effectively improves error correction for enterprise electronic documents.

Drawings

FIG. 1 is a flow chart according to an embodiment of the present invention;

Detailed Description

To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.

The method performs comprehensive error detection on a given enterprise official document, covering spelling errors, word errors, grammatical errors, syntax errors, errors against document-writing norms, and violations of the company's operational control specifications. The core steps of document error correction are error detection, candidate recall, and error correction evaluation and ranking. Unlike existing methods, this method targets error correction for enterprise electronic documents: it combines rule-based and deep learning approaches, adds the review requirements of the enterprise's document-handling regulations and operational control specifications, and builds an error correction approach tailored to enterprise documents. Enterprise electronic official documents mainly include the types notice, circular, reply, report, letter, meeting minutes, request for instructions, and the like, and have the following notable characteristics:

1) They have a clearly defined formal specification. The formal specification is the outward form of the official document format, that is, the rules for each document element in terms of font, font size, placement, arrangement, and so on. To preserve the seriousness and appearance of the official document form, all document elements are subject to relatively uniform, specific, and strict format rules.

With respect to document-writing norms, for example in drafting official document titles, common problems include mismatched attribution, misuse of document type, combined use of document types, and missing elements. Common examples:

"Combined use of document types" means that two statutory document types are used in the title at the same time. Common errors include "report on ……", "petition on ……", "notice (announcement, resolution) on …… decision (reply)", and the like. Error correction reasonably selects one of the document types according to the content of the document and the relationship between the issuing and receiving parties; for example, an inseparable "request report" is changed into a "letter".

"Missing elements": an official document title generally consists of three elements, namely the document type, the issuing unit, and the subject matter, and these cannot be omitted arbitrarily; even where omission is permitted, two of the three elements should not be omitted together. Common errors include titles such as "determined by the company XX" and "modified by the company XX about ……". A title consisting only of the document type is also irregular, such as "Notification" and "(x issue [19xx]) 140 notification".

2) They have core content centered on a subject. The content is the concrete presentation built around a subject, that is, the specific expression of each document element in a particular official document. How should the secrecy level, urgency, and so on be determined for an individual document? How should the document issue number be written? The wording of the title, the distinction between main recipients and copied recipients, the identification of countersigning units, and the structure, language, and punctuation of the body text all have clear requirements.

Regarding content, for example, under the company's document management measures, documents concerning particular functions must be countersigned by the corresponding functional unit: arrangements for large meetings and events that involve the leadership of the group company must be countersigned by the office; personnel management and labor wage matters must be countersigned by the human resources department; financial management matters must be countersigned by the finance department; and legal matters such as litigation and arbitration must be countersigned by the legal department. To guide the drafter in correctly identifying countersigning units, the system automatically determines, as far as possible, which functional departments must countersign a document by matching keywords in information such as the document type and title, and by default displays recommended or mandatory options in the countersigning unit field. As further examples, documents involving company secrets should indicate the secrecy level and the duration of confidentiality, and "confidential" and "confidential" level documents should also indicate the number of copies. The issuing unit mark should use the full name or standardized abbreviation of the issuing unit; for jointly issued documents, the host unit is listed first. The document issue number should include the company's organization code, the year, and the serial number, and a jointly issued document carries only the issue number of the host unit.
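The keyword-based recommendation of countersigning units described above can be sketched as follows. This is a minimal illustration only: the table `COUNTERSIGN_RULES`, the function `recommend_countersign_units`, and the English keywords are hypothetical names and values chosen for the example, not rules taken from the patent.

```python
# Hypothetical keyword table. Department names follow the examples in the text;
# the keywords themselves are illustrative assumptions.
COUNTERSIGN_RULES = {
    "office": ["large meeting", "event arrangement"],
    "human resources department": ["personnel", "labor", "wage"],
    "finance department": ["financial"],
    "legal department": ["litigation", "arbitration"],
}


def recommend_countersign_units(doc_type: str, title: str) -> list[str]:
    """Return the departments whose keywords occur in the document type or title."""
    text = f"{doc_type} {title}".lower()
    return [
        unit
        for unit, keywords in COUNTERSIGN_RULES.items()
        if any(keyword in text for keyword in keywords)
    ]


units = recommend_countersign_units(
    "request", "Request on arbitration of a financial dispute"
)
print(units)
```

A production system would presumably distinguish recommended from mandatory options and match on segmented Chinese keywords rather than raw substrings.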

Referring to FIG. 1, the invention makes full use of the above characteristics of enterprise official documents and improves the algorithms and processes of existing document error correction. The main flow is as follows:

Step 1: Prepare the data, dictionaries, and error models. Collect sample data from standard enterprise official documents of past years, and label, classify, and archive them by document type. In accordance with the relevant document-handling regulations, and combining the standard wording and typical common errors of each document type, construct dictionaries covering spelling errors, word errors, grammatical and syntax errors, errors against document-writing norms, and the company's operational control specifications. In addition to general phonetic-similarity and shape-similarity dictionaries, construct a language model, a word model, a grammar model, a syntax model, and a specification model for each document type, to assist judgment during error detection.

On top of traditional spelling error correction, an error correction review model is added: multi-dimensional error detection indicators are established according to the document-writing norms of enterprise official documents and the review requirements for subject content, and a corresponding scoring mechanism is set.

For example, for the title norm problems described above, the title elements are the issuing unit, the subject matter, and the document type, and the model is established as follows:

(1) Based on a dictionary matching method, search the text for vocabulary in the word bank of document type K.

(2) Extract the lexical expression pattern of each title from the text, screen out the newly found patterns, and add them to the candidate pattern library of type K.

The elements involved in a pattern include subject words, relation words, patterns, and auxiliary stop words. The combination is the basis for generating a pattern: a pattern is generated by annotating a combination with its result.

A pattern consists of a combination plus a result.

The format is "combination => entity 1, entity 2, relationship;".

The combination is a generalized sentence, and "entity 1, entity 2, relationship" is collectively called the result.

In the pattern "about + COMPANY + TYPE => 2, -1, -2;", "about + COMPANY + TYPE" is the combination. In the result, the first value, 2, is the position of the entity COMPANY; the second value, -1, indicates that the position is empty; and the third value represents the relationship. If the third value is positive, it has the same meaning as the first value and gives the position of an entity. If it is negative, it is a normalized relationship: -1 means the title meets the specification, and -2 means it does not.

(3) Calculate the score of each candidate pattern, and add the patterns whose score is greater than a threshold T1 to the pattern library T of type K.
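The pattern notation described under item (2) can be read mechanically. The sketch below parses one pattern string into its combination and result, writing the arrow as `=>`. The class `TitlePattern` and the function `parse_pattern` are hypothetical names introduced for illustration, and positions are assumed to be 1-based, which is consistent with COMPANY being at position 2 in the example.

```python
from dataclasses import dataclass


@dataclass
class TitlePattern:
    combination: list  # generalized sentence, e.g. ["about", "COMPANY", "TYPE"]
    entity1: int       # 1-based position of entity 1, or -1 if empty
    entity2: int       # 1-based position of entity 2, or -1 if empty
    relation: int      # positive: entity position; -1: meets spec; -2: does not

    def describe_relation(self) -> str:
        if self.relation > 0:
            return f"entity at position {self.relation}"
        labels = {-1: "meets specification", -2: "does not meet specification"}
        return labels.get(self.relation, "unknown")


def parse_pattern(text: str) -> TitlePattern:
    """Parse 'combination => entity 1, entity 2, relationship;'."""
    combination, result = text.split("=>")
    tokens = [token.strip() for token in combination.split("+")]
    entity1, entity2, relation = (
        int(value) for value in result.strip().rstrip(";").split(",")
    )
    return TitlePattern(tokens, entity1, entity2, relation)


pattern = parse_pattern("about + COMPANY + TYPE => 2, -1, -2;")
print(pattern.combination[pattern.entity1 - 1], pattern.describe_relation())
```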

Subsequent error detection matches against the pattern library according to document type: titles of the corresponding document type are identified by pattern matching, title scores are calculated, and titles whose score is below a threshold t2 are added to the error candidate set. The score score(T | K) of a title pattern T for document type K is calculated by a formula in terms of the following quantities:

N(T | K) is the total number of title instances of type K mined using pattern T, and N(T) is the total number of title instances of all types that use pattern T.
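The exact scoring formula is not reproduced in this text. As a rough sketch only, the code below assumes the simplest score consistent with the two counts defined above, the ratio N(T | K) / N(T); the patent's actual formula may differ, for example by including a weighting term. The function names and the sample data are hypothetical.

```python
from collections import Counter


def pattern_scores(instances: list, doc_type: str) -> dict:
    """instances: list of (pattern, document type) pairs, one per mined title.

    Assumed score: N(T | K) / N(T). This ratio is an assumption, not the
    formula from the patent.
    """
    n_t = Counter(pattern for pattern, _ in instances)
    n_t_given_k = Counter(pattern for pattern, k in instances if k == doc_type)
    return {pattern: n_t_given_k[pattern] / n_t[pattern] for pattern in n_t}


def select_patterns(instances: list, doc_type: str, threshold: float) -> list:
    """Keep patterns whose score exceeds the threshold (T1 in the text)."""
    scores = pattern_scores(instances, doc_type)
    return sorted(p for p, score in scores.items() if score > threshold)


# Made-up mining results: pattern "A" is mostly seen in notices, "B" is not.
mined = (
    [("A", "notice")] * 3 + [("A", "report")]
    + [("B", "notice")] + [("B", "report")] * 3
)
print(pattern_scores(mined, "notice"), select_patterns(mined, "notice", 0.5))
```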

Step 2: Automatic identification of the document type. Document type recognition is a text classification problem and mainly comprises the following stages: text data preparation, text preprocessing, text feature processing, model training, model evaluation, and model output. Using enterprise official documents of all formats from past years, a model is trained by machine learning to build a document type classifier that automatically identifies the document type of a document. The scheme is based on the HanLP open-source component and trains the classifier with the naive Bayes algorithm; it outputs a classification model, identifies the document type, and loads the corresponding language models, dictionaries, and auxiliary error detection models for that type.
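To make the classification step concrete, here is a minimal multinomial naive Bayes classifier with add-one smoothing, written in plain Python. It does not use HanLP; character bigrams stand in for the output of a real word segmenter, and the four training titles are invented for the example. All names here are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def features(text: str) -> list:
    """Character bigrams, a stand-in for segmented words."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class NaiveBayesClassifier:
    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, samples):
        for text, label in samples:
            self.class_counts[label] += 1
            for feat in features(text):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)

    def predict(self, text: str) -> str:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(n_docs / total_docs)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in features(text):
                score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented sample titles: two notices (通知) and two requests for instructions (请示).
classifier = NaiveBayesClassifier()
classifier.train([
    ("关于召开年度工作会议的通知", "notice"),
    ("关于调整办公时间的通知", "notice"),
    ("关于申请增加培训经费的请示", "request"),
    ("关于申请购置办公设备的请示", "request"),
])
print(classifier.predict("关于召开安全生产会议的通知"))
```

Once the type is predicted, the matching dictionaries and models for that type would be loaded, as the step describes.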

Step 3: Error detection. Error detection mainly determines whether a text contains errors that need correction. The method uses both rules and a deep learning model, and the main error detection steps are as follows:

1) Rule-based detection: the text is segmented by a Chinese word segmenter, errors are detected at both character granularity and word granularity, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions.

2) Deep learning model: a bidirectional character-level N-gram LM scores the characters in each sentence, and low-scoring positions are treated as positions to be corrected. Each position to be corrected is combined with its context for dictionary lookup; when none of the combinations can be found in the dictionary, the position is regarded as a wrong character and added to the error position candidate set.

3) On top of the traditional language model, document-writing norms are checked with word-, sentence-, and grammar-level analysis models. Within a sentence, words stand in particular combination relationships, and the sentence can be divided into components according to these relationships. The method judges whether the structure of an input word sequence (generally a sentence) conforms to a given grammar, analyzes the syntactic structure of sentences that conform, scores them, and adds structures scoring below a threshold to the specification error candidate set.

The pattern matching result view shows the pattern matching results for the currently selected file, including the entities, relations, and patterns that appear, the sentences involved, the weights, and the matched relation words. For example, a title may be flagged because two verbs appear in it.

4) A knowledge computing stage is added, which provides more accurate local knowledge for error correction along two dimensions, text association and text understanding, and addresses the problem of generalizing knowledge in low-frequency domains.

For associated knowledge, a large amount of accurate local knowledge related to the original title can be gathered by searching the standard corpus for the original erroneous title or by context pattern matching. This accurate local knowledge is then used to assist error correction ranking.

For text understanding, correcting errors with a purely statistical language model, without understanding what the sentence expresses, is clearly inappropriate. The generalization problem for low-frequency domain knowledge must be solved through a global understanding of the sentence content and an understanding of each of its components. Specifically, semantic analysis is performed on the text to obtain semantic features (for example with recurrent neural networks, RNNs), and an LSTM (long short-term memory) model is used. Compared with an ordinary RNN, the LSTM changes only the hidden layer, yet it expresses long- and short-term dependencies better; applied to the error correction ranking model, it yields better correction results.

5) The four steps above are combined to perform comprehensive error detection covering spelling errors, word errors, grammatical and syntax errors, errors against document-writing norms, and the company's operational control specifications, and to generate the final error candidate set.
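Item 2) of the detection steps can be illustrated with a small count-based model. The sketch below scores each character by the geometric mean of a forward bigram probability P(c | previous) and a backward bigram probability P(c | next), both with add-one smoothing, then checks low-scoring positions against a dictionary. It is a simplified stand-in under stated assumptions: a bigram count model rather than the patent's model, and a tiny invented corpus, dictionary, and threshold.

```python
from collections import Counter


def train_counts(corpus):
    """Count characters and adjacent character pairs, with ^ and $ as boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        padded = "^" + sentence + "$"
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def char_scores(sentence, unigrams, bigrams):
    """Geometric mean of forward and backward smoothed bigram probabilities."""
    padded = "^" + sentence + "$"
    vocab_size = len(unigrams)
    scores = []
    for i in range(1, len(padded) - 1):
        prev, cur, nxt = padded[i - 1], padded[i], padded[i + 1]
        p_forward = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        p_backward = (bigrams[(cur, nxt)] + 1) / (unigrams[nxt] + vocab_size)
        scores.append((p_forward * p_backward) ** 0.5)
    return scores


def detect_errors(sentence, unigrams, bigrams, dictionary, threshold, max_len=3):
    """Return indices of low-scoring characters that form no dictionary word."""
    candidates = []
    for i, score in enumerate(char_scores(sentence, unigrams, bigrams)):
        if score >= threshold:
            continue
        # All substrings of length 2..max_len that contain position i.
        spans = [
            sentence[a:b]
            for a in range(max(0, i - max_len + 1), i + 1)
            for b in range(i + 1, min(len(sentence), a + max_len) + 1)
            if b - a >= 2
        ]
        if not any(span in dictionary for span in spans):
            candidates.append(i)
    return candidates


corpus = ["召开工作会议", "召开安全会议", "工作会议纪要", "安全工作通知"]
dictionary = {"召开", "工作", "会议", "安全", "纪要", "通知"}
unigrams, bigrams = train_counts(corpus)
# "会义" is a typo for "会议" (meeting).
suspects = detect_errors("召开工作会义", unigrams, bigrams, dictionary, threshold=0.1)
print(suspects)
```

The threshold here is tuned to the toy corpus; in practice it would be calibrated on labeled data for each document type.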

Step 4: Candidate recall. Different recall strategies are used for different error types.

1) For error types such as phonetic confusion and shape confusion, a corresponding confusion set is generated, and replacements are selected using the confusion set, rules, word lists, or the language model.

2) For error types such as nonstandard wording, nonconformance with document-writing norms, and incorrect content attributes, pattern replacement is performed according to the rule base.

3) Candidate recalls are generated in combination with official document specifications and content detection, and error correction candidates are generated based on an HMM and a graph-theoretic method. The model's own scoring function allows pre-screening while candidates are being generated. The erroneous word is regarded as having been produced from the original word through a transition matrix, and the task is to find, for a given erroneous word (denoted S), the most probable original word (denoted T). By Bayes' formula, P(T | S) = P(S | T) P(T) / P(S). For a specific S, P(S) is constant; P(T) is the prior probability and P(S | T) is the transition probability, and both can be obtained by building a language model and a transition matrix (also called an error model) from a training corpus.

Two factors therefore determine an error correction candidate: the language model of the candidate T, and a conditional probability model, also called the error model. The main difference between methods lies in the error model. If only substitution errors are considered, it can be understood as a character error model after alignment.
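The Bayesian ranking in item 3) can be sketched as a noisy-channel scorer: candidates are generated from a confusion set by single-character substitution, and each candidate T is ranked by P(S | T) · P(T), since P(S) is the same for all candidates. The HMM and graph-based generation mentioned in the text are not modeled here. The confusion set and all probabilities below are made-up values for illustration, and the function names are hypothetical.

```python
def generate_candidates(word, confusion):
    """The observed word plus every single-character substitution from the confusion set."""
    candidates = [word]
    for i, char in enumerate(word):
        for alternative in confusion.get(char, []):
            candidates.append(word[:i] + alternative + word[i + 1:])
    return candidates


def rank_candidates(observed, candidates, prior, error_model):
    """Rank candidate originals T for an observed word S by P(S | T) * P(T)."""
    scored = []
    for candidate in candidates:
        p_t = prior.get(candidate, 0.0)                            # language model P(T)
        p_s_given_t = error_model.get((observed, candidate), 0.0)  # error model P(S | T)
        score = p_s_given_t * p_t
        if score > 0:
            scored.append((candidate, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Illustrative values only: 义 is confusable with 议 and 意.
confusion = {"义": ["议", "意"]}
prior = {"会议": 0.02, "会意": 0.001, "会义": 0.000001}
error_model = {
    ("会义", "会议"): 0.05,
    ("会义", "会意"): 0.05,
    ("会义", "会义"): 0.9,
}

candidates = generate_candidates("会义", confusion)
ranking = rank_candidates("会义", candidates, prior, error_model)
print(ranking[0][0])
```

In a full system the prior would come from the language model for the detected document type and the error model from aligned pairs of erroneous and corrected text.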

Step 5: Error correction evaluation and ranking. After error detection has located all suspected errors, all candidate corrections are selected and substituted in, and the candidates are ranked with the language model, in a manner similar to a translation model, to obtain the best correction. Evaluation is based on language models: sentences are scored by perplexity and mutual information and by language models such as the forward algorithm and character-level models, and a discriminative model is trained on multiple classes of statistical features. If no candidate sentence scores higher than the original sentence, or none exceeds the original score by more than a threshold, the original sentence is considered free of errors. Otherwise, the highest-scoring candidate sentence is output as the error correction result.

Step 6: Intelligent recommendation. On the basis of the error correction, and according to the error type and the results of word and syntax analysis, the system recommends standard expressions, classic sample documents, and related material libraries for the same document type as an auxiliary reference.

The above steps constitute the innovative method for error correction of electronic official documents. An electronic official document error correction system developed on the basis of this method can correct errors in official documents of various types, including notice, circular, reply, report, letter, meeting minutes, request for instructions, and the like. It provides guidance, error correction, and assistance in real time wherever the drafter is working, gives the drafter comprehensive help during drafting as far as possible, and strengthens quality management of official document content at the source.
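The acceptance rule at the end of Step 5 can be sketched as below: a candidate replaces the original only if it beats the original's score by more than a threshold. The scoring function is passed in as a parameter; here a lookup table of made-up log-probability-like scores stands in for the language model and discriminative model, and `choose_correction` and `margin` are hypothetical names.

```python
def choose_correction(original, candidates, score, margin):
    """Return the best candidate if it beats the original by more than margin."""
    best = max(candidates, key=score, default=original)
    if score(best) - score(original) > margin:
        return best
    return original  # the original sentence is considered free of errors


# Made-up sentence scores (higher is better).
toy_scores = {
    "召开工作会义": -12.0,  # original, containing the typo 会义
    "召开工作会议": -6.5,
    "召开工作会意": -11.0,
}

result = choose_correction(
    "召开工作会义",
    ["召开工作会议", "召开工作会意"],
    score=toy_scores.get,
    margin=2.0,
)
print(result)
```

Requiring a margin rather than any improvement trades some recall for fewer false corrections on sentences that were already correct.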

It should be noted that, although the above embodiments have been described herein, the invention is not limited thereto. Therefore, based on the innovative concepts of the present invention, the technical solutions of the present invention can be directly or indirectly applied to other related technical fields by making changes and modifications to the embodiments described herein, or by using equivalent structures or equivalent processes performed in the content of the present specification and the attached drawings, which are included in the scope of the present invention.

Claims

1. A method for correcting errors in official documents, comprising a document type detection step of training a document type recognition model by machine learning and classifying the document type as notice, circular, reply, report, letter, meeting minutes, or request for instructions;

2. The method for correcting errors in official documents according to claim 1, further comprising generating candidate recalls in combination with document-writing norms and content detection, and generating error correction candidates based on an HMM and a graph-theoretic method.

3. The method according to claim 1, wherein establishing the document type recognition model comprises the following steps,