CN115935964A - Method for correcting text content of bidding document - Google Patents

Method for correcting text content of bidding document

Info

Publication number
CN115935964A
CN115935964A (application CN202211525733.8A)
Authority
CN
China
Prior art keywords
error correction
model
text
word
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211525733.8A
Other languages
Chinese (zh)
Inventor
徐世阳
陈丽娟
杨德胜
张丽娟
向洪伟
敖翔
史春胜
巫俊洁
孟战
昝云飞
纪传俊
王光强
唐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Chongqing Tendering Co
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Original Assignee
State Grid Chongqing Tendering Co
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Chongqing Tendering Co, State Grid Corp of China SGCC, State Grid Chongqing Electric Power Co Ltd
Priority to CN202211525733.8A
Publication of CN115935964A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a method for correcting the text content of a bidding document, comprising the following steps. S1: selecting a number of bidding documents. S2: constructing a text content error correction model comprising a rule error correction module and a model error correction module. S3: performing error correction on the bidding documents using dictionary rules, an edit distance and an n-gram language model to obtain a preliminarily screened corrected text. S4: performing error correction on the preliminarily screened text with the model error correction module and outputting a bidding document carrying error-correction labels. The method can quickly locate and identify error-correction targets in bidding text and realizes automatic error correction.

Description

Method for correcting text content of bidding document
Technical Field
The invention relates to the field of computer machine learning, and in particular to a method for correcting errors in the text content of bidding documents.
Background
With the rapid development of information network technology, bidding in the power industry is gradually shifting from offline procedures to online electronic bidding. Whether it is the procurement announcements and bidding documents issued by the tendering party or the bid documents and quotation lists submitted by bidders, these materials are exchanged on various bidding platforms as electronic files such as word, excel and pdf. The content of such documents is extensive, covering procurement projects, quantities, budgets, construction periods, price limits, technical requirements and so on; the documents are generally long and dense, and since the electronic files are edited manually by professional staff, problems such as unclear descriptions, wrongly written characters, and inconsistent quantity units or time formats are inevitable. These errors affect the correct understanding of the documents by the parties involved and cause unnecessary clarifications, so electronic bidding documents to be uploaded and released usually have to be reviewed several times by staff, a process that is time-consuming, labor-intensive and hard to keep accurate. It is therefore necessary to develop an automatic text error correction system, which not only frees up manpower but also improves correction efficiency and accuracy compared with manual proofreading.
Disclosure of Invention
Aiming at the problems in the prior art, the technical problem to be solved by the invention is how to implement automatic error correction for the content of bidding documents.
To solve this problem, the invention adopts the following technical scheme. A method for correcting errors in the text content of a bidding document comprises the following steps:
s100: selecting a plurality of bidding documents;
s200: constructing a text content error correction model, wherein the text content error correction model comprises a rule error correction module and a model error correction module;
the rule error correction module comprises dictionary rules, an edit distance and an n-gram language model, where the dictionary rules consist of industry-specific professional terms and general grammar rules;
the model error correction module comprises a MacBERT4CSC model and a T5 model;
s300: performing error correction on the bidding documents using the dictionary rules, the edit distance and the n-gram language model to obtain a preliminarily screened corrected text;
s400: performing error correction on the preliminarily screened text with the model error correction module, passing it through the MacBERT4CSC model and the T5 model in sequence; the T5 model then outputs the bidding document with error-correction labels.
Preferably, in S300, performing preliminary error-correction screening on the bidding documents using the dictionary rules and the edit distance to obtain the preliminarily screened corrected text comprises the following steps:
s310: correcting errors in the bidding documents based on the dictionary rules and labeling the corrected content to obtain a mapping dictionary;
s320: performing error correction with the mapping dictionary based on the edit distance and labeling the corrected content to obtain a corrected text, wherein the edit distance is defined as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two strings, i denotes the first i characters of a, j denotes the first j characters of b, and lev_{a,b}(i, j) is the Levenshtein distance between the first i characters of a and the first j characters of b;
s330: performing error correction on the corrected text based on the n-gram language model to obtain the preliminarily screened corrected text.
Correcting errors with the combined means of dictionary rules, edit distance and an n-gram language model allows the document to be corrected more accurately. Error detection is carried out first: errors are detected at character granularity and word granularity with the help of Jieba word segmentation, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions. Error correction is then carried out: all suspected error positions are traversed, the words at those positions are replaced using similarity dictionaries, sentence perplexity is computed with the edit distance, the n-gram language model and so on, and the results of all candidates are compared and ranked to obtain the best corrected words, as the sketch after this paragraph illustrates.
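As an illustration only, the following Python sketch mirrors that detect-then-correct loop on a toy example; the confusion dictionary, the scoring stub and the sample sentence are hypothetical and are not part of the patented implementation, which would use the full dictionary rules, edit distance and n-gram model described above.

```python
import jieba  # word segmentation used for word-granularity error detection

# Hypothetical confusion dictionary: suspected wrong word -> candidate corrections.
CONFUSION = {"消陷": ["缺陷"], "期线": ["期限"]}
for w in CONFUSION:           # register the invented typos so segmentation keeps them intact
    jieba.add_word(w)

def sentence_score(sentence: str) -> float:
    """Stand-in for the edit-distance / n-gram perplexity ranking (lower = more fluent)."""
    return 0.0  # a real implementation queries an n-gram language model here

def rule_correct(sentence: str) -> str:
    out = []
    for word in jieba.cut(sentence):          # detect suspected errors at word granularity
        if word in CONFUSION:
            # Rank every candidate replacement by the resulting sentence score, keep the best.
            best = min(CONFUSION[word],
                       key=lambda c: sentence_score(sentence.replace(word, c)))
            out.append(best)
        else:
            out.append(word)
    return "".join(out)

print(rule_correct("投标文件存在质量消陷，应在规定期线内澄清。"))
```

The character-granularity pass and the perplexity-based ranking are folded into the placeholder scorer here; the detailed embodiment below spells both out.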
Preferably, the specific steps of performing error correction on the corrected text based on the n-gram language model in S330 are as follows:
s331: setting the parameter value of the sliding window algorithm to n and setting a score threshold;
s332: scoring each word in the corrected text with n-gram probabilities using the sliding window algorithm and the n-gram language model, and applying a mean-absolute-deviation calculation to the scores to obtain a deviation score for each word;
s333: comparing each word's deviation score with the score threshold, marking words whose score exceeds the threshold as suspected errors, and replacing the suspected errors with homophone and similar-glyph candidates from the dictionary rules to obtain an error-correction candidate set;
s334: substituting the t-th replacement word from the candidate set into its original position in the corrected text, i.e. the position the word occupied in the sentence before replacement, and computing a sentence fluency score for the substituted sentence using the PPL perplexity, calculated as:
$$PP(S)=P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
where PP(·) denotes the perplexity, S denotes the sentence, N denotes the sentence length, P(w_i) denotes the probability of the i-th word, and w_i denotes the i-th word of the sentence S;
s335: setting a perplexity threshold; when the computed sentence fluency score is below the perplexity threshold, marking the original word corresponding to the t-th replacement word as an error-correction word, and otherwise treating the original word as correctly used;
s336: traversing all words in the error-correction candidate set to obtain the preliminarily screened corrected text.
Preferably, the bidding documents selected in S100 are in word format.
The technical scheme can support various text formats; word is currently a widely used format, so adapting to it is simpler, it is convenient for clients to handle, and no format conversion is needed, which would otherwise waste resources.
Compared with the prior art, the invention has at least the following advantages:
1. The method combines rule-based error correction and model-based error correction to process bidding documents. The dictionary rules in the rule-based stage can be tailored to different application fields, while the edit distance and the n-gram language model are general and effective for correcting both alphabetic languages and Chinese. For model-based correction, the MacBERT4CSC and T5 models are used: MacBERT4CSC adds an error detection and correction network and is adapted to the Chinese spelling correction task, giving better results, while the T5 model adopts a span-replacement method that merges adjacent [M] mask tokens into one special symbol, so that each masked span is replaced by a single special token, improving computational efficiency. Combining these two complementary approaches identifies error-correction content more accurately.
2. The error correction system built on this method adopts a B/S (browser/server) architecture and provides a relatively friendly user interaction function.
3. The method combines rule-based error correction with pre-trained model error correction, which not only corrects the content of bidding documents but also further improves the computational efficiency of model-based correction and the accuracy of the corrections.
4. The error correction system implementing this method supports online modification after file parsing; users can work online while the positions of text errors identified by the model are located quickly, reducing the users' workload.
5. The purpose of the error correction is clear, the supporting error correction system is simple and convenient to operate, and the error-correction target can be achieved without requiring users to provide large batches of training data, making the method simple, fast and easy to use.
6. The scheme places loose requirements on text content and format, and the format and content of conventional bidding documents can be recognized.
Drawings
FIG. 1 is a schematic diagram of an edit distance algorithm according to the present invention.
Detailed Description
The present invention is described in further detail below.
The invention provides a method for correcting the text content of a bidding document. The method is an error-correction solution based on the fusion of rules and multiple models. First, based on rules, the text content is detected and corrected at character and word granularity, and a language model is then used to compute sentence perplexity for further correction; after the rule-based correction is finished, the text content is corrected and checked again with several deep learning algorithms, so as to achieve the best overall error-correction effect.
Referring to FIG. 1, a method for correcting errors in the text content of a bidding document comprises the following steps:
s100: selecting a plurality of bidding documents;
the bid-up document selected in the S100 is in a word format.
S200: constructing a text content error correction model, wherein the text content error correction model comprises a rule error correction module and a model error correction module;
the rule error correction module comprises a dictionary rule, an editing distance and an n-element language model; wherein the dictionary rules are professional words and general grammar rules in the industry; the dictionary rules are formulated according to industry experience, and are mainly formulated by performing label removal and word segmentation on data such as daily newspaper corpora of people and the like to obtain high-quality corpora data, establishing a confusion word set, a Chinese dictionary, an approximate pronunciation dictionary and a shape and near word dictionary, and finally fusing the rules in the word dictionaries; the editing distance is also called as Levenshtein distance, and the editing distance and the n-element language model are both the prior art;
the model error correction module comprises a MacBERT4CSC model and a T5 model; the MacBERT4CSC model and the T5 model are the existing error correction technology;
s300: carrying out error correction processing on the bidding document by utilizing the dictionary rule, the editing distance and the n-element language model to obtain an error correction primary screening text;
in the S300, the step of performing error correction preliminary screening on the bid-marking document by using the dictionary rule and the editing distance to obtain an error correction preliminary screening text specifically includes the following steps:
s310: correcting errors of the bidding documents based on the dictionary rules, and labeling error correction contents to obtain a mapping dictionary; in a common bidding document, pinyin spelling or related near sound errors of some common nouns or professional terms exist, so that error correction operation of the common nouns can be completed preliminarily by establishing mapping dictionaries of related pinyins and entity nouns.
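A minimal sketch of such a mapping dictionary is shown below; the entries are invented examples of pinyin and near-sound mistakes, not the dictionary actually compiled for the invention.

```python
# Hypothetical mapping dictionary: wrong spelling (pinyin or near-sound form) -> entity noun.
PINYIN_MAP = {
    "zhaobiao wenjian": "招标文件",   # term typed as raw pinyin
    "评标委员汇": "评标委员会",         # near-homophone error (汇 / 会 share the pinyin hui)
    "中标侯选人": "中标候选人",         # similar-glyph error (侯 / 候)
}

def dictionary_correct(text: str):
    """Apply the mapping dictionary and label every correction with its position (S310)."""
    labels = []
    for wrong, right in PINYIN_MAP.items():
        pos = text.find(wrong)
        while pos != -1:
            labels.append({"wrong": wrong, "right": right, "offset": pos})
            text = text[:pos] + right + text[pos + len(wrong):]
            pos = text.find(wrong, pos + len(right))
    return text, labels

corrected, labels = dictionary_correct("请评标委员汇审核中标侯选人名单。")
```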
S320: performing error correction with the mapping dictionary based on the edit distance and labeling the corrected content to obtain a corrected text, wherein the edit distance is defined as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two strings, i denotes the first i characters of a, j denotes the first j characters of b, and lev_{a,b}(i, j) is the Levenshtein distance between the first i characters of a and the first j characters of b;
$\mathbf{1}_{(a_i\neq b_j)}$ is an indicator function whose value is 1 when $a_i \neq b_j$ and 0 otherwise. Of the three terms inside the min operation, the first corresponds to deleting a character (from a, to reach b), the second to inserting a character, and the third to a substitution (whose cost depends on whether the current characters are the same). A minimal implementation of this recurrence is sketched below.
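A minimal dynamic-programming sketch of the recurrence above follows; the function name and the test strings are illustrative choices, not taken from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Compute lev_{a,b}(len(a), len(b)) by filling the dp table of the recurrence above."""
    m, n = len(a), len(b)
    # dp[i][j] holds the distance between the first i characters of a and the first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                   # i deletions
    for j in range(n + 1):
        dp[0][j] = j                                   # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1    # indicator 1_(a_i != b_j)
            dp[i][j] = min(dp[i - 1][j] + 1,           # delete a character from a
                           dp[i][j - 1] + 1,           # insert a character
                           dp[i - 1][j - 1] + cost)    # substitute (or keep) a character
    return dp[m][n]

assert levenshtein("投标文件", "投标文挡") == 1   # one substitution
```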
s330: performing error correction on the corrected text based on the n-gram language model to obtain the preliminarily screened corrected text.
The specific steps of performing error correction on the corrected text based on the n-gram language model in S330 are as follows:
s331: setting the parameter value of the sliding window algorithm to n and setting a score threshold; the sliding window algorithm is an existing technique;
s332: scoring each word in the corrected text with n-gram probabilities using the sliding window algorithm and the n-gram language model, and applying a mean-absolute-deviation calculation to the scores to obtain a deviation score for each word;
s333: comparing each word's deviation score with the score threshold, marking words whose score exceeds the threshold as suspected errors, and replacing the suspected errors with homophone and similar-glyph candidates from the dictionary rules to obtain an error-correction candidate set;
s334: substituting the t-th replacement word from the candidate set into its original position in the corrected text, i.e. the position the word occupied in the sentence before replacement, and computing a sentence fluency score for the substituted sentence using the PPL perplexity, calculated as:
$$PP(S)=P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
where PP(·) denotes the perplexity, S denotes the sentence, N denotes the sentence length, P(w_i) denotes the probability of the i-th word, and w_i denotes the i-th word of the sentence S;
s335: setting a perplexity threshold; when the computed sentence fluency score is below the perplexity threshold, marking the original word corresponding to the t-th replacement word as an error-correction word, and otherwise treating the original word as correctly used;
s336: traversing all words in the error-correction candidate set to obtain the preliminarily screened corrected text; steps S331 to S336 are illustrated by the sketch below.
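A compact end-to-end sketch of steps S331 to S336 is given below on a toy character-bigram model; the reference corpus, the confusion table and both thresholds are placeholder assumptions, and a production system would use a full n-gram language model trained on bidding corpora.

```python
import math
from collections import Counter

# Toy character-bigram language model built from a tiny placeholder corpus (assumption).
CORPUS = ["投标人应当按照招标文件的要求编制投标文件", "招标文件应当载明评标标准和评标方法"]
bigrams = Counter(pair for s in CORPUS for pair in zip(s, s[1:]))
unigrams = Counter(c for s in CORPUS for c in s)
V = len(unigrams)

def prob(pair):
    """Add-one smoothed P(second char | first char)."""
    return (bigrams[pair] + 1) / (unigrams[pair[0]] + V)

def deviation_scores(sent):
    """S331-S332: slide a window of size n=2, score each character by the mean negative
    log-probability of the bigrams covering it, then take the absolute deviation from the mean."""
    logp = [-math.log(prob((sent[i], sent[i + 1]))) for i in range(len(sent) - 1)]
    per_char = []
    for i in range(len(sent)):
        window = logp[max(0, i - 1):i + 1]          # the bigrams that contain character i
        per_char.append(sum(window) / len(window) if window else 0.0)
    mean = sum(per_char) / len(per_char)
    return [abs(s - mean) for s in per_char]

def ppl(sent):
    """S334: bigram perplexity PP(S) of the whole sentence."""
    logp = sum(math.log(prob((sent[i], sent[i + 1]))) for i in range(len(sent) - 1))
    return math.exp(-logp / max(1, len(sent) - 1))

CONFUSION = {"栏": ["标"]}          # hypothetical homophone / similar-glyph candidates (S333)
SCORE_T, PPL_T = 0.3, 500.0        # hypothetical thresholds for S331 and S335

def ngram_correct(sent):
    scores = deviation_scores(sent)
    for i, ch in enumerate(sent):
        if scores[i] > SCORE_T and ch in CONFUSION:                  # S333: suspected error
            for cand in CONFUSION[ch]:                               # S334: try each candidate
                repl = sent[:i] + cand + sent[i + 1:]
                if ppl(repl) < PPL_T and ppl(repl) < ppl(sent):      # S335: accept if more fluent
                    sent = repl
    return sent                                                      # S336: all candidates traversed

print(ngram_correct("投栏人应当按照招标文件要求编制投栏文件"))
```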
S400: performing error correction on the preliminarily screened text with the model error correction module, passing it through the MacBERT4CSC model and the T5 model in sequence; the T5 model then outputs the bidding document with error-correction labels.
Advantages of the MacBERT4CSC error correction model: any BERT model can be chosen for the network structure of MacBERT4CSC; its main characteristic is a different MLM task design during pre-training. Whole word masking (wwm) and an N-gram masking strategy are used to select candidate tokens for masking. BERT-like models typically mask the original word with [MASK], whereas MacBERT4CSC uses a third-party synonym tool to generate a similar word for the target word and masks it with that word; in particular, when the original word has no similar word, a random n-gram is used for masking. The usage is similar to the BERT pre-training model, but the MacBERT4CSC model requires no additional training and is better suited to Chinese error-correction input: 80% of the masked words are replaced with similar words (in place of [MASK]), 10% with random words, and 10% are left unchanged. This network structure is convenient for error-correction application scenarios.
Advantages of the T5 model: it is an improved classical Transformer model with both an encoder and a decoder, and is more robust at processing a given input and producing the corresponding output.
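For reference, one plausible way to chain two publicly released pre-trained correctors with the Hugging Face transformers library is sketched below; the checkpoint names and the crude length alignment are assumptions of this illustration, not the weights or post-processing actually used by the invention.

```python
import torch
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          T5Tokenizer, T5ForConditionalGeneration)

MACBERT = "shibing624/macbert4csc-base-chinese"           # assumed public CSC checkpoint
T5_CSC = "shibing624/mengzi-t5-base-chinese-correction"   # assumed public T5 correction checkpoint

mac_tok = BertTokenizerFast.from_pretrained(MACBERT)
mac_model = BertForMaskedLM.from_pretrained(MACBERT).eval()
t5_tok = T5Tokenizer.from_pretrained(T5_CSC)
t5_model = T5ForConditionalGeneration.from_pretrained(T5_CSC).eval()

def macbert_stage(text: str) -> str:
    """Character-level spelling correction: take the argmax token at every input position."""
    inputs = mac_tok(text, return_tensors="pt")
    with torch.no_grad():
        ids = mac_model(**inputs).logits.argmax(dim=-1)[0]
    out = mac_tok.decode(ids, skip_special_tokens=True).replace(" ", "")  # Chinese text only
    return out[:len(text)] if out else text            # crude alignment with the input length

def t5_stage(text: str) -> str:
    """Sentence-level rewrite: the seq2seq T5 model generates the corrected sentence."""
    inputs = t5_tok(text, return_tensors="pt")
    with torch.no_grad():
        out_ids = t5_model.generate(**inputs, max_length=inputs["input_ids"].shape[1] + 8)
    return t5_tok.decode(out_ids[0], skip_special_tokens=True)

def model_correct(text: str) -> str:
    """S400: run the rule-screened text through MacBERT4CSC first, then through T5."""
    return t5_stage(macbert_stage(text))
```

In a fuller pipeline, the MacBERT4CSC stage would only touch characters whose predicted token differs from the input, and the T5 output would be diffed against its input to produce the error-correction labels; both refinements are omitted here for brevity.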
With this technical scheme, a corresponding operating system can be built. The system fuses several algorithms and rules, and its accuracy can basically meet the requirements. The specific operating procedure is as follows: a user logs into the operating system in a browser, uploads the bidding document to be checked, and clicks the check button, after which the entire checking process is completed automatically; once the model finishes checking, the positions of content errors, i.e. the error-correction content, are shown with highlighting, and the user clicks download to obtain the checked file; the system also supports online editing, so the user can further check and edit according to the model's results.
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (4)

1. A method for correcting errors in the text content of a bidding document, characterized by comprising the following steps:
s100: selecting a plurality of bidding documents;
s200: constructing a text content error correction model, wherein the text content error correction model comprises a rule error correction module and a model error correction module;
the rule error correction module comprises dictionary rules, an edit distance and an n-gram language model, wherein the dictionary rules consist of industry-specific professional terms and general grammar rules;
the model error correction module comprises a MacBERT4CSC model and a T5 model;
s300: performing error correction on the bidding documents using the dictionary rules, the edit distance and the n-gram language model to obtain a preliminarily screened corrected text;
s400: performing error correction on the preliminarily screened text with the model error correction module, passing it through the MacBERT4CSC model and the T5 model in sequence, the T5 model then outputting the bidding document with error-correction labels.
2. The method for correcting errors in the text content of a bidding document according to claim 1, characterized in that in S300, performing preliminary error-correction screening on the bidding documents using the dictionary rules and the edit distance to obtain the preliminarily screened corrected text comprises the following steps:
s310: correcting errors in the bidding documents based on the dictionary rules and labeling the corrected content to obtain a mapping dictionary;
s320: performing error correction with the mapping dictionary based on the edit distance and labeling the corrected content to obtain a corrected text, wherein the edit distance is defined as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two strings, i denotes the first i characters of a, j denotes the first j characters of b, and lev_{a,b}(i, j) is the Levenshtein distance between the first i characters of a and the first j characters of b;
s330: performing error correction on the corrected text based on the n-gram language model to obtain the preliminarily screened corrected text.
3. The method for correcting errors in the text content of a bidding document according to claim 2, characterized in that the specific steps of performing error correction on the corrected text based on the n-gram language model in S330 are as follows:
s331: setting the parameter value of the sliding window algorithm to n and setting a score threshold;
s332: scoring each word in the corrected text with n-gram probabilities using the sliding window algorithm and the n-gram language model, and applying a mean-absolute-deviation calculation to the scores to obtain a deviation score for each word;
s333: comparing each word's deviation score with the score threshold, marking words whose score exceeds the threshold as suspected errors, and replacing the suspected errors with homophone and similar-glyph candidates from the dictionary rules to obtain an error-correction candidate set;
s334: substituting the t-th replacement word from the candidate set into its original position in the corrected text, i.e. the position the word occupied in the sentence before replacement, and computing a sentence fluency score for the substituted sentence using the PPL perplexity, calculated as:
$$PP(S)=P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
where PP(·) denotes the perplexity, S denotes the sentence, N denotes the sentence length, P(w_i) denotes the probability of the i-th word, and w_i denotes the i-th word of the sentence S;
s335: setting a perplexity threshold; when the computed sentence fluency score is below the perplexity threshold, marking the original word corresponding to the t-th replacement word as an error-correction word, and otherwise treating the original word as correctly used;
s336: traversing all words in the error-correction candidate set to obtain the preliminarily screened corrected text.
4. The method for correcting errors in the text content of a bidding document according to claim 3, characterized in that the bidding documents selected in S100 are in word format.
CN202211525733.8A, priority and filing date 2022-11-30: Method for correcting text content of bidding document. Pending. Published as CN115935964A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211525733.8A CN115935964A (en) 2022-11-30 2022-11-30 Method for correcting text content of bidding document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211525733.8A CN115935964A (en) 2022-11-30 2022-11-30 Method for correcting text content of bidding document

Publications (1)

Publication Number Publication Date
CN115935964A true CN115935964A (en) 2023-04-07

Family

ID=86652009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211525733.8A Pending CN115935964A (en) 2022-11-30 2022-11-30 Method for correcting text content of bidding document

Country Status (1)

Country Link
CN (1) CN115935964A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257602A (en) * 2023-05-16 2023-06-13 北京拓普丰联信息科技股份有限公司 Method and device for constructing universal word stock based on public words and electronic equipment
CN116257602B (en) * 2023-05-16 2023-07-07 北京拓普丰联信息科技股份有限公司 Method and device for constructing universal word stock based on public words and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination