CN115935964A - Method for correcting text content of bidding document - Google Patents

Method for correcting text content of bidding document

Info

Publication number
CN115935964A
CN115935964A (application CN202211525733.8A)
Authority
CN
China
Prior art keywords
error correction
model
text
word
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211525733.8A
Other languages
Chinese (zh)
Inventor
徐世阳
陈丽娟
杨德胜
张丽娟
向洪伟
敖翔
史春胜
巫俊洁
孟战
昝云飞
纪传俊
王光强
唐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Chongqing Tendering Co
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Original Assignee
State Grid Chongqing Tendering Co
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Chongqing Tendering Co, State Grid Corp of China SGCC, State Grid Chongqing Electric Power Co Ltd
Priority to CN202211525733.8A
Publication of CN115935964A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a method for correcting the text content of a bidding document, comprising the following steps. S1: selecting a number of bidding documents. S2: constructing a text content error correction model comprising a rule error correction module and a model error correction module. S3: performing error correction on the bidding documents using dictionary rules, an edit distance and an n-gram language model to obtain a preliminarily screened corrected text. S4: performing error correction on the preliminarily screened text with the model error correction module and outputting a bidding document carrying error-correction labels. The method can quickly locate and identify error-correction targets in bidding text and realizes automatic error correction.

Description

Method for correcting text content of bidding document
Technical Field
The invention relates to the field of computer machine learning, and in particular to a method for correcting errors in the text content of bidding documents.
Background
With the rapid development of information network technology, bidding in the power industry is gradually shifting from offline procedures to online electronic bidding. Whether it is the procurement announcements and bidding documents issued by the tendering party or the bid documents and quotation lists submitted by bidders, these materials are exchanged on various bidding platforms as electronic files such as word, excel and pdf. The content of such documents is extensive, covering procurement projects, quantities, budgets, construction periods, price limits, technical requirements and so on; the documents are generally long and dense, and since the electronic files are edited manually by professional staff, problems such as unclear descriptions, wrongly written characters, and inconsistent quantity units or time formats are inevitable. These errors affect the correct understanding of the documents by the parties involved and cause unnecessary clarifications, so electronic bidding documents to be uploaded and released usually have to be reviewed several times by staff, a process that is time-consuming, labor-intensive and hard to keep accurate. It is therefore necessary to develop an automatic text error correction system, which not only frees up manpower but also improves correction efficiency and accuracy compared with manual proofreading.
Disclosure of Invention
Aiming at the problems in the prior art, the technical problem to be solved by the invention is how to implement automatic error correction for the content of bidding documents.
To solve this problem, the invention adopts the following technical scheme. A method for correcting errors in the text content of a bidding document comprises the following steps:
s100: selecting a plurality of bidding documents;
s200: constructing a text content error correction model, wherein the text content error correction model comprises a rule error correction module and a model error correction module;
the rule error correction module comprises dictionary rules, an edit distance and an n-gram language model, where the dictionary rules consist of industry-specific professional terms and general grammar rules;
the model error correction module comprises a MacBERT4CSC model and a T5 model;
s300: performing error correction on the bidding documents using the dictionary rules, the edit distance and the n-gram language model to obtain a preliminarily screened corrected text;
s400: performing error correction on the preliminarily screened text with the model error correction module, passing it through the MacBERT4CSC model and the T5 model in sequence; the T5 model then outputs the bidding document with error-correction labels.
Preferably, in S300, performing preliminary error-correction screening on the bidding documents using the dictionary rules and the edit distance to obtain the preliminarily screened corrected text comprises the following steps:
s310: correcting errors in the bidding documents based on the dictionary rules and labeling the corrected content to obtain a mapping dictionary;
s320: performing error correction with the mapping dictionary based on the edit distance and labeling the corrected content to obtain a corrected text, wherein the edit distance is defined as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two strings, i denotes the first i characters of a, j denotes the first j characters of b, and lev_{a,b}(i, j) is the Levenshtein distance between the first i characters of a and the first j characters of b;
s330: performing error correction on the corrected text based on the n-gram language model to obtain the preliminarily screened corrected text.
Correcting errors with the combined means of dictionary rules, edit distance and an n-gram language model allows the document to be corrected more accurately. Error detection is carried out first: errors are detected at character granularity and word granularity with the help of Jieba word segmentation, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions. Error correction is then carried out: all suspected error positions are traversed, the words at those positions are replaced using similarity dictionaries, sentence perplexity is computed with the edit distance, the n-gram language model and so on, and the results of all candidates are compared and ranked to obtain the best corrected words, as the sketch after this paragraph illustrates.
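As an illustration only, the following Python sketch mirrors that detect-then-correct loop on a toy example; the confusion dictionary, the scoring stub and the sample sentence are hypothetical and are not part of the patented implementation, which would use the full dictionary rules, edit distance and n-gram model described above.

```python
import jieba  # word segmentation used for word-granularity error detection

# Hypothetical confusion dictionary: suspected wrong word -> candidate corrections.
CONFUSION = {"消陷": ["缺陷"], "期线": ["期限"]}
for w in CONFUSION:           # register the invented typos so segmentation keeps them intact
    jieba.add_word(w)

def sentence_score(sentence: str) -> float:
    """Stand-in for the edit-distance / n-gram perplexity ranking (lower = more fluent)."""
    return 0.0  # a real implementation queries an n-gram language model here

def rule_correct(sentence: str) -> str:
    out = []
    for word in jieba.cut(sentence):          # detect suspected errors at word granularity
        if word in CONFUSION:
            # Rank every candidate replacement by the resulting sentence score, keep the best.
            best = min(CONFUSION[word],
                       key=lambda c: sentence_score(sentence.replace(word, c)))
            out.append(best)
        else:
            out.append(word)
    return "".join(out)

print(rule_correct("投标文件存在质量消陷，应在规定期线内澄清。"))
```

The character-granularity pass and the perplexity-based ranking are folded into the placeholder scorer here; the detailed embodiment below spells both out.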
Preferably, the specific steps of performing error correction on the corrected text based on the n-gram language model in S330 are as follows:
s331: setting the parameter value of the sliding window algorithm to n and setting a score threshold;
s332: scoring each word in the corrected text with n-gram probabilities using the sliding window algorithm and the n-gram language model, and applying a mean-absolute-deviation calculation to the scores to obtain a deviation score for each word;
s333: comparing each word's deviation score with the score threshold, marking words whose score exceeds the threshold as suspected errors, and replacing the suspected errors with homophone and similar-glyph candidates from the dictionary rules to obtain an error-correction candidate set;
s334: substituting the t-th replacement word from the candidate set into its original position in the corrected text, i.e. the position the word occupied in the sentence before replacement, and computing a sentence fluency score for the substituted sentence using the PPL perplexity, calculated as:
$$PP(S)=P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
where PP(·) denotes the perplexity, S denotes the sentence, N denotes the sentence length, P(w_i) denotes the probability of the i-th word, and w_i denotes the i-th word of the sentence S;
s335: setting a perplexity threshold; when the computed sentence fluency score is below the perplexity threshold, marking the original word corresponding to the t-th replacement word as an error-correction word, and otherwise treating the original word as correctly used;
s336: traversing all words in the error-correction candidate set to obtain the preliminarily screened corrected text.
Preferably, the bidding documents selected in S100 are in word format.
The technical scheme can support various text formats; word is currently a widely used format, so adapting to it is simpler, it is convenient for clients to handle, and no format conversion is needed, which would otherwise waste resources.
Compared with the prior art, the invention has at least the following advantages:
1. The method combines rule-based error correction and model-based error correction to process bidding documents. The dictionary rules in the rule-based stage can be tailored to different application fields, while the edit distance and the n-gram language model are general and effective for correcting both alphabetic languages and Chinese. For model-based correction, the MacBERT4CSC and T5 models are used: MacBERT4CSC adds an error detection and correction network and is adapted to the Chinese spelling correction task, giving better results, while the T5 model adopts a span-replacement method that merges adjacent [M] mask tokens into one special symbol, so that each masked span is replaced by a single special token, improving computational efficiency. Combining these two complementary approaches identifies error-correction content more accurately.
2. The error correction system built on this method adopts a B/S (browser/server) architecture and provides a relatively friendly user interaction function.
3. The method combines rule-based error correction with pre-trained model error correction, which not only corrects the content of bidding documents but also further improves the computational efficiency of model-based correction and the accuracy of the corrections.
4. The error correction system implementing this method supports online modification after file parsing; users can work online while the positions of text errors identified by the model are located quickly, reducing the users' workload.
5. The purpose of the error correction is clear, the supporting error correction system is simple and convenient to operate, and the error-correction target can be achieved without requiring users to provide large batches of training data, making the method simple, fast and easy to use.
6. The scheme places loose requirements on text content and format, and the format and content of conventional bidding documents can be recognized.
Drawings
FIG. 1 is a schematic diagram of an edit distance algorithm according to the present invention.
Detailed Description
The present invention is described in further detail below.
The invention provides a method for correcting the text content of a bidding document. The method is an error-correction solution based on the fusion of rules and multiple models. First, based on rules, the text content is detected and corrected at character and word granularity, and a language model is then used to compute sentence perplexity for further correction; after the rule-based correction is finished, the text content is corrected and checked again with several deep learning algorithms, so as to achieve the best overall error-correction effect.
Referring to FIG. 1, a method for correcting errors in the text content of a bidding document comprises the following steps:
s100: selecting a plurality of bidding documents;
the bid-up document selected in the S100 is in a word format.
S200: constructing a text content error correction model, wherein the text content error correction model comprises a rule error correction module and a model error correction module;
the rule error correction module comprises a dictionary rule, an editing distance and an n-element language model; wherein the dictionary rules are professional words and general grammar rules in the industry; the dictionary rules are formulated according to industry experience, and are mainly formulated by performing label removal and word segmentation on data such as daily newspaper corpora of people and the like to obtain high-quality corpora data, establishing a confusion word set, a Chinese dictionary, an approximate pronunciation dictionary and a shape and near word dictionary, and finally fusing the rules in the word dictionaries; the editing distance is also called as Levenshtein distance, and the editing distance and the n-element language model are both the prior art;
the model error correction module comprises a MacBERT4CSC model and a T5 model; the MacBERT4CSC model and the T5 model are the existing error correction technology;
s300: carrying out error correction processing on the bidding document by utilizing the dictionary rule, the editing distance and the n-element language model to obtain an error correction primary screening text;
in the S300, the step of performing error correction preliminary screening on the bid-marking document by using the dictionary rule and the editing distance to obtain an error correction preliminary screening text specifically includes the following steps:
s310: correcting errors of the bidding documents based on the dictionary rules, and labeling error correction contents to obtain a mapping dictionary; in a common bidding document, pinyin spelling or related near sound errors of some common nouns or professional terms exist, so that error correction operation of the common nouns can be completed preliminarily by establishing mapping dictionaries of related pinyins and entity nouns.
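A minimal sketch of such a mapping dictionary is shown below; the entries are invented examples of pinyin and near-sound mistakes, not the dictionary actually compiled for the invention.

```python
# Hypothetical mapping dictionary: wrong spelling (pinyin or near-sound form) -> entity noun.
PINYIN_MAP = {
    "zhaobiao wenjian": "招标文件",   # term typed as raw pinyin
    "评标委员汇": "评标委员会",         # near-homophone error (汇 / 会 share the pinyin hui)
    "中标侯选人": "中标候选人",         # similar-glyph error (侯 / 候)
}

def dictionary_correct(text: str):
    """Apply the mapping dictionary and label every correction with its position (S310)."""
    labels = []
    for wrong, right in PINYIN_MAP.items():
        pos = text.find(wrong)
        while pos != -1:
            labels.append({"wrong": wrong, "right": right, "offset": pos})
            text = text[:pos] + right + text[pos + len(wrong):]
            pos = text.find(wrong, pos + len(right))
    return text, labels

corrected, labels = dictionary_correct("请评标委员汇审核中标侯选人名单。")
```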
S320: performing error correction with the mapping dictionary based on the edit distance and labeling the corrected content to obtain a corrected text, wherein the edit distance is defined as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two strings, i denotes the first i characters of a, j denotes the first j characters of b, and lev_{a,b}(i, j) is the Levenshtein distance between the first i characters of a and the first j characters of b;
$\mathbf{1}_{(a_i\neq b_j)}$ is an indicator function whose value is 1 when $a_i \neq b_j$ and 0 otherwise. Of the three terms inside the min operation, the first corresponds to deleting a character (from a, to reach b), the second to inserting a character, and the third to a substitution (whose cost depends on whether the current characters are the same). A minimal implementation of this recurrence is sketched below.
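A minimal dynamic-programming sketch of the recurrence above follows; the function name and the test strings are illustrative choices, not taken from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Compute lev_{a,b}(len(a), len(b)) by filling the dp table of the recurrence above."""
    m, n = len(a), len(b)
    # dp[i][j] holds the distance between the first i characters of a and the first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                   # i deletions
    for j in range(n + 1):
        dp[0][j] = j                                   # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1    # indicator 1_(a_i != b_j)
            dp[i][j] = min(dp[i - 1][j] + 1,           # delete a character from a
                           dp[i][j - 1] + 1,           # insert a character
                           dp[i - 1][j - 1] + cost)    # substitute (or keep) a character
    return dp[m][n]

assert levenshtein("投标文件", "投标文挡") == 1   # one substitution
```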
s330: performing error correction on the corrected text based on the n-gram language model to obtain the preliminarily screened corrected text.
The specific steps of performing error correction on the corrected text based on the n-gram language model in S330 are as follows:
s331: setting the parameter value of the sliding window algorithm to n and setting a score threshold; the sliding window algorithm is an existing technique;
s332: scoring each word in the corrected text with n-gram probabilities using the sliding window algorithm and the n-gram language model, and applying a mean-absolute-deviation calculation to the scores to obtain a deviation score for each word;
s333: comparing each word's deviation score with the score threshold, marking words whose score exceeds the threshold as suspected errors, and replacing the suspected errors with homophone and similar-glyph candidates from the dictionary rules to obtain an error-correction candidate set;
s334: substituting the t-th replacement word from the candidate set into its original position in the corrected text, i.e. the position the word occupied in the sentence before replacement, and computing a sentence fluency score for the substituted sentence using the PPL perplexity, calculated as:
$$PP(S)=P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
where PP(·) denotes the perplexity, S denotes the sentence, N denotes the sentence length, P(w_i) denotes the probability of the i-th word, and w_i denotes the i-th word of the sentence S;
s335: setting a perplexity threshold; when the computed sentence fluency score is below the perplexity threshold, marking the original word corresponding to the t-th replacement word as an error-correction word, and otherwise treating the original word as correctly used;
s336: traversing all words in the error-correction candidate set to obtain the preliminarily screened corrected text; steps S331 to S336 are illustrated by the sketch below.
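A compact end-to-end sketch of steps S331 to S336 is given below on a toy character-bigram model; the reference corpus, the confusion table and both thresholds are placeholder assumptions, and a production system would use a full n-gram language model trained on bidding corpora.

```python
import math
from collections import Counter

# Toy character-bigram language model built from a tiny placeholder corpus (assumption).
CORPUS = ["投标人应当按照招标文件的要求编制投标文件", "招标文件应当载明评标标准和评标方法"]
bigrams = Counter(pair for s in CORPUS for pair in zip(s, s[1:]))
unigrams = Counter(c for s in CORPUS for c in s)
V = len(unigrams)

def prob(pair):
    """Add-one smoothed P(second char | first char)."""
    return (bigrams[pair] + 1) / (unigrams[pair[0]] + V)

def deviation_scores(sent):
    """S331-S332: slide a window of size n=2, score each character by the mean negative
    log-probability of the bigrams covering it, then take the absolute deviation from the mean."""
    logp = [-math.log(prob((sent[i], sent[i + 1]))) for i in range(len(sent) - 1)]
    per_char = []
    for i in range(len(sent)):
        window = logp[max(0, i - 1):i + 1]          # the bigrams that contain character i
        per_char.append(sum(window) / len(window) if window else 0.0)
    mean = sum(per_char) / len(per_char)
    return [abs(s - mean) for s in per_char]

def ppl(sent):
    """S334: bigram perplexity PP(S) of the whole sentence."""
    logp = sum(math.log(prob((sent[i], sent[i + 1]))) for i in range(len(sent) - 1))
    return math.exp(-logp / max(1, len(sent) - 1))

CONFUSION = {"栏": ["标"]}          # hypothetical homophone / similar-glyph candidates (S333)
SCORE_T, PPL_T = 0.3, 500.0        # hypothetical thresholds for S331 and S335

def ngram_correct(sent):
    scores = deviation_scores(sent)
    for i, ch in enumerate(sent):
        if scores[i] > SCORE_T and ch in CONFUSION:                  # S333: suspected error
            for cand in CONFUSION[ch]:                               # S334: try each candidate
                repl = sent[:i] + cand + sent[i + 1:]
                if ppl(repl) < PPL_T and ppl(repl) < ppl(sent):      # S335: accept if more fluent
                    sent = repl
    return sent                                                      # S336: all candidates traversed

print(ngram_correct("投栏人应当按照招标文件要求编制投栏文件"))
```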
S400: performing error correction on the preliminarily screened text with the model error correction module, passing it through the MacBERT4CSC model and the T5 model in sequence; the T5 model then outputs the bidding document with error-correction labels.
Advantages of the MacBERT4CSC error correction model: any BERT model can be chosen for the network structure of MacBERT4CSC; its main characteristic is a different MLM task design during pre-training. Whole word masking (wwm) and an N-gram masking strategy are used to select candidate tokens for masking. BERT-like models typically mask the original word with [MASK], whereas MacBERT4CSC uses a third-party synonym tool to generate a similar word for the target word and masks it with that word; in particular, when the original word has no similar word, a random n-gram is used for masking. The usage is similar to the BERT pre-training model, but the MacBERT4CSC model requires no additional training and is better suited to Chinese error-correction input: 80% of the masked words are replaced with similar words (in place of [MASK]), 10% with random words, and 10% are left unchanged. This network structure is convenient for error-correction application scenarios.
Advantages of the T5 model: it is an improved classical Transformer model with both an encoder and a decoder, and is more robust at processing a given input and producing the corresponding output.
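For reference, one plausible way to chain two publicly released pre-trained correctors with the Hugging Face transformers library is sketched below; the checkpoint names and the crude length alignment are assumptions of this illustration, not the weights or post-processing actually used by the invention.

```python
import torch
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          T5Tokenizer, T5ForConditionalGeneration)

MACBERT = "shibing624/macbert4csc-base-chinese"           # assumed public CSC checkpoint
T5_CSC = "shibing624/mengzi-t5-base-chinese-correction"   # assumed public T5 correction checkpoint

mac_tok = BertTokenizerFast.from_pretrained(MACBERT)
mac_model = BertForMaskedLM.from_pretrained(MACBERT).eval()
t5_tok = T5Tokenizer.from_pretrained(T5_CSC)
t5_model = T5ForConditionalGeneration.from_pretrained(T5_CSC).eval()

def macbert_stage(text: str) -> str:
    """Character-level spelling correction: take the argmax token at every input position."""
    inputs = mac_tok(text, return_tensors="pt")
    with torch.no_grad():
        ids = mac_model(**inputs).logits.argmax(dim=-1)[0]
    out = mac_tok.decode(ids, skip_special_tokens=True).replace(" ", "")  # Chinese text only
    return out[:len(text)] if out else text            # crude alignment with the input length

def t5_stage(text: str) -> str:
    """Sentence-level rewrite: the seq2seq T5 model generates the corrected sentence."""
    inputs = t5_tok(text, return_tensors="pt")
    with torch.no_grad():
        out_ids = t5_model.generate(**inputs, max_length=inputs["input_ids"].shape[1] + 8)
    return t5_tok.decode(out_ids[0], skip_special_tokens=True)

def model_correct(text: str) -> str:
    """S400: run the rule-screened text through MacBERT4CSC first, then through T5."""
    return t5_stage(macbert_stage(text))
```

In a fuller pipeline, the MacBERT4CSC stage would only touch characters whose predicted token differs from the input, and the T5 output would be diffed against its input to produce the error-correction labels; both refinements are omitted here for brevity.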
With this technical scheme, a corresponding operating system can be built. The system fuses several algorithms and rules, and its accuracy can basically meet the requirements. The specific operating procedure is as follows: a user logs into the operating system in a browser, uploads the bidding document to be checked, and clicks the check button, after which the entire checking process is completed automatically; once the model finishes checking, the positions of content errors, i.e. the error-correction content, are shown with highlighting, and the user clicks download to obtain the checked file; the system also supports online editing, so the user can further check and edit according to the model's results.
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (4)

1. A method for correcting errors in the text content of a bidding document, characterized by comprising the following steps:
s100: selecting a plurality of bidding documents;
s200: constructing a text content error correction model, wherein the text content error correction model comprises a rule error correction module and a model error correction module;
the rule error correction module comprises dictionary rules, an edit distance and an n-gram language model, wherein the dictionary rules consist of industry-specific professional terms and general grammar rules;
the model error correction module comprises a MacBERT4CSC model and a T5 model;
s300: performing error correction on the bidding documents using the dictionary rules, the edit distance and the n-gram language model to obtain a preliminarily screened corrected text;
s400: performing error correction on the preliminarily screened text with the model error correction module, passing it through the MacBERT4CSC model and the T5 model in sequence, the T5 model then outputting the bidding document with error-correction labels.
2. The method for correcting errors in the text content of a bidding document according to claim 1, characterized in that in S300, performing preliminary error-correction screening on the bidding documents using the dictionary rules and the edit distance to obtain the preliminarily screened corrected text comprises the following steps:
s310: correcting errors in the bidding documents based on the dictionary rules and labeling the corrected content to obtain a mapping dictionary;
s320: performing error correction with the mapping dictionary based on the edit distance and labeling the corrected content to obtain a corrected text, wherein the edit distance is defined as follows:
$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\operatorname{lev}_{a,b}(i-1,j)+1\\ \operatorname{lev}_{a,b}(i,j-1)+1\\ \operatorname{lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b are the two strings, i denotes the first i characters of a, j denotes the first j characters of b, and lev_{a,b}(i, j) is the Levenshtein distance between the first i characters of a and the first j characters of b;
s330: performing error correction on the corrected text based on the n-gram language model to obtain the preliminarily screened corrected text.
3. The method for correcting errors in the text content of a bidding document according to claim 2, characterized in that the specific steps of performing error correction on the corrected text based on the n-gram language model in S330 are as follows:
s331: setting the parameter value of the sliding window algorithm to n and setting a score threshold;
s332: scoring each word in the corrected text with n-gram probabilities using the sliding window algorithm and the n-gram language model, and applying a mean-absolute-deviation calculation to the scores to obtain a deviation score for each word;
s333: comparing each word's deviation score with the score threshold, marking words whose score exceeds the threshold as suspected errors, and replacing the suspected errors with homophone and similar-glyph candidates from the dictionary rules to obtain an error-correction candidate set;
s334: substituting the t-th replacement word from the candidate set into its original position in the corrected text, i.e. the position the word occupied in the sentence before replacement, and computing a sentence fluency score for the substituted sentence using the PPL perplexity, calculated as:
$$PP(S)=P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}=\sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$
where PP(·) denotes the perplexity, S denotes the sentence, N denotes the sentence length, P(w_i) denotes the probability of the i-th word, and w_i denotes the i-th word of the sentence S;
s335: setting a perplexity threshold; when the computed sentence fluency score is below the perplexity threshold, marking the original word corresponding to the t-th replacement word as an error-correction word, and otherwise treating the original word as correctly used;
s336: traversing all words in the error-correction candidate set to obtain the preliminarily screened corrected text.
4. The method for correcting errors in the text content of a bidding document according to claim 3, characterized in that the bidding documents selected in S100 are in word format.
CN202211525733.8A, priority and filing date 2022-11-30: Method for correcting text content of bidding document. Pending. Published as CN115935964A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211525733.8A CN115935964A (en) 2022-11-30 2022-11-30 Method for correcting text content of bidding document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211525733.8A CN115935964A (en) 2022-11-30 2022-11-30 Method for correcting text content of bidding document

Publications (1)

Publication Number Publication Date
CN115935964A true CN115935964A (en) 2023-04-07

Family

ID=86652009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211525733.8A Pending CN115935964A (en) 2022-11-30 2022-11-30 Method for correcting text content of bidding document

Country Status (1)

Country Link
CN (1) CN115935964A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257602A (en) * 2023-05-16 2023-06-13 北京拓普丰联信息科技股份有限公司 Method and device for constructing universal word stock based on public words and electronic equipment
CN116257602B (en) * 2023-05-16 2023-07-07 北京拓普丰联信息科技股份有限公司 Method and device for constructing universal word stock based on public words and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination