CN115310434B - Error correction method and device for grammars of contracting documents, computer equipment and storage medium - Google Patents

Error correction method and device for grammars of contracting documents, computer equipment and storage medium

Info

Publication number
CN115310434B
Authority
CN
China
Prior art keywords
error
contract
error correction
processed
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211238213.9A
Other languages
Chinese (zh)
Other versions
CN115310434A (en
Inventor
顾敏
杜向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qingdun Information Technology Co ltd
Original Assignee
Shenzhen Qingdun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qingdun Information Technology Co ltd
Priority to CN202211238213.9A
Publication of CN115310434A
Application granted
Publication of CN115310434B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application relates to an error correction method and device for contract document grammar, a computer device and a storage medium. The method comprises: collecting contract field data, and creating a contract document error correction data set based on the contract field data; acquiring contract data to be processed, and identifying error sentences to be processed in the contract data to be processed; acquiring correct contract sentences and annotated contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the annotated contract sentences and the error sentences to be processed into a preset model for training, so as to identify the error type corresponding to each error sentence to be processed; acquiring an error correction mode corresponding to the error type, and performing error correction processing on the error sentence to be processed in that error correction mode to obtain a correct contract document; and crawling the definitions of each word in the correct contract document to generate an error-correction definition knowledge base. The method and the device correct contract document grammar for different error types, which helps improve the accuracy of grammatical error correction for contract documents.

Description

Error correction method and device for grammars of contracting documents, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an error correction method and device for contract document grammar, a computer device, and a storage medium.
Background
As the economy grows, the number of enterprise contracts increases day by day, and the grammar correction workload grows with it. Traditional manual proofreading is slow, labor-intensive and inefficient, so fast and efficient automatic grammatical error correction of contract documents is an urgent need in contract review. The Chinese grammatical error correction task aims to automatically recognize and correct grammatical errors in text using natural language processing technology. Grammatical errors include missing words, redundant words, wrongly written characters, misused words, and wrong word order. For contract text, error correction leans toward risk warning, and focuses on correcting the content of contract elements and the names of institutions in dispute resolution clauses. If wrongly written characters or misused words appear in these elements, the contract may give rise to disputes.
Currently, grammatical error correction mainly adopts two types of methods: the first identifies the error type and then corrects the error in a targeted manner according to that type; the second borrows the idea of machine translation and treats correction as translating erroneous text into correct text. However, grammatical error correction of contract documents differs from general grammatical error correction and requires more legal expertise. In this task, correcting and annotating erroneous data is difficult, and only people with legal knowledge can annotate it accurately, so a large amount of annotated data is hard to obtain. As a result, a grammatical error correction model based on a neural network is difficult to train sufficiently and cannot learn effective features, so the accuracy of grammatical error correction for contract documents is low.
Disclosure of Invention
The embodiments of the application aim to provide an error correction method and device for contract document grammar, a computer device and a storage medium, so as to improve the accuracy of grammatical error correction for contract documents.
In order to solve the above technical problem, an embodiment of the present application provides an error correction method for contract document grammar, including:
collecting contract field data, and creating a contract document error correction data set based on the contract field data, wherein the contract document error correction data set comprises an unknown-word lexicon, a confusion data set and a contract error correction data set;
acquiring contract data to be processed, and identifying error sentences to be processed in the contract data to be processed;
acquiring correct contract sentences and annotated contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the annotated contract sentences and the error sentences to be processed into a preset model for training, so as to identify the error type corresponding to each error sentence to be processed;
acquiring an error correction mode corresponding to the error type, and performing error correction processing on the error sentence to be processed in that error correction mode to obtain a correct contract document;
and crawling the definitions of each word in the correct contract document to generate an error-correction definition knowledge base.
In order to solve the above technical problem, an embodiment of the present application provides an error correction device for contract document grammar, including:
a contract document error correction data set creating module, used for collecting contract field data and creating a contract document error correction data set based on the contract field data, wherein the contract document error correction data set comprises an unknown-word lexicon, a confusion data set and a contract error correction data set;
an error sentence identification module, used for acquiring contract data to be processed and identifying error sentences to be processed in the contract data to be processed;
an error type identification module, used for acquiring correct contract sentences and annotated contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the annotated contract sentences and the error sentences to be processed into a preset model for training, so as to identify the error type corresponding to each error sentence to be processed;
a correct contract document generating module, used for acquiring an error correction mode corresponding to the error type and performing error correction processing on the error sentence to be processed in that error correction mode to obtain a correct contract document;
and an error-correction definition knowledge base generation module, used for crawling the definitions of each word in the correct contract document to generate an error-correction definition knowledge base.
In order to solve the above technical problem, another technical solution adopted by the invention is: a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the error correction method for contract document grammar described in any one of the above.
The embodiments of the invention provide an error correction method and device for contract document grammar, a computer device and a storage medium. The method comprises: collecting contract field data, and creating a contract document error correction data set based on the contract field data, wherein the contract document error correction data set comprises an unknown-word lexicon, a confusion data set and a contract error correction data set; acquiring contract data to be processed, and identifying error sentences to be processed in the contract data to be processed; acquiring correct contract sentences and annotated contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the annotated contract sentences and the error sentences to be processed into a preset model for training, so as to identify the error type corresponding to each error sentence to be processed; acquiring an error correction mode corresponding to the error type, and performing error correction processing on the error sentence to be processed in that error correction mode to obtain a correct contract document; and crawling the definitions of each word in the correct contract document to generate an error-correction definition knowledge base.
In the embodiments of the invention, the contract document error correction data set is created to meet the requirements of different contract types. The unknown-word lexicon is then used as the word segmentation lexicon to identify the error sentences to be processed, the error type corresponding to each error sentence is determined, error correction is performed on the error sentence according to its error type, and the definition of each word is crawled. Contract document grammar is thereby corrected for different error types, which helps improve the accuracy of grammatical error correction for contract documents.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of an implementation of the error correction method for contract document grammar provided by an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of a sub-process of the error correction method for contract document grammar provided in an embodiment of the present application;
FIG. 3 is a flowchart of another implementation of a sub-process of the error correction method for contract document grammar provided in an embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process of the error correction method for contract document grammar provided in an embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process of the error correction method for contract document grammar provided in an embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process of the error correction method for contract document grammar provided in an embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process of the error correction method for contract document grammar provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of the error correction device for contract document grammar provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a computer device provided in an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail below with reference to the accompanying drawings and embodiments.
It should be noted that the error correction method for contract document grammar provided in the embodiments of the present application is generally executed by a server, and accordingly, the error correction device for contract document grammar is generally configured in the server.
Referring to FIG. 1, FIG. 1 shows an embodiment of the error correction method for contract document grammar.
It should be noted that, provided the result is substantially the same, the method of the present invention is not limited to the flow sequence shown in FIG. 1. The method includes the following steps:
S1: collecting contract field data and creating a contract document error correction data set based on the contract field data, wherein the contract document error correction data set comprises an unknown-word lexicon, a confusion data set and a contract error correction data set.
Specifically, because contract document corpora are long, hard to collect and varied in content, and no public contract document error correction data set currently exists, the embodiment of the present application generates the contract document error correction data set by collecting contract field data and processing it. The contract document error correction data set comprises an unknown-word lexicon, a confusion data set and a contract error correction data set.
Referring to FIG. 2, FIG. 2 shows an embodiment of step S1, which is described in detail as follows:
S11: acquiring contract field data, and combining the characters in the contract field data in pairs to obtain candidate words.
S12: constructing a prefix dictionary tree and a suffix dictionary tree of the candidate words, wherein the prefix dictionary tree and the suffix dictionary tree take a single character as a node, and each node records the occurrence frequency of the word formed from the root node to the current node.
S13: acquiring the frequency lists of the prefix dictionary tree and the suffix dictionary tree, and calculating the left and right information entropies of each candidate word and of the segments that form it.
S14: screening unknown words from the contract field data based on the left and right information entropies and the frequency lists to obtain the unknown-word lexicon.
S15: constructing a confusion data set and a contract error correction data set from the contract field data.
Specifically, when the contract document error correction data set is constructed, word segmentation of in-domain data may split some unknown words incorrectly, which causes errors in subsequent error correction. In the embodiment of the application, characters in the contract field data are combined pairwise to serve as candidate words, and a prefix dictionary tree and a suffix dictionary tree are then constructed from 3-gram sequences, wherein both trees use single characters as nodes, and each node records the occurrence frequency of the word formed from the root node to the current node. Because each node records this frequency, the frequency lists of the prefix dictionary tree and the suffix dictionary tree can be obtained and used to calculate the left and right information entropies of each candidate word and of the segments that form it. The candidates are then screened by their left and right information entropies, and candidate words whose left and right information entropies are zero are selected as unknown words to be confirmed. Next, the product of the word frequency and the score of each unknown word to be confirmed is calculated, the candidates are sorted by this product, and a preset number of candidate words is selected to obtain the unknown-word lexicon. For example, the screened unknown words include: unemployment insurance benefits, joint-operation shareholding system, grant of state-owned land use rights, China Post, professional post operator, and air waybill. The left and right information entropies refer to the entropy of the left boundary and the entropy of the right boundary of a multi-word expression.
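The left and right boundary entropies described above can be illustrated with a minimal Python sketch. It is not the implementation of the application: for brevity it replaces the prefix and suffix dictionary trees with flat counters, and the function name `boundary_entropies` and the parameter `max_len` are hypothetical. It only computes the frequency and the two entropies for each candidate; the screening rule and the frequency-times-score ranking are left to the caller.

```python
import math
from collections import Counter, defaultdict


def boundary_entropies(corpus, max_len=3):
    """Return {candidate: (freq, left_entropy, right_entropy)}.

    Candidates are all character n-grams of length 2..max_len.
    Flat counters stand in for the prefix/suffix dictionary trees (assumption).
    """
    freq = Counter()
    left = defaultdict(Counter)   # candidate -> counts of the preceding character
    right = defaultdict(Counter)  # candidate -> counts of the following character
    for text in corpus:
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                cand = text[i:i + n]
                freq[cand] += 1
                if i > 0:
                    left[cand][text[i - 1]] += 1
                if i + n < len(text):
                    right[cand][text[i + n]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return {c: (freq[c], entropy(left[c]), entropy(right[c])) for c in freq}
```

A caller would combine these values with its own frequency threshold and scoring rule to pick the preset number of unknown words.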
Referring to FIG. 3, FIG. 3 shows an embodiment of step S15, which is described in detail as follows:
S151: returning the contract field data to the development end, so that the development end annotates and proofreads the contract field data to obtain a confusion data set.
S152: identifying error sentences in the contract field data and the error types corresponding to the error sentences.
S153: processing the error sentences according to the error types and the confusion data set to obtain an initial data set, wherein the initial data set comprises a plurality of contract document data sets.
S154: calculating the perplexity of each contract document data set in the initial data set, and screening the contract document data sets based on the perplexity to obtain the contract error correction data set.
Specifically, part of the contract field data is selected and returned to the development end, where it is annotated and proofread by a legal engineer to obtain the confusion data set. Entity positions in the contract field data are then filtered, and the error sentences in the contract data and their corresponding error types are identified. For a redundancy-type error, 10% of the words are randomly selected from the confusion data set to replace the original contract text, or 10% of characters or words are randomly added to the sentence; for a missing-type error, 10% of the words are randomly selected from the confusion data set to replace the original contract text, or 10% of the characters or words in the sentence are randomly deleted. This yields the initial data set. Finally, the perplexity of each contract document data set in the initial data set is calculated, the contract document data sets are screened based on the perplexity, and the initial data with low perplexity is selected, thereby obtaining the contract error correction data set.
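As a rough illustration of the random add and delete operations at the 10% ratio mentioned above, the following Python sketch may help. The function names are hypothetical, and the choice to add characters by duplicating existing ones is an assumption, since the text does not say which characters are added. Replacement from the confusion data set and the perplexity screening are not shown.

```python
import random


def make_redundant(sentence, ratio=0.1, rng=None):
    """Insert extra characters at random positions (a 'redundant' error).

    Assumption: the added character is a copy of the character at that position.
    """
    rng = rng or random.Random()
    chars = list(sentence)
    if not chars:
        return sentence
    k = max(1, int(len(chars) * ratio))
    for _ in range(k):
        pos = rng.randrange(len(chars))
        chars.insert(pos, chars[pos])
    return "".join(chars)


def make_missing(sentence, ratio=0.1, rng=None):
    """Delete characters at random positions (a 'missing' error)."""
    rng = rng or random.Random()
    chars = list(sentence)
    if not chars:
        return sentence
    k = max(1, int(len(chars) * ratio))
    # Delete from the right so earlier indices stay valid.
    for pos in sorted(rng.sample(range(len(chars)), k), reverse=True):
        del chars[pos]
    return "".join(chars)
```

Passing a seeded `random.Random` makes the generated data set reproducible, which is convenient when the perplexity screening is tuned afterwards.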
S2: acquiring contract data to be processed, and identifying error sentences to be processed in the contract data to be processed.
Specifically, when contract data needs to be corrected, the contract data to be processed is acquired. To avoid subsequent errors caused by word segmentation errors, word segmentation and entity recognition are performed on the contract data to be processed, so as to predict the error sentences to be processed.
Referring to FIG. 4, FIG. 4 shows an embodiment of step S2, which is described in detail as follows:
S21: acquiring contract data to be processed.
S22: performing word segmentation and entity recognition on the contract data to be processed, and judging whether the contract data to be processed contains error words.
S23: if error words exist, counting the word frequency of each error word, and judging whether the error word belongs to the unknown words in the unknown-word lexicon.
S24: if the error word belongs to the unknown words and its word frequency exceeds a preset threshold, taking the sentence containing the error word as an error sentence to be processed.
Specifically, word segmentation and entity recognition are performed on the contract data to be processed, and each word position is checked for errors. If a position is erroneous, the frequency of the word at that position is counted and the word is checked against the unknown-word lexicon. If the error word belongs to the unknown words and its frequency exceeds a preset threshold, the sentence containing the error word is taken as an error sentence to be processed. The preset threshold is set according to actual conditions and is not limited herein.
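The selection rule of steps S23 and S24 can be sketched in Python as follows. This is only an illustration under stated assumptions: the sentences are already segmented into tokens, the set of error words (`flagged`) comes from an upstream check that the text does not specify, and all names are hypothetical.

```python
from collections import Counter


def select_error_sentences(sentences, flagged, unknown_lexicon, threshold):
    """Return the sentences that contain a qualifying error word.

    sentences: list of token lists (already segmented).
    flagged: set of tokens judged erroneous by an upstream check (assumption).
    A word qualifies if it is in the unknown-word lexicon and its
    frequency among flagged tokens exceeds the threshold.
    """
    freq = Counter(tok for sent in sentences for tok in sent if tok in flagged)
    targets = {
        tok for tok, n in freq.items()
        if tok in unknown_lexicon and n > threshold
    }
    return [sent for sent in sentences if any(tok in targets for tok in sent)]
```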
S3: acquiring correct contract sentences and annotated contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the annotated contract sentences and the error sentences to be processed into a preset model for training, so as to identify the error type corresponding to each error sentence to be processed.
Referring to FIG. 5, FIG. 5 shows an embodiment of step S3, which is described in detail as follows:
S31: acquiring correct contract sentences and annotated contract sentences based on the contract document error correction data set.
S32: inputting the correct contract sentences, the annotated contract sentences and the error sentence to be processed into a sequence labeling model for training, so as to identify the error type tag sequence corresponding to the error sentence to be processed.
S33: generating the error type corresponding to the error sentence to be processed based on the error type tag sequence.
Specifically, the contract document error correction data set comprises an unknown-word lexicon, a confusion data set and a contract error correction data set, and the embodiment of the application acquires the correct contract sentences and annotated contract sentences from the confusion data set and the contract error correction data set. The prediction of error types is treated as a sequence labeling problem in which normal words, error words and their positions are labeled: for example, correct positions are labeled O, the starting position of an error is labeled B-X, and its intermediate and ending positions are labeled I-X, where X represents the error type. The tags are B-R and I-R (redundant), B-M and I-M (missing), and B-W and I-W (wrongly written characters).
For example, if the error sentence to be processed is "principal vector" and the correct contract sentence is "annotation vector", the input of the error recognition part is "principal vector" and the output of the model is the sequence "O B-W O", which indicates that the second character is a wrongly written character.
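To make the O / B-X / I-X tagging scheme concrete, the following Python sketch decodes a predicted tag sequence into error spans. The function name and the span format (start, exclusive end, error type) are hypothetical and chosen only for illustration.

```python
def decode_error_spans(tags):
    """Convert tags such as ['O', 'B-W', 'I-W', 'O'] into (start, end, type) spans.

    end is exclusive; type is the X in B-X / I-X (R, M or W).
    """
    spans = []
    start, kind = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close the previous span
                spans.append((start, i, kind))
            start, kind = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == kind:
            continue                       # span continues
        else:
            if start is not None:          # 'O' or inconsistent tag ends the span
                spans.append((start, i, kind))
            start, kind = None, None
    if start is not None:
        spans.append((start, len(tags), kind))
    return spans
```

For the tag sequence in the example above, `decode_error_spans(["O", "B-W", "O"])` yields a single wrongly-written-character span covering the second position.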
When the correct contract sentences, the annotated contract sentences and the error sentence to be processed are input into the sequence labeling model for training, a vector representation of the error sentence to be processed is obtained from the pre-trained model BERT, the context representation of the error sentence is encoded with a bidirectional long short-term memory (BiLSTM) network, and the error type tag sequence corresponding to the error sentence is finally predicted in a conditional random field (CRF) layer. For a sentence to be processed:
[formula image not reproduced in the source text]
a vector representation of the sentence is obtained based on the pre-trained model BERT, expressed as:
[formula image not reproduced in the source text]
In the embodiment of the application, the bidirectional long short-term memory network is trained with the negative log-likelihood function as the loss function, and the Adam optimizer is used for optimization.
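For reference, the negative log-likelihood loss of a linear-chain conditional random field is commonly written as below. This is the standard textbook form, not a formula taken from the application, and every symbol is an assumption introduced here because the original formulas are not available in the text.

```latex
% Assumed notation (not from the source):
%   X            input sentence of n characters
%   y            gold tag sequence (y_1, ..., y_n)
%   P_{i,t}      emission score of tag t at position i (from the BiLSTM output)
%   A_{s,t}      transition score from tag s to tag t (CRF parameter)
s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i}

p(y \mid X) = \frac{\exp s(X, y)}{\sum_{y'} \exp s(X, y')}

\mathcal{L} = -\log p(y \mid X) = \log \sum_{y'} \exp s(X, y') - s(X, y)
```

Minimizing this loss raises the score of the gold tag sequence relative to all other tag sequences; at prediction time the highest-scoring sequence is typically found with the Viterbi algorithm.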
S4: acquiring an error correction mode corresponding to the error type, and performing error correction processing on the error sentence to be processed in that error correction mode to obtain the correct contract document.
Specifically, the above steps have identified the error sentence to be processed and its error type, and the embodiments of the present application adopt different error correction modes for different error types. Therefore, the error correction mode corresponding to the error type is acquired first, and the error sentence to be processed is then corrected in that mode to obtain the correct contract document.
Referring to FIG. 6, FIG. 6 shows an embodiment of step S4, which is described in detail as follows:
S41: if the error type is a missing error, acquiring filling words from a pre-trained language model, and correcting the error sentence to be processed with the filling words to obtain the correct contract document.
Referring to FIG. 7, FIG. 7 shows an embodiment of step S41, which is described in detail as follows:
S411: if the error type is a missing error, acquiring contract professional files, and fine-tuning the pre-trained language model with the contract professional files.
S412: predicting a set of filling words with the pre-trained language model through a beam search algorithm, wherein the set comprises a plurality of filling words.
S413: identifying the missing position in the error sentence to be processed, and filling the missing position with each of the filling words respectively to obtain a filling data set, wherein the filling data set comprises a plurality of filling data.
S414: calculating the perplexity of the filling data in the filling data set to obtain a target perplexity, and screening the filling data based on the target perplexity to obtain the correct contract document.
Specifically, if the error type is a missing error, contract professional files are acquired and the pre-trained language model is fine-tuned with them, wherein the contract professional files include contract documents, contract law, contract-related court judgment documents and the like. A set of filling words is then predicted with the pre-trained language model using a beam search algorithm, the set comprising a plurality of filling words. The missing position in the error sentence to be processed is identified, preset characters such as 0 to 3 [MASK] characters are inserted at the missing position, and the missing position is filled with each of the filling words respectively to obtain a filling data set. The perplexity of the filling data in the filling data set is calculated to obtain a target perplexity, and the filling data is screened based on the target perplexity to obtain the correct contract document. If the perplexity of the completed sentence is lower than that of the original sentence, the filling data is judged to be a correct correction and is added to the correct contract document. Beam search is used because, when the solution space of the search graph is large, it reduces the space and time occupied by the search: at each depth expansion step, some lower-quality nodes are pruned and some higher-quality nodes are retained.
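The interaction between beam search (cluster search) filling and perplexity (confusion degree) screening can be sketched in Python as follows. This is an illustration only: the fine-tuned pre-trained language model is replaced by an arbitrary callable `log_prob(context, char)` that returns a natural-log probability, filling is done character by character from left to right, and all function names and the `beam_width` parameter are hypothetical.

```python
import math


def perplexity(sentence, log_prob):
    """Perplexity of a sentence under log_prob(context, char) -> log probability."""
    total = sum(log_prob(sentence[:i], ch) for i, ch in enumerate(sentence))
    return math.exp(-total / max(1, len(sentence)))


def beam_fill(prefix, suffix, vocab, log_prob, n_fill, beam_width=3):
    """Beam search over n_fill characters inserted between prefix and suffix."""
    beams = [("", 0.0)]
    for _ in range(n_fill):
        expanded = [
            (fill + ch, score + log_prob(prefix + fill, ch))
            for fill, score in beams
            for ch in vocab
        ]
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]  # prune low-scoring partial fills
    return [prefix + fill + suffix for fill, _ in beams]


def best_correction(original, prefix, suffix, vocab, log_prob, max_fill=3):
    """Try 1..max_fill inserted characters; accept a fill only if it lowers perplexity."""
    best, best_ppl = original, perplexity(original, log_prob)
    for n in range(1, max_fill + 1):
        for cand in beam_fill(prefix, suffix, vocab, log_prob, n):
            ppl = perplexity(cand, log_prob)
            if ppl < best_ppl:
                best, best_ppl = cand, ppl
    return best
```

Keeping the original sentence as the baseline mirrors the rule that a completed sentence counts as a correct correction only when its perplexity is lower than that of the original.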
S42: if the error type is a redundancy error, identifying the starting error position and the ending error position in the error sentence to be processed, and deleting the characters from the starting error position to the ending error position to obtain the correct contract document.
S43: if the error type is an institution name error, constructing a target prefix dictionary tree from the legal knowledge graph, screening candidates in the target prefix dictionary tree according to edit distance, and correcting the error sentence to be processed with the candidates to obtain the correct contract document.
Specifically, if the error type is a redundancy error, the start error position and the end error position in the to-be-processed error sentence are identified, and the characters from the start error position to the end error position are deleted to obtain the correct contract document. If the error type is an institution name error, a target prefix trie is constructed from the legal knowledge graph, candidate entities are screened out in the target prefix trie according to edit distance, and the to-be-processed error sentence is corrected with the candidate entities to obtain the correct contract document. For administrative-division mismatch errors, the corresponding institution is recommended according to the recognized province and city. Institution name errors fall into four cases:
(1) General errors: wrong, missing or extra characters. For example: "Disputes arising from this contract may be resolved by the parties through negotiation; if negotiation fails, either party may apply for arbitration to the Beijing City Arbitration Commission" is corrected to "the Beijing Arbitration Commission".
(2) Court level errors: the higher, intermediate and basic-level people's courts are confused. For example: "Settlement of disputes: disputes arising during the performance of this contract shall be resolved by the parties through negotiation; if negotiation fails, both parties agree to file suit with the Higher People's Court of Dongguan City" is corrected to "the Intermediate People's Court of Dongguan City".
(3) Administrative-division common-sense errors: the administrative divisions do not match. For example: "Article 12: disputes arising from this contract shall be under the jurisdiction of the People's Court of Yuhuatai District, Nanjing City, Anhui Province" is corrected to "Yuhuatai District, Nanjing City, Jiangsu Province". As a further example: "All disputes arising from or relating to this contract shall, if negotiation fails, be brought before the People's Court of Baiyun District, Shenzhen City" is corrected to "the People's Court of Baiyun District, Guangzhou City" or "the People's Court of Baiyun District, Guiyang City".
(4) Arbitration institution name errors: the named arbitration institution does not exist. For example: "Disputes arising during the performance of this contract shall be resolved by the parties through friendly negotiation or mediated by a third party; if negotiation or mediation fails, the dispute may be submitted for arbitration to the International Arbitration Center of Beijing City" is corrected to "the Beijing International Arbitration Center".
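The candidate screening for institution name errors can be sketched as an edit-distance search over a prefix trie. This is a minimal illustration under stated assumptions: the institution names are passed in as a plain list standing in for entities drawn from the legal knowledge graph, and `TrieNode`, `build_trie` and `search` are hypothetical names. The search computes one row of the Levenshtein table per trie node and abandons a branch once every entry in the row exceeds the allowed distance.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.word = None     # full institution name if one ends here

def build_trie(names):
    """Build a prefix trie with single characters as nodes."""
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.word = name
    return root

def search(root, query, max_dist):
    """Return (distance, name) pairs with edit distance <= max_dist,
    sorted so that the closest candidate comes first."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_dist)
    return sorted(results)

def _walk(node, ch, query, prev_row, results, max_dist):
    # Extend the Levenshtein table by one row for the trie character `ch`.
    row = [prev_row[0] + 1]
    for i in range(1, len(query) + 1):
        row.append(min(
            row[i - 1] + 1,                           # skip a query character
            prev_row[i] + 1,                          # skip the trie character
            prev_row[i - 1] + (query[i - 1] != ch),   # match or substitute
        ))
    if node.word is not None and row[-1] <= max_dist:
        results.append((row[-1], node.word))
    # Prune: no descendant can come back under max_dist once the row minimum exceeds it.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, results, max_dist)

institutions = [
    "Beijing Arbitration Commission",
    "Beijing International Arbitration Center",
    "Shanghai Arbitration Commission",
]
trie = build_trie(institutions)
print(search(trie, "Beijing City Arbitration Commission", max_dist=5))
```

Sharing prefixes in the trie means the table rows for a common prefix are computed once for all names that start with it, which is the main reason to prefer this over comparing the query with every name separately.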
S5: Crawling the paraphrase of each word in the correct contract document to generate an error-correction paraphrase knowledge base.
Specifically, based on the resulting set of correct contract documents, the paraphrase (definition) of each word in the correct contract documents is crawled from a Chinese dictionary or from Baidu Baike to generate the error-correction paraphrase knowledge base.
In this embodiment, contract-domain data are collected and a contract document error correction data set is created from them, the data set comprising an out-of-vocabulary word lexicon, a confusion data set and a contract error correction data set; to-be-processed contract data are acquired and the to-be-processed error sentences in them are identified; correct contract sentences and labeled contract sentences are acquired from the contract document error correction data set, and the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences are input into a preset model for training, so as to identify the error type of each to-be-processed error sentence; the error correction mode corresponding to the error type is acquired and applied to the to-be-processed error sentence to obtain a correct contract document; and the paraphrase of each word in the correct contract document is crawled to generate an error-correction paraphrase knowledge base. By creating the contract document error correction data set, the embodiment of the invention meets the requirements of different contract types. By using the out-of-vocabulary word lexicon as the word segmentation lexicon, identifying the to-be-processed error sentences, confirming the error type of each, correcting each sentence according to its error type, and crawling the paraphrase of each word, the embodiment corrects grammar errors of different types in contract documents and improves the accuracy of contract document grammar correction.
Referring to fig. 7, as an implementation of the method shown in fig. 1, the present application provides an embodiment of an error correction apparatus for contract document grammar. The apparatus embodiment corresponds to the method embodiment shown in fig. 1, and the apparatus can be applied to various electronic devices.
As shown in fig. 7, the error correction apparatus for contract document grammar of this embodiment includes: a contract document error correction data set creating module 61, a to-be-processed error sentence identification module 62, an error type identification module 63, a correct contract document generating module 64 and an error-correction paraphrase knowledge base generating module 65, wherein:
the contract document error correction data set creating module 61 is used for collecting contract-domain data and creating a contract document error correction data set based on the contract-domain data, wherein the contract document error correction data set comprises an out-of-vocabulary word lexicon, a confusion data set and a contract error correction data set;
the to-be-processed error sentence identification module 62 is used for acquiring to-be-processed contract data and identifying the to-be-processed error sentences in the to-be-processed contract data;
the error type identification module 63 is used for acquiring correct contract sentences and labeled contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences into a preset model for training, so as to identify the error type corresponding to each to-be-processed error sentence;
the correct contract document generating module 64 is used for acquiring the error correction mode corresponding to the error type and performing error correction processing on the to-be-processed error sentence in that error correction mode to obtain a correct contract document;
the error-correction paraphrase knowledge base generating module 65 is used for crawling the paraphrase of each word in the correct contract document to generate an error-correction paraphrase knowledge base.
Further, the contract document error correction data set creating module 61 includes:
the candidate word generating unit is used for acquiring the contract-domain data and combining the characters in the contract-domain data in pairs to obtain candidate words;
the trie construction unit is used for constructing a prefix trie and a suffix trie of the candidate words, wherein the prefix trie and the suffix trie take single characters as nodes, and each node records the occurrence frequency of the word formed from the root node to the current node;
the information entropy calculation unit is used for acquiring the frequency lists of the prefix trie and the suffix trie and calculating the left and right information entropy of each candidate word and of each segment formed by the candidate words;
the out-of-vocabulary word screening unit is used for screening out-of-vocabulary words from the contract-domain data based on the left and right information entropy and the frequency lists to obtain the out-of-vocabulary word lexicon;
and the data set construction unit is used for constructing the confusion data set and the contract error correction data set of the contract-domain data.
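The left and right information entropy screening performed by these units can be sketched as follows. This is a simplified illustration with stated assumptions: plain `Counter` tables stand in for the frequency lists that would be read from the prefix and suffix tries, candidates are character pairs, and the thresholds `min_freq` and `min_entropy` as well as the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def neighbor_counts(text, n=2):
    """Count each n-character candidate and the characters seen on its
    left and right. Assumption: these tables replace the trie frequency lists."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        cand = text[i:i + n]
        freq[cand] += 1
        if i > 0:
            left[cand][text[i - 1]] += 1
        if i + n < len(text):
            right[cand][text[i + n]] += 1
    return freq, left, right

def entropy(counter):
    """Shannon entropy (natural log) of a neighbor distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counter.values())

def discover_words(text, known_words, min_freq=2, min_entropy=0.5):
    """Keep frequent candidates whose left and right contexts both vary,
    and that are not already in the known vocabulary."""
    freq, left, right = neighbor_counts(text)
    found = []
    for cand, count in freq.items():
        if cand in known_words or count < min_freq:
            continue
        if min(entropy(left[cand]), entropy(right[cand])) >= min_entropy:
            found.append(cand)
    return found

print(discover_words("xABy zABw qABr", known_words=set()))
```

A candidate with high entropy on both sides appears in many different contexts, which suggests it is a free-standing word rather than a fragment of a longer one.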
Further, the data set construction unit includes:
the confusion data set generating subunit is used for returning the contract-domain data to the developer side so that the developer side annotates and proofreads the contract-domain data to obtain the confusion data set;
the error sentence identification subunit is used for identifying the error sentences in the contract-domain data and the error type corresponding to each error sentence;
the initial data set generating subunit is used for performing error correction processing on the error sentences according to the error types and the confusion data set to obtain an initial data set, wherein the initial data set comprises a plurality of contract document data sets;
and the contract document data set screening subunit is used for calculating the perplexity of each contract document data set in the initial data set and screening the contract document data sets based on the perplexity to obtain the contract error correction data set.
Further, the to-be-processed error sentence identification module 62 includes:
the to-be-processed contract data acquiring unit is used for acquiring the to-be-processed contract data;
the error word judgment unit is used for performing word segmentation processing and entity recognition processing on the to-be-processed contract data and judging whether error words exist in the to-be-processed contract data;
the error word frequency counting unit is used for, if error words exist, counting the frequency of each error word and judging whether the error word belongs to the out-of-vocabulary words in the out-of-vocabulary word lexicon;
and the to-be-processed error sentence judging unit is used for, if the error word belongs to the out-of-vocabulary words and its frequency exceeds a preset threshold, acquiring the sentence containing the error word as a to-be-processed error sentence.
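The decision made by these units can be sketched as a small filter over already segmented sentences. The source does not state how a word is first judged to be an error word, so this sketch assumes, purely for illustration, that a token is suspect when it is absent from a hypothetical `vocabulary`; segmentation and entity recognition are assumed to have been done upstream, and `find_error_sentences` is a hypothetical name.

```python
from collections import Counter

def find_error_sentences(sentences, vocabulary, oov_lexicon, threshold=1):
    """Return the sentences that contain a suspect word which is in the
    out-of-vocabulary word lexicon and occurs more than `threshold` times.
    Assumption: "suspect" means "not in `vocabulary`"."""
    suspects = Counter(
        token
        for sentence in sentences
        for token in sentence
        if token not in vocabulary
    )
    flagged = {
        word for word, count in suspects.items()
        if word in oov_lexicon and count > threshold
    }
    return [s for s in sentences if any(token in flagged for token in s)]

sentences = [
    ["party", "A", "shall", "paye", "the", "fee"],
    ["party", "B", "shall", "paye", "on", "time"],
    ["party", "A", "shall", "deliver", "the", "goods"],
]
vocabulary = {"party", "A", "B", "shall", "the", "fee", "on", "time",
              "deliver", "goods"}
print(find_error_sentences(sentences, vocabulary, oov_lexicon={"paye"}))
```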
Further, the error type identification module 63 includes:
the labeled sequence acquisition unit is used for acquiring the correct contract sentences and the labeled contract sentences based on the contract document error correction data set;
the error type label sequence generating unit is used for inputting the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences into a sequence labeling model for training, so as to identify the error type label sequence corresponding to each to-be-processed error sentence;
and the error type identification unit is used for generating the error type corresponding to the to-be-processed error sentence based on the error type label sequence.
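How an error type label sequence might be turned into error types and positions can be sketched as follows. The label scheme is an assumption made for illustration only: one label per character, with "O" for a correct character and "R" for a redundant one; the source does not specify the labels the sequence labeling model emits, and `decode_tags` and `delete_spans` are hypothetical names. The second function shows how the decoded start and end positions feed the deletion used for redundancy errors.

```python
from itertools import groupby

def decode_tags(tags):
    """Group consecutive identical labels into (label, start, end) spans,
    skipping the "O" label. `end` is exclusive."""
    spans, pos = [], 0
    for label, group in groupby(tags):
        length = len(list(group))
        if label != "O":
            spans.append((label, pos, pos + length))
        pos += length
    return spans

def delete_spans(sentence, spans, label="R"):
    """Delete the characters of every span carrying `label`.
    Spans are processed right to left so earlier offsets stay valid."""
    for span_label, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        if span_label == label:
            sentence = sentence[:start] + sentence[end:]
    return sentence

sentence = "the the court"
tags = ["R"] * 4 + ["O"] * 9          # first "the " marked as redundant
spans = decode_tags(tags)
print(spans)
print(delete_spans(sentence, spans))
```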
Further, the correct contract document generating module 64 includes:
the first error correction unit is used for, if the error type is a missing error, acquiring filling words from the pre-trained language model and correcting the to-be-processed error sentence with the filling words to obtain the correct contract document;
the second error correction unit is used for, if the error type is a redundancy error, identifying the start error position and the end error position in the to-be-processed error sentence and deleting the characters from the start error position to the end error position to obtain the correct contract document;
and the third error correction unit is used for, if the error type is an institution name error, constructing a target prefix trie from the legal knowledge graph, screening out candidate entities in the target prefix trie according to edit distance, and correcting the to-be-processed error sentence with the candidate entities to obtain the correct contract document.
Further, the first error correction unit includes:
the pre-trained language model fine-tuning subunit is used for, if the error type is a missing error, acquiring a contract-domain corpus and fine-tuning the pre-trained language model on the contract-domain corpus;
the filling word set generating subunit is used for predicting a filling word set with the pre-trained language model through a beam search algorithm, wherein the filling word set comprises a plurality of filling words;
the filling data set generating subunit is used for identifying the missing position in the to-be-processed error sentence and filling the missing position with each of the plurality of filling words to obtain a filling data set, wherein the filling data set comprises a plurality of pieces of filling data;
and the target perplexity calculation subunit is used for calculating the perplexity of the filling data in the filling data set to obtain a target perplexity, and screening the filling data based on the target perplexity to obtain the correct contract document.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 8, fig. 8 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 7 comprises a memory 71, a processor 72 and a network interface 73, which are communicatively connected to each other by a system bus. It is noted that only a computer device 7 having the three components memory 71, processor 72 and network interface 73 is shown; it is to be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 71 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 71 may be an internal storage unit of the computer device 7, such as a hard disk or an internal memory of the computer device 7. In other embodiments, the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), and the like provided on the computer device 7. Of course, the memory 71 may also comprise both an internal storage unit of the computer device 7 and an external storage device thereof. In this embodiment, the memory 71 is generally used to store the operating system installed on the computer device 7 and various types of application software, such as the program code of the error correction method for contract document grammar. Further, the memory 71 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 72 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to run the program code stored in the memory 71 or to process data, for example to run the program code of the error correction method for contract document grammar described above, so as to implement the various embodiments of that method.
The network interface 73 may comprise a wireless network interface or a wired network interface, and the network interface 73 is typically used to establish a communication connection between the computer device 7 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing a computer program that is executable by at least one processor, so as to cause the at least one processor to perform the steps of the error correction method for contract document grammar described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods of the embodiments of the present application.
It is to be understood that the embodiments described above are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. The present application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (9)

1. A method for error correction of contract document grammar, comprising:
collecting contract-domain data, and creating a contract document error correction data set based on the contract-domain data, wherein the contract document error correction data set comprises an out-of-vocabulary word lexicon, a confusion data set and a contract error correction data set;
acquiring to-be-processed contract data, and identifying to-be-processed error sentences in the to-be-processed contract data;
acquiring correct contract sentences and labeled contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences into a preset model for training, so as to identify an error type corresponding to each to-be-processed error sentence;
acquiring an error correction mode corresponding to the error type, and performing error correction processing on the to-be-processed error sentence through the error correction mode to obtain a correct contract document;
crawling the paraphrase of each word in the correct contract document to generate an error-correction paraphrase knowledge base;
wherein the collecting contract-domain data and creating a contract document error correction data set based on the contract-domain data comprises:
acquiring the contract-domain data, and combining the characters in the contract-domain data in pairs to obtain candidate words;
constructing a prefix trie and a suffix trie of the candidate words, wherein the prefix trie and the suffix trie take single characters as nodes, and each node records the occurrence frequency of the word formed from a root node to a current node;
acquiring frequency lists of the prefix trie and the suffix trie, and calculating the left and right information entropy of each candidate word and of each segment formed by the candidate words;
screening out-of-vocabulary words from the contract-domain data based on the left and right information entropy and the frequency lists to obtain the out-of-vocabulary word lexicon;
constructing the confusion data set and the contract error correction data set of the contract-domain data.
2. The method for error correction of contract document grammar according to claim 1, wherein the constructing the confusion data set and the contract error correction data set of the contract-domain data comprises:
returning the contract-domain data to a developer side so that the developer side annotates and proofreads the contract-domain data to obtain the confusion data set;
identifying error sentences in the contract-domain data and an error type corresponding to each error sentence;
performing error correction processing on the error sentences according to the error types and the confusion data set to obtain an initial data set, wherein the initial data set comprises a plurality of contract document data sets;
and calculating the perplexity of each contract document data set in the initial data set, and screening the contract document data sets based on the perplexity to obtain the contract error correction data set.
3. The method for error correction of contract document grammar according to claim 1, wherein the acquiring to-be-processed contract data and identifying to-be-processed error sentences in the to-be-processed contract data comprises:
acquiring the to-be-processed contract data;
performing word segmentation processing and entity recognition processing on the to-be-processed contract data, and judging whether error words exist in the to-be-processed contract data;
if error words exist, counting the frequency of each error word, and judging whether the error word belongs to the out-of-vocabulary words in the out-of-vocabulary word lexicon;
and if the error word belongs to the out-of-vocabulary words and the frequency of the error word exceeds a preset threshold, acquiring the sentence containing the error word as the to-be-processed error sentence.
4. The method for error correction of contract document grammar according to claim 1, wherein the acquiring correct contract sentences and labeled contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences into a preset model for training so as to identify an error type corresponding to each to-be-processed error sentence comprises:
acquiring the correct contract sentences and the labeled contract sentences based on the contract document error correction data set;
inputting the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences into a sequence labeling model for training, so as to identify an error type label sequence corresponding to the to-be-processed error sentence;
and generating the error type corresponding to the to-be-processed error sentence based on the error type label sequence.
5. The method according to any one of claims 1 to 4, wherein the error types include a missing error, a redundancy error and an institution name error, and the acquiring an error correction mode corresponding to the error type and performing error correction processing on the to-be-processed error sentence through the error correction mode to obtain a correct contract document comprises:
if the error type is the missing error, acquiring filling words from a pre-trained language model, and performing error correction processing on the to-be-processed error sentence with the filling words to obtain the correct contract document;
if the error type is the redundancy error, identifying a start error position and an end error position in the to-be-processed error sentence, and deleting the characters from the start error position to the end error position to obtain the correct contract document;
and if the error type is the institution name error, constructing a target prefix trie from a legal knowledge graph, screening out candidate entities in the target prefix trie according to edit distance, and performing error correction processing on the to-be-processed error sentence with the candidate entities to obtain the correct contract document.
6. The method according to claim 5, wherein the, if the error type is the missing error, acquiring filling words from a pre-trained language model and performing error correction processing on the to-be-processed error sentence with the filling words to obtain the correct contract document comprises:
if the error type is the missing error, acquiring a contract-domain corpus, and fine-tuning the pre-trained language model on the contract-domain corpus;
predicting a filling word set with the pre-trained language model through a beam search algorithm, wherein the filling word set comprises a plurality of the filling words;
identifying a missing position in the to-be-processed error sentence, and filling the missing position with each of the plurality of filling words to obtain a filling data set, wherein the filling data set comprises a plurality of pieces of filling data;
and calculating the perplexity of the filling data in the filling data set to obtain a target perplexity, and screening the filling data based on the target perplexity to obtain the correct contract document.
7. An apparatus for error correction of contract document grammar, comprising:
a contract document error correction data set creating module, used for collecting contract-domain data and creating a contract document error correction data set based on the contract-domain data, wherein the contract document error correction data set comprises an out-of-vocabulary word lexicon, a confusion data set and a contract error correction data set;
a to-be-processed error sentence identification module, used for acquiring to-be-processed contract data and identifying to-be-processed error sentences in the to-be-processed contract data;
an error type identification module, used for acquiring correct contract sentences and labeled contract sentences based on the contract document error correction data set, and inputting the correct contract sentences, the labeled contract sentences and the to-be-processed error sentences into a preset model for training, so as to identify an error type corresponding to each to-be-processed error sentence;
a correct contract document generating module, used for acquiring an error correction mode corresponding to the error type and performing error correction processing on the to-be-processed error sentence in the error correction mode to obtain a correct contract document;
an error-correction paraphrase knowledge base generating module, used for crawling the paraphrase of each word in the correct contract document to generate an error-correction paraphrase knowledge base;
wherein the contract document error correction data set creating module comprises:
a candidate word generating unit, used for acquiring the contract-domain data and combining the characters in the contract-domain data in pairs to obtain candidate words;
a trie construction unit, used for constructing a prefix trie and a suffix trie of the candidate words, wherein the prefix trie and the suffix trie take single characters as nodes, and each node records the occurrence frequency of the word formed from a root node to a current node;
an information entropy calculation unit, used for acquiring frequency lists of the prefix trie and the suffix trie and calculating the left and right information entropy of each candidate word and of each segment formed by the candidate words;
an out-of-vocabulary word screening unit, used for screening out-of-vocabulary words from the contract-domain data based on the left and right information entropy and the frequency lists to obtain the out-of-vocabulary word lexicon;
a data set construction unit, used for constructing the confusion data set and the contract error correction data set of the contract-domain data.
8. A computer device, characterized by comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor, when executing the computer program, implements the method for error correction of contract document grammar according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, implements the method for error correction of contract document grammar according to any one of claims 1 to 6.
CN202211238213.9A 2022-10-11 2022-10-11 Error correction method and device for grammars of contracting documents, computer equipment and storage medium Active CN115310434B (en)


Publications (2)

Publication Number Publication Date
CN115310434A CN115310434A (en) 2022-11-08
CN115310434B true CN115310434B (en) 2023-01-06

Family

ID=83868204


Country Status (1)

Country Link
CN (1) CN115310434B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022095563A1 (en) * 2020-11-06 2022-05-12 北京世纪好未来教育科技有限公司 Text error correction adaptation method and apparatus, and electronic device, and storage medium
CN114742037A (en) * 2020-12-23 2022-07-12 广州视源电子科技股份有限公司 Text error correction method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112149406B (en) * 2020-09-25 2023-09-08 中国电子科技集团公司第十五研究所 Chinese text error correction method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Detection and Correction of Real-word Errors in Bangla Language; Md. Mashod Rana et al.; International Conference on Bangla Speech and Language Processing (ICBSLP); 20180921; pp. 1-5 *
Design of an English grammar error correction system based on deep learning algorithms; 田静 et al.; 《自动化与仪器仪表》; 20220930 (No. 9); pp. 128-131 *

Also Published As

Publication number Publication date
CN115310434A (en) 2022-11-08

Similar Documents

Publication Number Title
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
US11106879B2 (en) Multilingual translation device and method
CN111177184A (en) Structured query language conversion method based on natural language and related equipment thereof
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
WO2022142011A1 (en) Method and device for address recognition, computer device, and storage medium
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN111695355A (en) Address text recognition method, device, medium and electronic equipment
CN110442859B (en) Labeling corpus generation method, device, equipment and storage medium
CN110795938B (en) Text sequence word segmentation method, device and storage medium
US11397855B2 (en) Data standardization rules generation
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN117076653A (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN112784009B (en) Method and device for mining subject term, electronic equipment and storage medium
CN111353311A (en) Named entity identification method and device, computer equipment and storage medium
CN113642316A (en) Chinese text error correction method and device, electronic equipment and storage medium
CN111783710B (en) Information extraction method and system for medical photocopy
CN114580424A (en) Labeling method and device for named entity identification of legal document
CN116822464A (en) Text error correction method, system, equipment and storage medium
CN111401012A (en) Text error correction method, electronic device and computer readable storage medium
CN113297852B (en) Medical entity word recognition method and device
WO2024066903A1 (en) Method and device for recognizing pharmaceutical-industry target object to be recognized, and medium
CN115310434B (en) Error correction method and device for grammars of contracting documents, computer equipment and storage medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant