CN110110334B

CN110110334B - Remote consultation record text error correction method based on natural language processing

Info

Publication number: CN110110334B
Application number: CN201910379327.7A
Authority: CN
Inventors: 赵杰; 翟运开; 石金铭; 崔莉亚; 陈昊天; 李明原; 宋晓琴; 王振博
Original assignee: Zhengzhou University
Current assignee: Zhengzhou University
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2022-09-13
Anticipated expiration: 2039-05-08
Also published as: CN110110334A

Abstract

The invention discloses a remote consultation record text error correction method based on natural language processing, which belongs to the technical field of big data. The method uses a central server and a plurality of clients, with a preprocessing module, a database, an error checking module and an error correction module established in the central server, and solves the technical problem of automatically checking and correcting errors in text.

Description

Remote consultation record text error correction method based on natural language processing

Technical Field

The invention belongs to the technical field of big data, and particularly relates to a remote consultation record text error correction method based on natural language processing.

Background

With the development of computer and network technology, remote medical consultation has become an important component of the modern medical system. The consultation result sheet is generally typed in by staff of the remote consultation department, and extra characters, missing characters and misspellings arise during input, so dedicated personnel or a system is needed to check and correct the texts. At present, checking the recorded consultation opinions in the remote medical consultation process is still mainly done by hand, which costs both time and labor. Automatic checking of the text produced during remote medical consultation is therefore of great significance to telemedicine.

Disclosure of Invention

The invention aims to provide a remote consultation record text error correction method based on natural language processing that solves the technical problem of automatically checking and correcting errors in text.

In order to achieve the purpose, the invention adopts the following technical scheme:

a remote consultation record text error correction method based on natural language processing comprises the following steps:

step 1: deploying a central server and a plurality of clients, establishing a preprocessing module, a database, an error checking module and an error correcting module in the central server, and communicating all the clients with the central server through the Internet;

step 2: inputting a plurality of original texts through any client, sending the original texts to a central server by the client, storing all the original texts in a database by the central server, and establishing a training database for storing and accumulating the original texts in the database;

step 3: classifying the original texts in the training database into completely correct texts and erroneous texts, segmenting both into words and into characters, and labeling the training corpus according to the position and type of each error in the original texts, with label C denoting correct, label R denoting redundant, label D denoting missing, label O denoting substitution (a wrong character or word used in place of the intended one), and label M denoting missing;

training a conditional random field (CRF) on the labeled training corpus to obtain a training model;

step 4: inputting a text to be processed through any client, the client transmitting the text to be processed to the central server, and the preprocessing module in the central server preprocessing the text to be processed as follows:

step A1: segmenting the text to be processed into words and into characters;

step A2: marking the segmented words and characters of the text to be processed as a test corpus;

step 5: the error checking module in the central server checks the text to be processed for errors as follows:

step B1: checking the test corpus of the text to be processed for errors with the training model and the CRF to obtain a CRF error checking result;

step B2: traversing all scattered strings in the text to be processed and performing n-gram scattered-string error checking on the text to obtain an n-gram scattered-string error checking result;

step B3: fusing the CRF error checking result with the n-gram scattered-string error checking result and labeling the text to be processed to obtain the final text error checking result;

step 6: the central server inputs the final result of the text error checking obtained in the step 5 into an error correction module, and the error correction module corrects the final result of the text error checking, wherein the steps are as follows:

step C1: constructing a language model to correct missing errors;

step C2: directly deleting the characters or words that carry a redundancy label;

step C3: correcting the words in the text that carry a substitution-error label by using a homophone dictionary, completing the automatic error correction of the text;

step C4: outputting error correction text;

step 7: establishing a main text proofreading interface, a text error checking interface and a text error correction interface at the client; the central server packages and sends the final text error checking result, the corrected text and the text to be processed to the client, and the client displays the text to be processed, the final text error checking result and the corrected text on the main text proofreading interface, the text error checking interface and the text error correction interface respectively.
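Steps B3 and C2 above can be illustrated with a minimal Python sketch. The source does not state how the two error checking results are fused, so the rule used here (a token is flagged if either checker flags it, and the CRF label wins when the two disagree) is an assumption, as are the function names `fuse_labels` and `drop_redundant` and the sample tokens.

```python
from typing import List

CORRECT = "C"    # label C: the token is correct
REDUNDANT = "R"  # label R: the token is redundant


def fuse_labels(crf_labels: List[str], ngram_labels: List[str]) -> List[str]:
    """Merge two per-token label sequences into one (step B3).

    Assumed rule, not taken from the source: keep the CRF label unless it
    is C, in which case fall back to the n-gram checker's label.
    """
    if len(crf_labels) != len(ngram_labels):
        raise ValueError("label sequences must have the same length")
    return [crf if crf != CORRECT else ngram
            for crf, ngram in zip(crf_labels, ngram_labels)]


def drop_redundant(tokens: List[str], labels: List[str]) -> List[str]:
    """Delete every token that carries the redundancy label R (step C2)."""
    return [tok for tok, lab in zip(tokens, labels) if lab != REDUNDANT]


# Hypothetical example: the word for "patient" was typed twice.
tokens = ["患者", "患者", "血压", "偏高"]
crf_result = ["C", "R", "C", "C"]
ngram_result = ["C", "C", "C", "C"]

fused = fuse_labels(crf_result, ngram_result)
cleaned = drop_redundant(tokens, fused)
```

Other fusion rules, such as requiring both checkers to agree, would trade recall for precision; the sketch only fixes one choice to make the data flow concrete.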

Preferably, when step 3 is executed, the SnowNLP library is used to segment the completely correct texts and the erroneous texts into words and characters.

Preferably, when step C1 is executed, a trigram language model is selected, in which the word $w_i$ at the i-th position depends only on the two preceding words $w_{i-1}$ and $w_{i-2}$; the formula is:

$$P(S) = \prod_{i} P(w_i \mid w_{i-n+1} \cdots w_{i-1}), \quad n = 3$$

wherein P represents a conditional probability; S represents the current sentence or symbol string; and n represents the number of consecutive symbols considered, which is 3.
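As a hedged illustration of the trigram model, the conditional probability of a word given its two predecessors can be estimated from counts in a corpus. The maximum-likelihood estimate, the `<s>` padding symbol, the function names and the two-sentence corpus below are assumptions for the sketch; the source does not specify how the model is estimated.

```python
from collections import Counter
from typing import List, Tuple

PAD = "<s>"  # assumed sentence-start padding symbol


def train_trigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count trigrams and their two-word contexts over segmented sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = [PAD, PAD] + words
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            ctx[(padded[i - 2], padded[i - 1])] += 1
    return tri, ctx


def trigram_prob(tri: Counter, ctx: Counter, w2: str, w1: str, w: str) -> float:
    """Estimate P(w | w2 w1) as count(w2, w1, w) / count(w2, w1)."""
    seen = ctx[(w2, w1)]
    return tri[(w2, w1, w)] / seen if seen else 0.0


def sentence_prob(tri: Counter, ctx: Counter, words: List[str]) -> float:
    """Multiply the trigram probabilities of every word in the sentence."""
    padded = [PAD, PAD] + words
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(tri, ctx, padded[i - 2], padded[i - 1], padded[i])
    return p


# Hypothetical two-sentence training corpus.
corpus = [["建议", "复查", "血常规"], ["建议", "复查", "心电图"]]
tri, ctx = train_trigram(corpus)
p_word = trigram_prob(tri, ctx, "建议", "复查", "血常规")
p_sent = sentence_prob(tri, ctx, ["建议", "复查", "血常规"])
```

A practical model would add smoothing so that unseen trigrams do not force the sentence probability to zero.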

Preferably, correcting missing errors in the final text error checking result with the trigram language model requires a corpus correctly labeled with error types and error positions, and comprises the following steps:

step S1: segmenting the input final text error checking result into words and searching for the missing label M;

step S2: extracting the words immediately before and after the label M and recording them in a missing text;

step S3: traversing the constructed trigram language model and judging whether the words recorded in the missing text are the same as the first and third words of some entry in the model: if not, the error correction fails, the search continues with the next missing label, and steps S1 to S3 are repeated until all missing texts have been processed; if so, going to step S5;

step S5: judging whether the matching entry is unique: if it is unique, the second word of that entry is the missing word; if not, the second word of the entry with the higher word frequency is selected as the missing word; the error correction then succeeds, the search continues with the next missing label, and steps S1 to S5 are repeated until all missing texts have been corrected.
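Steps S1 to S5 can be sketched as a lookup over trigram counts. The representation is an assumption: the checked text is taken to be a token list in which a hypothetical placeholder `<M>` marks each missing position, and the trigram model is taken to be a table of `(first, second, third)` counts. Neither detail is given in the source.

```python
from collections import Counter
from typing import List

MISSING = "<M>"  # hypothetical placeholder for a position labeled M


def fill_missing(tokens: List[str], trigram_counts: Counter) -> List[str]:
    """Fill each missing position with the most frequent middle word.

    For every placeholder, take the neighbouring words (step S2), collect
    trigrams whose first and third words match them (step S3), and choose
    the second word of the most frequent match (step S5). Positions with
    no matching trigram are left unchanged, i.e. the correction fails.
    """
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok != MISSING or i == 0 or i == len(out) - 1:
            continue
        prev_word, next_word = out[i - 1], out[i + 1]
        candidates = {
            second: count
            for (first, second, third), count in trigram_counts.items()
            if first == prev_word and third == next_word
        }
        if candidates:
            out[i] = max(candidates, key=candidates.get)
    return out


# Hypothetical trigram table and inputs.
counts = Counter({
    ("建议", "复查", "血常规"): 3,
    ("建议", "检查", "血常规"): 1,
    ("患者", "血压", "偏高"): 2,
})
fixed = fill_missing(["建议", MISSING, "血常规"], counts)
unfixed = fill_missing(["患者", MISSING, "血常规"], counts)
```

When a single trigram matches, `max` returns its middle word, which covers the "unique" branch of step S5 without a separate case.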

Preferably, when step C3 is executed, correcting the words in the text that carry a substitution-error label by using the homophone dictionary comprises the following steps:

step T1: segmenting the text into words, traversing the segmented text to search for the substitution-error label O, where the word immediately preceding the label is the substitution error, annotating that word with its pinyin, and recording the pinyin;

step T2: judging whether the pinyin exists in the constructed homophone dictionary: if not, the error correction fails; if so, the erroneous word has homophones, and these homophones are taken as the candidate words for the substitution error;

step T3: substituting each homophone candidate into the original sentence in turn, calculating the probability of each resulting sentence, sorting the sentences in descending order of probability, and taking the homophone in the top-ranked sentence as the corrected word.
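A minimal sketch of steps T1 to T3 follows. Several details are assumptions: labels are stored in a list parallel to the tokens, pinyin comes from a small hand-written mapping instead of a pinyin library, and sentences are ranked by a toy bigram-count score standing in for the sentence probability. All names and data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def bigram_score(sentence: List[str],
                 bigram_counts: Dict[Tuple[str, str], int]) -> int:
    """Toy stand-in for sentence probability: sum of adjacent-pair counts."""
    return sum(bigram_counts.get(pair, 0)
               for pair in zip(sentence, sentence[1:]))


def correct_substitutions(tokens: List[str],
                          labels: List[str],
                          pinyin_of: Dict[str, str],
                          homophones: Dict[str, List[str]],
                          score: Callable[[List[str]], float]) -> List[str]:
    """Replace each word labeled O with its best-scoring homophone."""
    out = list(tokens)
    for i, label in enumerate(labels):
        if label != "O":
            continue
        pinyin = pinyin_of.get(out[i])               # step T1: look up pinyin
        candidates = homophones.get(pinyin, []) if pinyin else []
        if not candidates:                           # step T2: no homophones
            continue
        # Step T3: substitute every candidate and keep the top-ranked one.
        ranked = sorted(candidates,
                        key=lambda c: score(out[:i] + [c] + out[i + 1:]),
                        reverse=True)
        out[i] = ranked[0]
    return out


# Hypothetical data: "偏糕" is a mistyped homophone of "偏高".
pinyin_of = {"偏糕": "pian gao", "偏高": "pian gao"}
homophones = {"pian gao": ["偏糕", "偏高"]}
counts = {("血压", "偏高"): 5}

corrected = correct_substitutions(
    ["患者", "血压", "偏糕"], ["C", "C", "O"],
    pinyin_of, homophones, lambda s: bigram_score(s, counts))
```

Passing the scorer as a function keeps the candidate search independent of the language model, so a trigram probability could be swapped in without changing `correct_substitutions`.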

The remote consultation record text error correction method based on natural language processing according to the invention solves the technical problem of automatically checking and correcting errors in text. It checks errors with a CRF and n-gram scattered strings and corrects each error according to its specific cause, which gives strong error correction capability and higher accuracy than the prior art, relieves the workload of staff in the remote diagnosis department, and improves working efficiency.

Drawings

FIG. 1 is a general flow chart of the present invention;

FIG. 2 is a CRF error checking flow chart of the present invention;

FIG. 3 is a flow chart of the n-gram scattered-string error checking of the present invention;

FIG. 4 is a flow chart of the construction of the language model of the present invention;

FIG. 5 is a flow chart of construction of the homophone dictionary of the present invention.

Detailed Description

As shown in FIGS. 1 to 5, a remote consultation record text error correction method based on natural language processing includes the following steps:

step 2: inputting a plurality of original texts through any client, the client sending the original texts to the central server, the central server storing all the original texts in the database, and establishing a training database in the database for storing and accumulating the original texts;

step 3: classifying the original texts in the training database into completely correct texts and erroneous texts, segmenting both into words and into characters, and labeling the training corpus according to the position and type of each error in the original texts, with label C denoting correct, label R denoting redundant, label D denoting missing, label O denoting substitution (a wrong character or word used in place of the intended one), and label M denoting missing;

step B1: checking the test corpus of the text to be processed for errors with the training model and the CRF to obtain a CRF error checking result;

the conditional random field (CRF) is a discriminative probabilistic model. The CRF is used to analyze a text as a word sequence or a character sequence, where $X = \{x_1, x_2, \ldots, x_n\}$ denotes the observation sequence and $Y = \{y_1, y_2, \ldots, y_n\}$ denotes the label sequence, and the conditional random field yields the function:

$$p(Y \mid X) = \frac{1}{z_x} \exp\Big(\sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, X, i)\Big)$$

where i denotes the position in the sentence of the observation sequence, $z_x$ denotes the normalization over the labeled observation sequence, p denotes the probability of the predicted error type, $\lambda$ denotes the assigned weight, j indexes the feature functions, and f denotes a feature function;
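To make the roles of the weights, feature functions and normalization concrete, the sketch below computes the probability of a label sequence under a linear-chain CRF by brute-force enumeration. This is an illustration only: the two feature functions, their weights, the `<start>` symbol and the two-label set are invented for the example, and a real system would learn the weights and use dynamic programming instead of enumeration.

```python
import math
from itertools import product
from typing import Callable, List, Sequence

# f_j(y_prev, y_cur, x, i): a feature function over adjacent labels.
Feature = Callable[[str, str, Sequence[str], int], float]


def sequence_score(y: Sequence[str], x: Sequence[str],
                   features: List[Feature], weights: List[float]) -> float:
    """Sum of weighted feature values over all positions i and features j."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "<start>"
        for lam, f in zip(weights, features):
            total += lam * f(y_prev, y[i], x, i)
    return total


def crf_prob(y: Sequence[str], x: Sequence[str], labels: List[str],
             features: List[Feature], weights: List[float]) -> float:
    """p(y | x) = exp(score(y, x)) / z, with z summed over all labelings."""
    z = sum(math.exp(sequence_score(cand, x, features, weights))
            for cand in product(labels, repeat=len(x)))
    return math.exp(sequence_score(y, x, features, weights)) / z


def f_repeat(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    """Fires when a token repeats its predecessor and is labeled R."""
    return 1.0 if y_cur == "R" and i > 0 and x[i] == x[i - 1] else 0.0


def f_correct(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    """Fires whenever a token is labeled C."""
    return 1.0 if y_cur == "C" else 0.0


labels = ["C", "R"]
features = [f_repeat, f_correct]
weights = [3.0, 1.0]  # hypothetical weights, not learned
x = ["患者", "患者", "血压"]

all_labelings = list(product(labels, repeat=len(x)))
best = max(all_labelings,
           key=lambda y: crf_prob(y, x, labels, features, weights))
total = sum(crf_prob(y, x, labels, features, weights) for y in all_labelings)
```

With these hand-picked weights the repeated token is the only position where label R outscores label C, so the most probable labeling marks it as redundant.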

n-gram scattered-string error checking infers the structural relationship of a sentence from statistics on the probability that n symbols occur together in a symbol string of a natural language. Let $w_i$ denote the language symbol about to appear; when predicting the probability of $w_i$, the preceding n-1 symbols are taken into account, which is expressed as $p(w_i \mid w_{i-n+1} w_{i-n+2} w_{i-n+3} \cdots w_{i-1})$, where n is 3.
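The following sketch shows one way n-gram scattered-string checking could work. It assumes that a "scattered string" is a run of consecutive single-character tokens left over after word segmentation, and that a position is suspicious when the trigram ending at it was never seen in the reference counts. Both the definition and the threshold are assumptions; the source does not spell them out.

```python
from typing import Dict, List, Tuple

PAD = "<s>"  # assumed sentence-start padding symbol


def scattered_runs(tokens: List[str], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges of consecutive one-character tokens."""
    runs, start = [], None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a final run
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


def flag_suspects(tokens: List[str],
                  trigram_counts: Dict[Tuple[str, str, str], int],
                  threshold: int = 1) -> List[int]:
    """Flag positions inside scattered runs whose trigram count is too low."""
    padded = [PAD, PAD] + tokens
    suspects = []
    for start, end in scattered_runs(tokens):
        for i in range(start, end):
            trigram = tuple(padded[i:i + 3])  # tokens i-2, i-1, i
            if trigram_counts.get(trigram, 0) < threshold:
                suspects.append(i)
    return suspects


# Hypothetical input: "复查" mistyped as "复察", so the segmenter splits it.
tokens = ["建议", "复", "察", "血常规"]
counts = {(PAD, "建议", "复"): 4}

runs = scattered_runs(tokens)
suspects = flag_suspects(tokens, counts)
```

Restricting the n-gram test to scattered runs keeps the check cheap, since well-formed multi-character words are skipped entirely.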

Step B3: fusing a conditional random field error checking result and an n-gram scattered string error checking result, and labeling the text to be processed to obtain a final text error checking result;

step C1: constructing a language model to correct missing errors;

step C4: outputting error correction text;

Preferably, when step C1 is executed, a trigram language model is selected, in which the word $w_i$ at the i-th position depends only on the two preceding words $w_{i-1}$ and $w_{i-2}$; the formula is:

$$P(S) = \prod_{i} P(w_i \mid w_{i-n+1} \cdots w_{i-1}), \quad n = 3$$

step S1: segmenting the input final text error checking result into words and searching for the missing label M;

step S3: traversing the constructed trigram language model and judging whether the words recorded in the missing text are the same as the first and third words of some entry in the model: if not, the error correction fails, the search continues with the next missing label, and steps S1 to S3 are repeated until all missing texts have been processed; if so, going to step S5;

step T2: judging whether the pinyin exists in the constructed homophone dictionary: if not, the error correction fails; if so, the erroneous word has homophones, and these homophones are taken as the candidate words for the substitution error;

The remote consultation record text error correction method based on natural language processing according to the invention solves the technical problem of automatically checking and correcting errors in text.

Claims

1. A remote consultation record text error correction method based on natural language processing, characterized in that the method comprises the following steps:

step 2: inputting a plurality of original texts through any client, sending the original texts to a central server by the client, storing all the original texts into a database by the central server, and establishing a training database for storing and accumulating the original texts in the database;

step 3: classifying the original texts in the training database into completely correct texts and erroneous texts, segmenting both into words and into characters, and labeling the training corpus according to the position and type of each error in the original texts, with label C denoting correct, label R denoting redundant, label D denoting missing, label O denoting substitution (a wrong character or word used in place of the intended one), and label M denoting missing;

step B2: traversing all scattered strings in the text to be processed, and carrying out n-gram scattered string error checking on the text to be processed to obtain an n-gram scattered string error checking result;

step B3: fusing the conditional random field error checking result and the n-gram scattered string error checking result, and labeling the text to be processed to obtain a final text error checking result;

step C1: constructing a language model to correct missing errors;

step C4: outputting error correction text;

step 7: establishing a main text proofreading interface, a text error checking interface and a text error correction interface at the client; the central server packages and sends the final text error checking result, the corrected text and the text to be processed to the client, and the client displays the text to be processed, the final text error checking result and the corrected text on the main text proofreading interface, the text error checking interface and the text error correction interface respectively.

2. The remote consultation record text error correction method based on natural language processing according to claim 1, characterized in that: when step 3 is executed, the SnowNLP library is used to segment the completely correct texts and the erroneous texts into words and characters.

3. The remote consultation record text error correction method based on natural language processing according to claim 1, characterized in that: when step C1 is executed, a trigram language model is selected, in which the word $w_i$ at the i-th position depends only on the two preceding words $w_{i-1}$ and $w_{i-2}$; the formula is:

$$P(S) = \prod_{i} P(w_i \mid w_{i-n+1} \cdots w_{i-1})$$

wherein P represents a conditional probability; S represents the current sentence or symbol string; and n represents the number of symbols.

4. The remote consultation record text error correction method based on natural language processing according to claim 3, characterized in that: correcting missing errors in the final text error checking result with the trigram language model requires a corpus correctly labeled with error types and error positions, and comprises the following steps:

step S5: judging whether the matching entry is unique: if it is unique, the second word of that entry is the missing word; if not, the second word of the entry with the higher word frequency is selected as the missing word; the error correction then succeeds, the search continues with the next missing label, and steps S1 to S5 are repeated until all missing texts have been corrected.

5. The remote consultation record text error correction method based on natural language processing according to claim 1, characterized in that: when step C3 is executed, correcting the words in the text that carry a substitution-error label by using the homophone dictionary comprises the following steps:

step T1: segmenting the text into words, traversing the segmented text to search for the substitution-error label O, where the word immediately preceding the label O is the substitution error, annotating that word with its pinyin, and recording the pinyin;

step T2: judging whether the pinyin exists in the constructed homophone dictionary: if not, the error correction fails; if so, the erroneous word has homophones, and these homophones are taken as the candidate words for the substitution error;