CN110110334B - Remote consultation record text error correction method based on natural language processing - Google Patents

Info

Publication number
CN110110334B
Authority
CN
China
Prior art keywords
text
error
words
missing
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910379327.7A
Other languages
Chinese (zh)
Other versions
CN110110334A (en)
Inventor
赵杰
翟运开
石金铭
崔莉亚
陈昊天
李明原
宋晓琴
王振博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University
Priority to CN201910379327.7A
Publication of CN110110334A
Application granted
Publication of CN110110334B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H80/00: ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring

Abstract

The invention discloses a remote consultation record text error correction method based on natural language processing, which belongs to the technical field of big data. The method uses a central server and a plurality of clients, and a preprocessing module, a database, an error checking module and an error correction module are established in the central server. The method solves the technical problem of automatically checking and correcting errors in text.

Description

Remote consultation record text error correction method based on natural language processing
Technical Field
The invention belongs to the technical field of big data, and particularly relates to a remote consultation record text error correction method based on natural language processing.
Background
With the development of computer and network technology, remote medical consultation has become an important component of the modern medical system. The consultation result form is generally typed in by staff of the remote consultation department, and redundant characters, missing characters and spelling errors arise during input, so dedicated staff or a system is needed to check and correct the text. At present, checking the consultation opinion forms recorded during remote medical consultation is still mainly manual, which costs both time and labor, so automatic checking of the text produced during remote medical consultation is of great significance to the field of telemedicine.
Disclosure of Invention
The invention aims to provide a remote consultation record text error correction method based on natural language processing that solves the technical problem of automatic text error checking and correction.
To achieve this purpose, the invention adopts the following technical solution:
A remote consultation record text error correction method based on natural language processing comprises the following steps:
step 1: deploying a central server and a plurality of clients, establishing a preprocessing module, a database, an error checking module and an error correction module in the central server, all clients communicating with the central server through the Internet;
step 2: inputting a plurality of original texts through any client, the client sending the original texts to the central server, and the central server storing all the original texts in the database and establishing in the database a training database for storing and accumulating the original texts;
step 3: classifying the original texts in the training database into completely correct texts and erroneous texts, performing word segmentation and character segmentation on both, and labeling the resulting training corpus according to the error positions and error types in the original texts, where label C represents correct, label R represents redundant, label D represents missing, label O represents substitution, and label M represents missing;
calling a conditional random field (CRF) and training it on the training corpus to obtain a training model;
step 4: inputting a text to be processed through any client, the client transmitting the text to be processed to the central server, and the preprocessing module in the central server preprocessing the text to be processed as follows:
step A1: performing word segmentation and character segmentation on the text to be processed;
step A2: labeling the segmented words and characters of the text to be processed as a test corpus;
step 5: the error checking module in the central server checks the text to be processed for errors as follows:
step B1: checking the test corpus of the text to be processed for errors with the training model and the CRF to obtain a CRF error checking result;
step B2: traversing all scattered strings in the text to be processed and performing n-gram scattered string error checking on the text to obtain an n-gram scattered string error checking result;
step B3: fusing the CRF error checking result and the n-gram scattered string error checking result and labeling the text to be processed to obtain the final text error checking result;
step 6: the central server inputs the final text error checking result obtained in step 5 into the error correction module, and the error correction module corrects it as follows:
step C1: constructing a language model to correct missing errors;
step C2: directly deleting the characters or words carrying a redundancy error label;
step C3: correcting the words carrying a substitution error label using a homophone dictionary, completing automatic error correction of the text;
step C4: outputting the corrected text;
step 7: establishing a main text proofreading interface, a text error checking interface and a text error correction interface at the client; the central server packages and sends the final text error checking result, the corrected text and the text to be processed to the client, and the client displays the text to be processed, the final text error checking result and the corrected text on the main text proofreading interface, the text error checking interface and the text error correction interface respectively.
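Steps 3 to 5 treat error checking as sequence labeling: every character (or word) carries one of the labels C, R, D, O, M, and the CRF learns to predict them. The following is a minimal sketch of how labeled training rows and per-character features might be prepared. The function names, the one-neighbour window and the feature template are illustrative assumptions and are not taken from the patent text.

```python
from typing import Dict, List, Tuple

# Label set from step 3: C correct, R redundant, D missing, O substitution, M missing.
TAGS = {"C", "R", "D", "O", "M"}

def char_features(chars: List[str], i: int) -> Dict[str, str]:
    """Features for position i: the character plus one neighbour on each side
    (the window size is an assumption, not specified in the source)."""
    feats = {"char": chars[i]}
    feats["prev"] = chars[i - 1] if i > 0 else "<BOS>"
    feats["next"] = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    feats["prev+char"] = feats["prev"] + chars[i]
    feats["char+next"] = chars[i] + feats["next"]
    return feats

def to_training_rows(text: str, tags: List[str]) -> List[Tuple[Dict[str, str], str]]:
    """Pair the features of each character with its error label."""
    chars = list(text)
    if len(chars) != len(tags):
        raise ValueError("one label per character is required")
    unknown = set(tags) - TAGS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [(char_features(chars, i), tags[i]) for i in range(len(chars))]

# Hypothetical example: the second "发" is a redundant character.
rows = to_training_rows("患者发发热", ["C", "C", "C", "R", "C"])
```

Rows of this shape (a feature dictionary plus a label per position) are the usual input format for CRF toolkits, so the same `char_features` function could be reused in step A2 to build the test corpus.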
Preferably, when step 3 is executed, the SnowNLP library is used to perform word segmentation and character segmentation on the completely correct texts and the erroneous texts.
Preferably, when step C1 is executed, the language model is a trigram language model, in which the word w_i at the i-th position depends only on the two preceding words w_{i-1} and w_{i-2}; the formula is:
Figure BDA0002052813060000031
wherein P represents a conditional probability, s represents the current sentence or symbol string, and n represents the number of consecutive symbols considered, here 3.
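The formula image is not reproduced in this text. Assuming it shows the standard n-gram factorization of a sentence probability, which is consistent with the symbols defined here, it would read as follows for n = 3. The symbol m (the number of words in s) is introduced only for this sketch.

```latex
P(s) = \prod_{i=1}^{m} P\left(w_i \mid w_{i-n+1}, \ldots, w_{i-1}\right)
     = \prod_{i=1}^{m} P\left(w_i \mid w_{i-2}, w_{i-1}\right) \quad \text{for } n = 3
```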
Preferably, correcting missing errors in the final text error checking result with the trigram language model requires a corpus correctly labeled with error types and error positions, and comprises the following steps:
step S1: segmenting the input final text error checking result into words and locating the missing label M;
step S2: extracting the words immediately before and after the position labeled M and recording them in a missing text;
step S3: traversing the constructed trigram language model and judging whether the words recorded in the missing text are the same as the first and third words of some entry in the dictionary: if not, error correction fails, the search continues with the next missing label, and steps S1 to S3 are repeated until all missing texts have been processed; if so, go to step S5;
step S5: judging whether the match is unique: if unique, the second word of the matched entry is the missing word; if not, the second word of the entry with the higher word frequency is selected as the missing word; error correction then succeeds, the search continues with the next missing label, and steps S1 to S5 are repeated until all missing texts have been corrected.
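Steps S1 to S5 amount to a lookup: find trigrams whose first and third words match the neighbours of the gap, then take the most frequent middle word. A minimal sketch follows. It assumes, for illustration only, that the label M sits on the word just before the gap; the function names and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import List, Optional

def build_trigrams(corpus: List[List[str]]) -> Counter:
    """Count word trigrams in a segmented, error-free corpus."""
    counts: Counter = Counter()
    for words in corpus:
        for i in range(len(words) - 2):
            counts[(words[i], words[i + 1], words[i + 2])] += 1
    return counts

def fill_missing(prev_word: str, next_word: str, trigrams: Counter) -> Optional[str]:
    """Steps S3 and S5: match first and third words, return the most frequent middle word."""
    candidates: Counter = Counter()
    for (w1, w2, w3), freq in trigrams.items():
        if w1 == prev_word and w3 == next_word:
            candidates[w2] += freq
    if not candidates:
        return None  # correction fails; the caller moves on to the next M label
    return candidates.most_common(1)[0][0]

def correct_missing(words: List[str], tags: List[str], trigrams: Counter) -> List[str]:
    """Insert a word after every M-labeled word when a matching trigram exists.
    Assumption: M marks the word immediately before the gap."""
    out: List[str] = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        out.append(word)
        if tag == "M" and i + 1 < len(words):
            filler = fill_missing(word, words[i + 1], trigrams)
            if filler is not None:
                out.append(filler)
    return out

# Toy corpus: "出现" occurs twice between "患者" and "发热", "伴有" once.
trigrams = build_trigrams([
    ["患者", "出现", "发热"],
    ["患者", "出现", "发热"],
    ["患者", "伴有", "发热"],
])
fixed = correct_missing(["患者", "发热"], ["M", "C"], trigrams)
```

When several middle words match, the one with the higher count wins, which mirrors the frequency rule in step S5; when nothing matches, the sentence is left unchanged.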
Preferably, when step C3 is executed, the words carrying a substitution error label are corrected using the homophone dictionary as follows:
step T1: segmenting the text into words, traversing the segmented text to find the substitution error label O, converting the labeled word to pinyin and recording the pinyin, the word immediately preceding the label O being the substitution error;
step T2: judging whether the pinyin exists in the constructed homophone dictionary: if not, error correction fails; if so, the substituted word has homophones, and these homophones are taken as the substitution candidate words;
step T3: substituting each homophone candidate into the original sentence in turn, computing the probability of each resulting sentence, sorting the sentences in descending order of probability, and taking the homophone in the top-ranked sentence as the corrected word.
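Steps T1 to T3 can be sketched as a candidate ranking: look up the homophones of the labeled word, substitute each into the sentence, and keep the highest-scoring sentence. In the sketch below the pinyin lookup is a hypothetical hand-written dictionary rather than a real pinyin library, and a bigram model with add-one smoothing is swapped in for the trigram model to keep the example short. All names and data are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

def train_bigram(corpus: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams over a segmented, error-free corpus."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for words in corpus:
        padded = ["<s>"] + words
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_logprob(words: List[str], unigrams: Counter, bigrams: Counter) -> float:
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams)
    padded = ["<s>"] + words
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )

def correct_substitution(
    words: List[str],
    index: int,
    pinyin_of: Dict[str, str],
    homophones: Dict[str, List[str]],
    unigrams: Counter,
    bigrams: Counter,
) -> Optional[str]:
    """Return the best homophone for the O-labeled word at `index`, or None on failure."""
    pinyin = pinyin_of.get(words[index])
    candidates = homophones.get(pinyin, []) if pinyin else []
    if not candidates:
        return None  # step T2: pinyin not in the homophone dictionary
    ranked = sorted(
        candidates,
        key=lambda c: sentence_logprob(
            words[:index] + [c] + words[index + 1:], unigrams, bigrams
        ),
        reverse=True,  # step T3: descending order of sentence probability
    )
    return ranked[0]

# Toy data: "心率" (heart rate) typed where "心律" (heart rhythm) was meant.
unigrams, bigrams = train_bigram([["患者", "心律", "不齐"], ["心率", "正常"]])
pinyin_of = {"心率": "xinlv", "心律": "xinlv"}
homophones = {"xinlv": ["心率", "心律"]}
best = correct_substitution(
    ["患者", "心率", "不齐"], 1, pinyin_of, homophones, unigrams, bigrams
)
```

Because the original word is itself among its homophones, a sentence that was already the most probable one is left unchanged by this ranking.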
The remote consultation record text error correction method based on natural language processing of the invention solves the technical problem of automatic error checking and correction of texts. It checks errors with a CRF and n-gram scattered strings and corrects each error according to its specific type, achieving a high level of error correction capability and improved accuracy compared with the prior art. It can relieve the workload of staff in remote consultation departments and improve working efficiency.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a CRF error checking flow chart of the present invention;
FIG. 3 is a flow chart of the n-gram scattered string error checking of the present invention;
FIG. 4 is a flow chart of the construction of the language model of the present invention;
FIG. 5 is a flow chart of construction of the homophone dictionary of the present invention.
Detailed Description
As shown in FIGS. 1 to 5, a remote consultation record text error correction method based on natural language processing includes the following steps:
step 1: deploying a central server and a plurality of clients, establishing a preprocessing module, a database, an error checking module and an error correction module in the central server, all clients communicating with the central server through the Internet;
step 2: inputting a plurality of original texts through any client, the client sending the original texts to the central server, and the central server storing all the original texts in the database and establishing in the database a training database for storing and accumulating the original texts;
step 3: classifying the original texts in the training database into completely correct texts and erroneous texts, performing word segmentation and character segmentation on both, and labeling the resulting training corpus according to the error positions and error types in the original texts, where label C represents correct, label R represents redundant, label D represents missing, label O represents substitution, and label M represents missing;
calling a conditional random field (CRF) and training it on the training corpus to obtain a training model;
step 4: inputting a text to be processed through any client, the client transmitting the text to be processed to the central server, and the preprocessing module in the central server preprocessing the text to be processed as follows:
step A1: performing word segmentation and character segmentation on the text to be processed;
step A2: labeling the segmented words and characters of the text to be processed as a test corpus;
step 5: the error checking module in the central server checks the text to be processed for errors as follows:
step B1: checking the test corpus of the text to be processed for errors with the training model and the CRF to obtain a CRF error checking result;
The conditional random field (CRF) is a discriminative probabilistic model. When the text is analyzed as a word sequence or a character sequence with the CRF, X = {x_1, x_2, ..., x_n} denotes the observation sequence and Y = {y_1, y_2, ..., y_n} denotes the label sequence, and the conditional random field yields the following formulas in terms of feature functions:
Figure BDA0002052813060000051
Figure BDA0002052813060000052
where i denotes the position in the sentence of the observation sequence, Z(x) denotes the normalization factor for the labeled observation sequence, p denotes the probability of the predicted error type, λ denotes the assigned weight, j denotes the index of the feature function, and f denotes a feature function;
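The two formula images above are not reproduced in this text. Assuming they show the standard linear-chain CRF, which matches the symbols defined for step B1, the model would be written as follows, with y_{i-1} and y_i the labels at adjacent positions:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, x, i) \right)

Z(x) = \sum_{y} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, x, i) \right)
```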
step B2: traversing all scattered strings in the text to be processed and performing n-gram scattered string error checking on the text to obtain an n-gram scattered string error checking result;
n-gram scattered string error checking infers the structural relationships of a sentence from statistics on the co-occurrence probability of n symbols in a natural-language symbol string. Let w_i denote the language symbol about to appear; when predicting the probability of w_i, the preceding n-1 symbols are taken into account, expressed as p(w_i | w_{i-n+1} w_{i-n+2} w_{i-n+3} ... w_{i-1}), where n is 3.
step B3: fusing the CRF error checking result and the n-gram scattered string error checking result and labeling the text to be processed to obtain the final text error checking result;
step 6: the central server inputs the final text error checking result obtained in step 5 into the error correction module, and the error correction module corrects it as follows:
step C1: constructing a language model to correct missing errors;
step C2: directly deleting the characters or words carrying a redundancy error label;
step C3: correcting the words carrying a substitution error label using a homophone dictionary, completing automatic error correction of the text;
step C4: outputting the corrected text;
step 7: establishing a main text proofreading interface, a text error checking interface and a text error correction interface at the client; the central server packages and sends the final text error checking result, the corrected text and the text to be processed to the client, and the client displays the text to be processed, the final text error checking result and the corrected text on the main text proofreading interface, the text error checking interface and the text error correction interface respectively.
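Steps B2 and B3 can be illustrated with a small sketch. The text does not define a scattered string or the fusion rule, so two assumptions are made here: a scattered string is taken to be a run of consecutive single-character tokens left by the segmenter, and fusion is taken to be a simple union of the positions flagged by either checker. The threshold and all function names are likewise hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

def char_trigrams(corpus: List[str]) -> Counter:
    """Count character trigrams in error-free sentences."""
    counts: Counter = Counter()
    for sentence in corpus:
        counts.update(sentence[i:i + 3] for i in range(len(sentence) - 2))
    return counts

def scattered_runs(tokens: List[str], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges of consecutive single-character tokens."""
    runs, start = [], None
    for i, tok in enumerate(tokens + ["##"]):  # two-character sentinel closes a trailing run
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

def ngram_check(tokens: List[str], counts: Counter, threshold: int = 1) -> Set[int]:
    """Flag tokens inside scattered runs whose surrounding character trigram is rare."""
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append(pos)
        pos += len(tok)
    text = "".join(tokens)
    flagged: Set[int] = set()
    for start, end in scattered_runs(tokens):
        for t in range(start, end):
            c = offsets[t]
            window = text[max(0, c - 1):c + 2]  # the character with one neighbour each side
            if len(window) == 3 and counts[window] < threshold:
                flagged.add(t)  # positions at the text edges have no full window and are skipped
    return flagged

def fuse(crf_tags: List[str], ngram_flags: Set[int]) -> List[bool]:
    """Assumed fusion rule: suspicious if the CRF label is not C or the n-gram check flagged it."""
    return [tag != "C" or i in ngram_flags for i, tag in enumerate(crf_tags)]

# Toy example: "出现" mistyped as "出先", which the segmenter splits into single characters.
counts = char_trigrams(["患者出现发热"])
tokens = ["患者", "出", "先", "发热"]
flags = ngram_check(tokens, counts)
suspicious = fuse(["C", "C", "O", "C"], flags)
```

A union keeps every position that either checker considers suspicious, which favours recall; a stricter rule such as an intersection would be an equally plausible reading of step B3.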
Preferably, when step 3 is executed, the SnowNLP library is used to perform word segmentation and character segmentation on the completely correct texts and the erroneous texts.
Preferably, when step C1 is executed, the language model is a trigram language model, in which the word w_i at the i-th position depends only on the two preceding words w_{i-1} and w_{i-2}; the formula is:
Figure BDA0002052813060000061
wherein P represents a conditional probability, s represents the current sentence or symbol string, and n represents the number of consecutive symbols considered, here 3.
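The trigram probabilities behind this language model have to be estimated from a segmented, error-free corpus. The following is a minimal sketch using maximum-likelihood counts. The padding symbols and function names are assumptions for illustration; the text does not describe smoothing, so none is applied and unseen trigrams get probability 0.

```python
from collections import Counter
from typing import List, Tuple

def train_trigram(corpus: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count trigrams and their two-word contexts over segmented sentences."""
    tri: Counter = Counter()
    ctx: Counter = Counter()
    for words in corpus:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            ctx[(padded[i - 2], padded[i - 1])] += 1
    return tri, ctx

def sentence_prob(words: List[str], tri: Counter, ctx: Counter) -> float:
    """P(s) as the product of P(w_i | w_{i-2}, w_{i-1}) with maximum-likelihood estimates."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        if ctx[context] == 0:
            return 0.0  # context never seen in training
        prob *= tri[context + (padded[i],)] / ctx[context]
    return prob

# Toy corpus of two segmented sentences.
tri, ctx = train_trigram([["患者", "出现", "发热"], ["患者", "出现", "咳嗽"]])
p_seen = sentence_prob(["患者", "出现", "发热"], tri, ctx)
p_unseen = sentence_prob(["发热", "患者"], tri, ctx)
```

In practice some form of smoothing would usually be added so that a single unseen trigram does not force the whole sentence probability to zero.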
Preferably, correcting missing errors in the final text error checking result with the trigram language model requires a corpus correctly labeled with error types and error positions, and comprises the following steps:
step S1: segmenting the input final text error checking result into words and locating the missing label M;
step S2: extracting the words immediately before and after the position labeled M and recording them in a missing text;
step S3: traversing the constructed trigram language model and judging whether the words recorded in the missing text are the same as the first and third words of some entry in the dictionary: if not, error correction fails, the search continues with the next missing label, and steps S1 to S3 are repeated until all missing texts have been processed; if so, go to step S5;
step S5: judging whether the match is unique: if unique, the second word of the matched entry is the missing word; if not, the second word of the entry with the higher word frequency is selected as the missing word; error correction then succeeds, the search continues with the next missing label, and steps S1 to S5 are repeated until all missing texts have been corrected.
Preferably, when step C3 is executed, the words carrying a substitution error label are corrected using the homophone dictionary as follows:
step T1: segmenting the text into words, traversing the segmented text to find the substitution error label O, converting the labeled word to pinyin and recording the pinyin, the word immediately preceding the label O being the substitution error;
step T2: judging whether the pinyin exists in the constructed homophone dictionary: if not, error correction fails; if so, the substituted word has homophones, and these homophones are taken as the substitution candidate words;
step T3: substituting each homophone candidate into the original sentence in turn, computing the probability of each resulting sentence, sorting the sentences in descending order of probability, and taking the homophone in the top-ranked sentence as the corrected word.
The invention relates to a remote consultation record text error correction method based on natural language processing, which solves the technical problem of automatic error checking and correction of texts.

Claims (5)

1. A remote consultation record text error correction method based on natural language processing is characterized in that: the method comprises the following steps:
step 1: deploying a central server and a plurality of clients, establishing a preprocessing module, a database, an error checking module and an error correction module in the central server, all clients communicating with the central server through the Internet;
step 2: inputting a plurality of original texts through any client, the client sending the original texts to the central server, and the central server storing all the original texts in the database and establishing in the database a training database for storing and accumulating the original texts;
step 3: classifying the original texts in the training database into completely correct texts and erroneous texts, performing word segmentation and character segmentation on both, and labeling the resulting training corpus according to the error positions and error types in the original texts, where label C represents correct, label R represents redundant, label D represents missing, label O represents substitution, and label M represents missing;
calling a conditional random field (CRF) and training it on the training corpus to obtain a training model;
step 4: inputting a text to be processed through any client, the client transmitting the text to be processed to the central server, and the preprocessing module in the central server preprocessing the text to be processed as follows:
step A1: performing word segmentation and character segmentation on the text to be processed;
step A2: labeling the segmented words and characters of the text to be processed as a test corpus;
step 5: the error checking module in the central server checks the text to be processed for errors as follows:
step B1: checking the test corpus of the text to be processed for errors with the training model and the CRF to obtain a CRF error checking result;
step B2: traversing all scattered strings in the text to be processed and performing n-gram scattered string error checking on the text to obtain an n-gram scattered string error checking result;
step B3: fusing the CRF error checking result and the n-gram scattered string error checking result and labeling the text to be processed to obtain the final text error checking result;
step 6: the central server inputs the final text error checking result obtained in step 5 into the error correction module, and the error correction module corrects it as follows:
step C1: constructing a language model to correct missing errors;
step C2: directly deleting the characters or words carrying a redundancy error label;
step C3: correcting the words carrying a substitution error label using a homophone dictionary, completing automatic error correction of the text;
step C4: outputting the corrected text;
step 7: establishing a main text proofreading interface, a text error checking interface and a text error correction interface at the client; the central server packages and sends the final text error checking result, the corrected text and the text to be processed to the client, and the client displays the text to be processed, the final text error checking result and the corrected text on the main text proofreading interface, the text error checking interface and the text error correction interface respectively.
2. The remote consultation record text error correction method based on natural language processing according to claim 1, characterized in that: in step 3, word segmentation and character segmentation are performed on the completely correct texts and the erroneous texts using the SnowNLP library.
3. The remote consultation record text error correction method based on natural language processing according to claim 1, characterized in that: when step C1 is executed, the language model is a trigram language model, in which the word w_i at the i-th position depends only on the two preceding words w_{i-1} and w_{i-2}; the formula is:
Figure FDA0002052813050000021
wherein P represents a conditional probability; s represents a current sentence or symbol string; n represents the number of symbols.
4. The remote consultation record text error correction method based on natural language processing according to claim 3, characterized in that: correcting missing errors in the final text error checking result with the trigram language model requires a corpus correctly labeled with error types and error positions, and comprises the following steps:
step S1: segmenting the input final text error checking result into words and locating the missing label M;
step S2: extracting the words immediately before and after the position labeled M and recording them in a missing text;
step S3: traversing the constructed trigram language model and judging whether the words recorded in the missing text are the same as the first and third words of some entry in the dictionary: if not, error correction fails, the search continues with the next missing label, and steps S1 to S3 are repeated until all missing texts have been processed; if so, go to step S5;
step S5: judging whether the match is unique: if unique, the second word of the matched entry is the missing word; if not, the second word of the entry with the higher word frequency is selected as the missing word; error correction then succeeds, the search continues with the next missing label, and steps S1 to S5 are repeated until all missing texts have been corrected.
5. The remote consultation record text error correction method based on natural language processing according to claim 1, characterized in that: when step C3 is executed, the words carrying a substitution error label are corrected using the homophone dictionary as follows:
step T1: segmenting the text into words, traversing the segmented text to find the substitution error label O, converting the labeled word to pinyin and recording the pinyin, the word immediately preceding the label O being the substitution error;
step T2: judging whether the pinyin exists in the constructed homophone dictionary: if not, error correction fails; if so, the substituted word has homophones, and these homophones are taken as the substitution candidate words;
step T3: substituting each homophone candidate into the original sentence in turn, computing the probability of each resulting sentence, sorting the sentences in descending order of probability, and taking the homophone in the top-ranked sentence as the corrected word.
CN201910379327.7A 2019-05-08 2019-05-08 Remote consultation record text error correction method based on natural language processing Active CN110110334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910379327.7A CN110110334B (en) 2019-05-08 2019-05-08 Remote consultation record text error correction method based on natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910379327.7A CN110110334B (en) 2019-05-08 2019-05-08 Remote consultation record text error correction method based on natural language processing

Publications (2)

Publication Number Publication Date
CN110110334A CN110110334A (en) 2019-08-09
CN110110334B true CN110110334B (en) 2022-09-13

Family

ID=67488670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910379327.7A Active CN110110334B (en) 2019-05-08 2019-05-08 Remote consultation record text error correction method based on natural language processing

Country Status (1)

Country Link
CN (1) CN110110334B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765836B (en) * 2019-08-28 2022-04-29 云知声智能科技股份有限公司 Text positioning method and system based on natural language understanding
CN110532562B (en) * 2019-08-30 2021-07-16 联想(北京)有限公司 Neural network training method, idiom misuse detection method and device and electronic equipment
CN110826304A (en) * 2019-11-13 2020-02-21 北京雅丁信息技术有限公司 Medical corpus labeling method
CN113627191A (en) * 2021-07-05 2021-11-09 中国气象局公共气象服务中心(国家预警信息发布中心) Automatic labeling method and system for meteorological early warning sample semantics
CN116975298B (en) * 2023-09-22 2023-12-05 厦门智慧思明数据有限公司 NLP-based modernized society governance scheduling system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
WO2018120889A1 (en) * 2016-12-28 2018-07-05 平安科技(深圳)有限公司 Input sentence error correction method and device, electronic device, and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
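Research on Part-of-Speech Recognition of New Microblog Words Based on a Conditional Random Field Model and Text Error Correction; 韩彦昭 (Han Yanzhao) et al.; 《南京大学学报(自然科学)》 (Journal of Nanjing University (Natural Science)); 2016-03-31; Vol. 32, No. 02; pp. 353-360 *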

Also Published As

Publication number Publication date
CN110110334A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN110110334B (en) Remote consultation record text error correction method based on natural language processing
CN110489760B (en) Text automatic correction method and device based on deep neural network
US20210157975A1 (en) Device, system, and method for extracting named entities from sectioned documents
CN100511215C (en) Multilingual translation memory and translation method thereof
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN109460552B (en) Method and equipment for automatically detecting Chinese language diseases based on rules and corpus
CN109471793B (en) Webpage automatic test defect positioning method based on deep learning
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN105279149A (en) Chinese text automatic correction method
Hossain et al. Auto-correction of english to bengali transliteration system using levenshtein distance
CN111062376A (en) Text recognition method based on optical character recognition and error correction tight coupling processing
CN111062397A (en) Intelligent bill processing system
Saluja et al. Error detection and corrections in Indic OCR using LSTMs
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
Wang et al. A probabilistic address parser using conditional random fields and stochastic regular grammar
KR101072460B1 (en) Method for korean morphological analysis
CN111401012A (en) Text error correction method, electronic device and computer readable storage medium
CN115935914A (en) Admission record missing text supplementing method
Pal et al. Vartani Spellcheck--Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance
Kang et al. Two approaches for the resolution of word mismatch problem caused by English words and foreign words in Korean information retrieval
CN116166768A (en) Text knowledge extraction method and system based on rules
CN115618883A (en) Business semantic recognition method and device
Abdulmalek et al. Levenstein's Algorithm On English and Arabic: A Survey
Généreux et al. NLP challenges in dealing with OCR-ed documents of derogated quality
Wibowo et al. Spelling checker of words in rejang language using the n-gram and euclidean distance methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant