CN113947070A

CN113947070A - Method for automatically identifying missing characters of Chinese text

Info

Publication number: CN113947070A
Application number: CN202111203237.6A
Authority: CN
Inventors: 孟奥; 王宁; 张发雨; 党章; 吴兴龙; 冯立二; 杨正云
Original assignee: Jiangsu Future Networks Innovation Institute
Current assignee: Jiangsu Future Networks Innovation Institute
Priority date: 2021-10-15
Filing date: 2021-10-15
Publication date: 2022-01-18

Abstract

The invention provides a method for automatically identifying missing characters in a Chinese text, comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be detected is obtained through input or active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is split at the symbols '。', '？' and '！', with the punctuation mark retained at the end of each split sentence, and the split text forms a list of Chinese short sentences; each short sentence in the list is then processed in turn, the bert model of pycorrector is used to predict possible missing characters, yielding the missing characters, their positions and other information, and the results are collated and output. The invention can find wrongly written characters and missing characters that may exist in a Chinese text in advance, thereby greatly reducing labor cost. The invention can be applied to various Chinese texts and has broad application prospects.

Description

Method for automatically identifying missing characters of Chinese text

Technical Field

The invention relates to the field of text recognition, in particular to a method for automatically recognizing missing characters of a Chinese text.

Background

At present, relatively good detection schemes exist for wrongly written characters and sensitive words in Chinese texts. Wrongly written characters can be detected and identified with the kenlm statistical language model tool, a transformer model, a conv_seq2seq model, a bert model, improved models based on bert, and the like. For sensitive words, a sensitive-word database can be compiled and detection performed by matching against it. Although the precision and recall of these detectors vary, their results provide a useful reference and greatly reduce the manual workload.

In the area of wrongly-written-character recognition, pycorrector is a Chinese text error correction tool. It locates wrongly written characters using a language model and corrects them using pinyin phonetic-similarity features, stroke and five-stroke (Wubi) edit-distance features, and language-model perplexity features. It integrates the models mentioned above and offers a quick way to use them; for example, the bert detection integrated in pycorrector identifies wrongly written characters. However, this approach works well only for wrongly written characters; characters that are missing from the Chinese text are not identified well. Building on the way pycorrector and bert handle wrongly written characters, the present method adds some design of its own to detect and locate characters that may be missing from a Chinese text.

Disclosure of Invention

The invention aims to provide a method for automatically identifying missing characters of a Chinese text, which aims to solve the problems in the background technology.

In order to achieve the purpose, the invention adopts the following technical scheme:

A method for automatically identifying missing characters in a Chinese text, comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be detected is obtained through input or active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is split at the symbols '。', '？' and '！', with the punctuation mark retained at the end of each split sentence, and the split text forms a list of Chinese short sentences; each short sentence in the list is then processed in turn, the bert model of pycorrector is used to predict possible missing characters, yielding the missing characters, their positions and other information, and the results are collated and output.
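
The segmentation step of the overall flow can be sketched as follows. This is only an illustration under stated assumptions: the source does not name an implementation language, so Python and its standard `re` module are assumed, and the function name `split_short_sentences` and the sample strings are hypothetical.

```python
import re
from typing import List, Union

# Sentence-ending marks named in the text: full stop, question mark, exclamation mark.
SENTENCE_END = "。？！"


def split_short_sentences(text: Union[str, bytes]) -> List[str]:
    """Split a text after each sentence-ending mark, keeping the mark attached."""
    # Assumption: raw bytes are decoded as UTF-8 so that later steps see one str type.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # A zero-width lookbehind splits *after* the mark, so the mark stays
    # at the end of its own short sentence.
    parts = re.split(f"(?<=[{SENTENCE_END}])", text)
    # Drop the empty tail produced when the text ends with a mark.
    return [part for part in parts if part]


if __name__ == "__main__":
    for short_sentence in split_short_sentences("今天心情很好。你呢？好！"):
        print(short_sentence)
```

Splitting with a lookbehind, rather than on the marks themselves, is one simple way to keep each punctuation mark with its sentence, so that character positions in a short sentence can still be mapped back to the full text.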

The specific processing flow of the Chinese short sentence comprises the following specific steps:

S1: take a Chinese short sentence;

S2: add a [MASK] in front of the character at subscript n of the Chinese short sentence, where n runs in sequence from 0 to the length of the Chinese short sentence minus 1;

for example, the short sentence is: today's mood is very good;

when n is 0, the result after masking is: [MASK] today's mood is very good

when n is 2, the result after masking is: today's [MASK] mood is very good

S3: use the bert model to predict a suitable character for the [MASK] position of the masked Chinese short sentence, keeping only the top-1 prediction; an example of the predicted top-1 information is as follows:

{
'sequence': "[CLS] I today's mood is very good [SEP]",
'score': 0.33023348450660706,
'token': 2769,
'token_str': 'I'
};

S4: extract the prediction information from the top-1 prediction result, and judge whether a character may be missing at the current [MASK] position of the Chinese short sentence according to the following conditions:

(1) the 'score' value in the prediction result is greater than 0.90 (the 'score' value lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the Chinese short sentence; the 'score' threshold can be adjusted according to the actual situation);

(2) the character predicted in 'token_str' is different from the characters immediately before and after the current [MASK] position of the Chinese short sentence;

if both conditions are satisfied, the current [MASK] position is considered to possibly be missing the character given by 'token_str'; the prediction result is retained and collated as follows:

{
'context': 'today is very good',
'correct_content': "today's mood is very good",
'correct_word': 'mood',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}

the fields have the following meanings: 'context': the original Chinese short sentence; 'correct_content': the Chinese short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score calculated after the missing character is added; 'pos': the position at which the missing character is added in the short sentence; 'text_pos': the position at which the missing character is added in the Chinese text;

S5: increase the subscript n of the Chinese short sentence by 1, then repeat steps S2 to S4 until n equals the length of the Chinese short sentence minus 1;

S6: output all retained, post-processed prediction results.
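
Steps S2 and S4 above can be sketched in a few lines. This is a minimal illustration, assuming Python; the function names `masked_variants` and `judge` are hypothetical, the sample sentence is illustrative, and 'text_pos' is left out because it depends on where the short sentence sits in the full text.

```python
from typing import Dict, List, Optional

MASK = "[MASK]"


def masked_variants(sentence: str) -> List[str]:
    """S2: insert [MASK] in front of the character at each subscript n."""
    return [sentence[:n] + MASK + sentence[n:] for n in range(len(sentence))]


def judge(sentence: str, n: int, top1: Dict, threshold: float = 0.90) -> Optional[Dict]:
    """S4: keep the top-1 prediction only if both conditions hold."""
    word = top1["token_str"]
    before = sentence[n - 1] if n > 0 else None  # character just before the mask
    after = sentence[n]                          # character just after the mask
    # Condition (1): score must exceed the threshold.
    # Condition (2): predicted character must differ from both neighbours.
    if top1["score"] <= threshold or word in (before, after):
        return None
    return {
        "context": sentence,
        "correct_content": sentence[:n] + word + sentence[n:],
        "correct_word": word,
        "score": top1["score"],
        "pos": n,
    }


if __name__ == "__main__":
    print(masked_variants("今天情很好")[2])
    print(judge("今天情很好", 2, {"token_str": "心", "score": 0.94}))
```

The threshold is a parameter because the text notes that the 'score' threshold can be adjusted to the actual situation.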

The invention provides a method for recognizing missing characters in a Chinese text. Manually checking a Chinese text for wrongly written or missing characters is time-consuming and laborious; with the recognition method provided herein, wrongly written characters and missing characters that may exist in the text can be found in advance, greatly reducing labor cost. The invention can be applied to various Chinese texts and has broad application prospects.

Drawings

FIG. 1 is a flowchart illustrating an overall implementation of text missing word recognition in the present invention;

FIG. 2 is a flow chart of single Chinese phrase recognition in the method for missing word recognition of Chinese text according to the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work based on the embodiments of the present invention belong to the scope of the present invention.

Example 1

As shown in fig. 1, a method for automatically identifying missing characters in a Chinese text comprises an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be detected is obtained through input or active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is split at the symbols '。', '？' and '！', with the punctuation mark retained at the end of each split sentence, and the split text forms a list of Chinese short sentences; each short sentence in the list is then processed in turn, the bert model of pycorrector is used to predict possible missing characters, yielding the missing characters, their positions and other information, and the results are collated and output.
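
The overall flow of fig. 1 might be wired together roughly as below. This is a sketch under assumptions, in Python: `find_missing_in_text` and the `check_sentence` callback are hypothetical names, and computing 'text_pos' as the short sentence's starting offset in the text plus 'pos' is an assumed reading of the field description, which the source does not spell out.

```python
import re
from typing import Callable, Dict, List


def find_missing_in_text(
    text: str, check_sentence: Callable[[str], List[Dict]]
) -> List[Dict]:
    """Split the text, check each short sentence, and collect all results."""
    results: List[Dict] = []
    offset = 0  # index of the current short sentence's first character in the text
    for sentence in re.split("(?<=[。？！])", text):
        if not sentence:
            continue
        # check_sentence is assumed to return dicts that contain at least 'pos'.
        for hit in check_sentence(sentence):
            # Assumption: position in the full text = sentence offset + pos.
            results.append({**hit, "text_pos": offset + hit["pos"]})
        offset += len(sentence)
    return results


if __name__ == "__main__":
    # Stand-in checker for demonstration only; a real one would call the model.
    def demo_checker(sentence: str) -> List[Dict]:
        return [{"pos": 1}] if "你" in sentence else []

    print(find_missing_in_text("今天好。你呢？", demo_checker))
```

Passing the per-sentence check in as a callback keeps the overall flow independent of the model used for prediction.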

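As shown in fig. 2, the specific processing flow of the Chinese short sentence comprises the following steps:

S1: take a Chinese short sentence;

S2: add a [MASK] in front of the character at subscript n of the Chinese short sentence, where n runs in sequence from 0 to the length of the Chinese short sentence minus 1;

for example, the short sentence is: today's mood is very good;

when n is 0, the result after masking is: [MASK] today's mood is very good

when n is 2, the result after masking is: today's [MASK] mood is very good

S3: use the bert model to predict a suitable character for the [MASK] position of the masked Chinese short sentence, keeping only the top-1 prediction; an example of the predicted top-1 information is as follows:

{
'sequence': "[CLS] I today's mood is very good [SEP]",
'score': 0.33023348450660706,
'token': 2769,
'token_str': 'I'
};

S4: extract the prediction information from the top-1 prediction result, and judge whether a character may be missing at the current [MASK] position of the Chinese short sentence according to the following conditions:

(1) the 'score' value in the prediction result is greater than 0.90 (the 'score' value lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the Chinese short sentence; the 'score' threshold can be adjusted according to the actual situation);

(2) the character predicted in 'token_str' is different from the characters immediately before and after the current [MASK] position of the Chinese short sentence;

if both conditions are satisfied, the current [MASK] position is considered to possibly be missing the character given by 'token_str'; the prediction result is retained and collated as follows:

{
'context': 'today is very good',
'correct_content': "today's mood is very good",
'correct_word': 'mood',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}

the fields have the following meanings: 'context': the original Chinese short sentence; 'correct_content': the Chinese short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score calculated after the missing character is added; 'pos': the position at which the missing character is added in the short sentence; 'text_pos': the position at which the missing character is added in the Chinese text;

S5: increase the subscript n of the Chinese short sentence by 1, then repeat steps S2 to S4 until n equals the length of the Chinese short sentence minus 1;

S6: output all retained, post-processed prediction results.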
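
For the prediction step (S3) and the loop over n (S5), one possible sketch is given below. It does not use pycorrector's own interface, which the source does not show; instead it assumes the Hugging Face `transformers` fill-mask pipeline with the `bert-base-chinese` checkpoint as a stand-in, since that pipeline returns dictionaries with the same 'sequence', 'score', 'token' and 'token_str' keys as the example above. The function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def load_predictor(model_name: str = "bert-base-chinese") -> Callable[[str], Dict]:
    """Return a function that maps a masked sentence to its top-1 prediction."""
    # Imported lazily: requires the third-party `transformers` package and a
    # backend such as PyTorch, and downloads the model on first use.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    # top_k=1 keeps only the best candidate for the single [MASK].
    return lambda masked: unmasker(masked, top_k=1)[0]


def collect_predictions(
    sentence: str, predict_top1: Callable[[str], Dict]
) -> List[Tuple[int, Dict]]:
    """S2, S3 and S5: mask each position in turn and record the top-1 prediction."""
    predictions = []
    for n in range(len(sentence)):
        masked = sentence[:n] + "[MASK]" + sentence[n:]
        predictions.append((n, predict_top1(masked)))
    return predictions


if __name__ == "__main__":
    predictor = load_predictor()
    for n, top1 in collect_predictions("今天情很好", predictor):
        print(n, top1["token_str"], round(top1["score"], 4))
```

Each (n, prediction) pair would then be filtered by the two conditions of S4 before the retained results are output in S6.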

The foregoing shows and describes the general principles, essential features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the specification describe only preferred embodiments and are not intended to limit the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A method for automatically identifying missing characters in a Chinese text, characterized by comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence, wherein the overall processing flow for the Chinese text comprises the following steps: acquiring the Chinese text to be detected through input or active loading; preprocessing the Chinese text and uniformly encoding it as UTF-8; splitting the encoded Chinese text at the symbols '。', '？' and '！', with the punctuation mark retained at the end of each split sentence, the split text forming a list of Chinese short sentences; and processing each Chinese short sentence in the list in turn, using the bert model of pycorrector to predict possible missing characters, obtaining the missing characters, their positions and other information, and collating and outputting the results.

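2. The method for automatically identifying missing characters in a Chinese text as claimed in claim 1, wherein the specific processing flow of the Chinese short sentence comprises the following steps:

S1: take a Chinese short sentence;

S2: add a [MASK] in front of the character at subscript n of the Chinese short sentence, where n runs in sequence from 0 to the length of the Chinese short sentence minus 1;

for example, the short sentence is: today's mood is very good;

when n is 0, the result after masking is: [MASK] today's mood is very good

when n is 2, the result after masking is: today's [MASK] mood is very good

S3: use the bert model to predict a suitable character for the [MASK] position of the masked Chinese short sentence, keeping only the top-1 prediction; an example of the predicted top-1 information is as follows:

{
'sequence': "[CLS] I today's mood is very good [SEP]",
'score': 0.33023348450660706,
'token': 2769,
'token_str': 'I'
};

S4: extract the prediction information from the top-1 prediction result, and judge whether a character may be missing at the current [MASK] position of the Chinese short sentence according to the following conditions:

(1) the 'score' value in the prediction result is greater than 0.90 (the 'score' value lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the Chinese short sentence; the 'score' threshold can be adjusted according to the actual situation);

(2) the character predicted in 'token_str' is different from the characters immediately before and after the current [MASK] position of the Chinese short sentence;

if both conditions are satisfied, the current [MASK] position is considered to possibly be missing the character given by 'token_str'; the prediction result is retained and collated as follows:

{
'context': 'today is very good',
'correct_content': "today's mood is very good",
'correct_word': 'mood',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}

the fields have the following meanings: 'context': the original Chinese short sentence; 'correct_content': the Chinese short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score calculated after the missing character is added; 'pos': the position at which the missing character is added in the short sentence; 'text_pos': the position at which the missing character is added in the Chinese text;

S5: increase the subscript n of the Chinese short sentence by 1, then repeat steps S2 to S4 until n equals the length of the Chinese short sentence minus 1;

S6: output all retained, post-processed prediction results.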