CN113947070A - Method for automatically identifying missing characters of Chinese text - Google Patents
- Publication number
- CN113947070A (application number CN202111203237.6A)
- Authority
- CN
- China
- Prior art keywords
- chinese
- short sentence
- text
- missing
- chinese text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention provides a method for automatically identifying missing characters in a Chinese text, comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be checked is obtained by input or by active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is segmented at the period, question-mark and exclamation-mark symbols ('。', '？', '！'), the punctuation mark at the end of each segmented sentence is retained, and the segments form a list of Chinese short sentences; each short sentence in the list is then processed in turn, and the BERT model of pycorrector is used to predict possible missing characters, yielding the missing characters, their positions and related information, which are collated and output. The invention can find wrongly written characters and missing characters that may exist in a Chinese text in advance, thereby greatly reducing labor cost. The invention can be applied to all kinds of Chinese text and has broad application prospects.
Description
Technical Field
The invention relates to the field of text recognition, in particular to a method for automatically recognizing missing characters of a Chinese text.
Background
At present, relatively good detection schemes exist for wrongly written characters and sensitive words in Chinese text. Wrongly written characters can be detected and identified with the kenlm statistical language model tool, a transformer model, a conv_seq2seq model, a BERT model, improved BERT-based models, and the like; sensitive words can be recorded in a sensitive-word database and then detected and identified by matching. Although the precision and recall of these detectors vary, their results give people a useful reference and greatly reduce the manual workload.
For the recognition of wrongly written characters, pycorrector is a Chinese text error-correction tool. pycorrector locates wrongly written characters with a language model and corrects them using pinyin (phonetic) similarity features, stroke and Wubi (five-stroke) edit-distance features, and language-model perplexity features. It integrates the models mentioned above and provides a quick way to use them; for example, pycorrector integrates BERT to detect and identify wrongly written characters. However, this approach only identifies wrongly written characters well; if characters are missing from the Chinese text, they are not identified well. Building on the way pycorrector and BERT handle wrongly written characters, the present method adds some design so that characters that may be missing from a Chinese text can be detected and located.
Disclosure of Invention
The invention aims to provide a method for automatically identifying missing characters of a Chinese text, so as to solve the problems described in the background above.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for automatically identifying missing characters of a Chinese text, characterized by comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be checked is obtained by input or by active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is segmented at the period, question-mark and exclamation-mark symbols ('。', '？', '！'), the punctuation mark at the end of each segmented sentence is retained, and the segments form a list of Chinese short sentences; each short sentence in the list is then processed in turn, and the BERT model of pycorrector is used to predict possible missing characters, yielding the missing characters, their positions and related information, which are collated and output.
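The segmentation step of the overall flow can be sketched as follows. This is a minimal illustration, not the patent's own code: the function name `split_short_sentences` is hypothetical, and treating both the full-width marks and their ASCII counterparts as delimiters is an assumption.

```python
import re

# Assumption: sentence-ending marks cover the full-width forms used in Chinese
# text as well as their ASCII counterparts.
_END = re.escape("。？！.?!")
# One short sentence = a run of non-terminators plus any terminators that follow.
_PATTERN = re.compile(f"[^{_END}]+[{_END}]*")


def split_short_sentences(text: str) -> list[str]:
    """Split text into short sentences, keeping the trailing punctuation mark."""
    pieces = (m.group().strip() for m in _PATTERN.finditer(text))
    return [p for p in pieces if p]


if __name__ == "__main__":
    for sentence in split_short_sentences("今天心情很好。你呢？太棒了！"):
        print(sentence)
```

Including the ASCII period means a decimal such as "3.14" would also be split; restricting the set to the full-width marks avoids that if the input is known to be Chinese-only.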
The specific processing flow of a Chinese short sentence comprises the following steps:
S1: take a Chinese short sentence;
S2: insert a [MASK] token before the character at subscript n of the short sentence, where n runs in sequence from 0 to the length of the short sentence minus 1;
for example, the short sentence is: today's mood is very good (subscripts count characters of the original Chinese sentence, in which "today" occupies subscripts 0 and 1);
when n is 0, the result after masking is: [MASK] today's mood is very good
when n is 2, the result after masking is: today's [MASK] mood is very good
S3: use the BERT model to predict a suitable character for the [MASK] position of the masked short sentence, keeping only the top-1 prediction; an example of the predicted top-1 information is as follows:
{
'sequence': '[CLS] I am in a very good mood today [SEP]',
'score': 0.33023348450660706,
'token': 2769,
'token_str': 'I'
};
S4: extract the prediction information from the top-1 prediction result, and judge whether a character is likely to be missing at the current [MASK] position of the short sentence according to the following conditions:
(1) the 'score' value in the prediction result is greater than 0.90 (the score lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the short sentence; the 'score' threshold can be adjusted to the actual situation);
(2) the character predicted in 'token_str' differs from the characters immediately before and after the current [MASK] position of the short sentence;
if both conditions are satisfied, the current [MASK] position is considered likely to be missing the character given by 'token_str'; the prediction result is retained and collated as follows:
{
'context': 'today is very good',
'correct_content': "today's mood is very good",
'correct_word': 'mood',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}
each field has the following meaning: 'context': the original Chinese short sentence; 'correct_content': the short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score computed after the missing character is added; 'pos': the position at which the missing character is added in the short sentence; 'text_pos': the position at which the missing character is added in the Chinese text;
S5: increase the subscript n by 1 and repeat steps S2 to S4 until n equals the length of the short sentence minus 1;
S6: output all retained, post-processed prediction results.
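Steps S1 to S6 can be sketched as a single loop. This is an illustration under stated assumptions rather than the patent's implementation: `predict_top1` is a hypothetical callable standing in for the BERT fill-mask call, `fake_predict` is a stub whose scores are invented for the demo, and the sample sentence is illustrative.

```python
from typing import Callable

MASK = "[MASK]"


def masked_variants(sentence: str) -> list[str]:
    """S2: insert [MASK] before the character at each subscript n."""
    return [sentence[:n] + MASK + sentence[n:] for n in range(len(sentence))]


def find_missing_characters(
    sentence: str,
    predict_top1: Callable[[str], dict],  # hypothetical: masked text -> top-1 dict
    threshold: float = 0.90,              # adjustable, as the text notes
) -> list[dict]:
    results = []
    for n, masked in enumerate(masked_variants(sentence)):  # S5: n = 0 .. len-1
        top1 = predict_top1(masked)                         # S3: top-1 only
        char = top1["token_str"]
        before = sentence[n - 1] if n > 0 else None
        after = sentence[n]
        # S4: score above the threshold AND prediction differs from both neighbours
        if top1["score"] > threshold and char != before and char != after:
            results.append({
                "context": sentence,
                "correct_content": sentence[:n] + char + sentence[n:],
                "correct_word": char,
                "score": top1["score"],
                "pos": n,
            })
    return results                                          # S6


def fake_predict(masked: str) -> dict:
    """Stub predictor for the demo; a real one would query a BERT model."""
    if masked == "今天[MASK]情很好":
        return {"score": 0.94, "token_str": "心"}
    return {"score": 0.10, "token_str": "的"}


if __name__ == "__main__":
    print(find_missing_characters("今天情很好", fake_predict))
```

Passing the predictor in as a parameter keeps the loop testable without loading a model; a sentence of length L costs L model calls, one per mask position.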
The invention provides a method for recognizing missing characters in a Chinese text. Manually checking whether a Chinese text contains wrongly written or missing characters is time-consuming and laborious; with the recognition method provided here, wrongly written characters and missing characters that may exist in the text can be found in advance, which greatly reduces labor cost. The invention can be applied to all kinds of Chinese text and has broad application prospects.
Drawings
FIG. 1 is a flowchart of the overall implementation of missing-character recognition for a Chinese text in the present invention;
FIG. 2 is a flowchart of the recognition of a single Chinese short sentence in the method for missing-character recognition of a Chinese text according to the present invention.
Detailed Description
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive work fall within the scope of protection of the present invention.
Example 1
As shown in FIG. 1, a method for automatically identifying missing characters of a Chinese text comprises an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be checked is obtained by input or by active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is segmented at the period, question-mark and exclamation-mark symbols ('。', '？', '！'), the punctuation mark at the end of each segmented sentence is retained, and the segments form a list of Chinese short sentences; each short sentence in the list is then processed in turn, and the BERT model of pycorrector is used to predict possible missing characters, yielding the missing characters, their positions and related information, which are collated and output.
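The UTF-8 preprocessing step can be sketched as below. The text does not say how the input encoding is determined, so the candidate list here (UTF-8 with an optional byte-order mark, then GB18030) and the function name `to_utf8_text` are assumptions made for illustration.

```python
def to_utf8_text(raw: bytes, candidates: tuple[str, ...] = ("utf-8-sig", "gb18030")) -> str:
    """Decode raw bytes by trying candidate encodings in order.

    "utf-8-sig" accepts UTF-8 with or without a byte-order mark. The returned
    str can then be written back out uniformly with .encode("utf-8").
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError("input could not be decoded with any candidate encoding")


if __name__ == "__main__":
    legacy_bytes = "今天心情很好。".encode("gb18030")
    unified = to_utf8_text(legacy_bytes).encode("utf-8")
    print(unified.decode("utf-8"))
```

Order matters: strict UTF-8 decoding rejects most legacy-encoded Chinese byte sequences, so it is tried first and the more permissive encoding serves as the fallback.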
As shown in FIG. 2, the specific processing flow of a Chinese short sentence comprises the following steps:
S1: take a Chinese short sentence;
S2: insert a [MASK] token before the character at subscript n of the short sentence, where n runs in sequence from 0 to the length of the short sentence minus 1;
for example, the short sentence is: today's mood is very good (subscripts count characters of the original Chinese sentence, in which "today" occupies subscripts 0 and 1);
when n is 0, the result after masking is: [MASK] today's mood is very good
when n is 2, the result after masking is: today's [MASK] mood is very good
S3: use the BERT model to predict a suitable character for the [MASK] position of the masked short sentence, keeping only the top-1 prediction; an example of the predicted top-1 information is as follows:
{
'sequence': '[CLS] I am in a very good mood today [SEP]',
'score': 0.33023348450660706,
'token': 2769,
'token_str': 'I'
};
S4: extract the prediction information from the top-1 prediction result, and judge whether a character is likely to be missing at the current [MASK] position of the short sentence according to the following conditions:
(1) the 'score' value in the prediction result is greater than 0.90 (the score lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the short sentence; the 'score' threshold can be adjusted to the actual situation);
(2) the character predicted in 'token_str' differs from the characters immediately before and after the current [MASK] position of the short sentence;
if both conditions are satisfied, the current [MASK] position is considered likely to be missing the character given by 'token_str'; the prediction result is retained and collated as follows:
{
'context': 'today is very good',
'correct_content': "today's mood is very good",
'correct_word': 'mood',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}
each field has the following meaning: 'context': the original Chinese short sentence; 'correct_content': the short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score computed after the missing character is added; 'pos': the position at which the missing character is added in the short sentence; 'text_pos': the position at which the missing character is added in the Chinese text;
S5: increase the subscript n by 1 and repeat steps S2 to S4 until n equals the length of the short sentence minus 1;
S6: output all retained, post-processed prediction results.
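The text distinguishes 'pos' (position within the short sentence) from 'text_pos' (position within the whole text) without saying how the latter is derived. One plausible reading, sketched here as an assumption, is that 'text_pos' equals the start offset of the short sentence plus 'pos'; the function names and the sample text are hypothetical.

```python
import re

# Assumption: short sentences end at the full-width marks 。？！
_SHORT_SENTENCE = re.compile(r"[^。？！]+[。？！]*")


def short_sentences_with_offsets(text: str) -> list[tuple[int, str]]:
    """Return (start offset in text, short sentence) pairs, punctuation kept."""
    return [(m.start(), m.group()) for m in _SHORT_SENTENCE.finditer(text)]


def to_text_pos(sentence_start: int, pos: int) -> int:
    """Map a position inside a short sentence to a position in the full text."""
    return sentence_start + pos


if __name__ == "__main__":
    text = "你好吗？今天情很好。"
    for start, sentence in short_sentences_with_offsets(text):
        # Suppose a missing character was reported at pos 2 of each sentence.
        print(sentence, "pos 2 -> text_pos", to_text_pos(start, 2))
```

Under this reading, the example result above, where 'pos' and 'text_pos' are both 2, would correspond to a short sentence that starts at offset 0 of the text.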
The foregoing shows and describes the general principles, essential features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description only illustrate preferred embodiments of the invention and are not intended to limit it. Various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (2)
1. A method for automatically identifying missing characters of a Chinese text, characterized by comprising an overall processing flow for the Chinese text and a specific processing flow for Chinese short sentences, wherein the overall processing flow for the Chinese text comprises the following steps: acquiring the Chinese text to be checked by input or by active loading; preprocessing the Chinese text and uniformly encoding it as UTF-8; segmenting the encoded Chinese text at the period, question-mark and exclamation-mark symbols ('。', '？', '！') in the text, retaining the punctuation mark at the end of each segmented sentence, the segmented text forming a list of Chinese short sentences; and processing each Chinese short sentence in the list in turn, using the BERT model of pycorrector to predict possible missing characters so as to obtain the missing characters, their positions and related information, and collating and outputting the results.
2. The method for automatically identifying missing characters of a Chinese text as claimed in claim 1, wherein the specific processing flow of a Chinese short sentence comprises the following steps:
S1: take a Chinese short sentence;
S2: insert a [MASK] token before the character at subscript n of the short sentence, where n runs in sequence from 0 to the length of the short sentence minus 1;
S3: use the BERT model to predict a suitable character for the [MASK] position of the masked short sentence, keeping only the top-1 prediction; an example of the predicted top-1 information is as follows:
{
'sequence': '[CLS] I am in a very good mood today [SEP]',
'score': 0.33023348450660706,
'token': 2769,
'token_str': 'I'
};
S4: extract the prediction information from the top-1 prediction result, and judge whether a character is likely to be missing at the current [MASK] position of the short sentence according to the following conditions:
(1) the 'score' value in the prediction result is greater than 0.90 (the score lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the short sentence; the 'score' threshold can be adjusted to the actual situation);
(2) the character predicted in 'token_str' differs from the characters immediately before and after the current [MASK] position of the short sentence;
if both conditions are satisfied, the current [MASK] position is considered likely to be missing the character given by 'token_str'; the prediction result is retained and collated as follows:
{
'context': 'today is very good',
'correct_content': "today's mood is very good",
'correct_word': 'mood',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}
each field has the following meaning: 'context': the original Chinese short sentence; 'correct_content': the short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score computed after the missing character is added; 'pos': the position at which the missing character is added in the short sentence; 'text_pos': the position at which the missing character is added in the Chinese text;
S5: increase the subscript n by 1 and repeat steps S2 to S4 until n equals the length of the short sentence minus 1;
S6: output all retained, post-processed prediction results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111203237.6A CN113947070A (en) | 2021-10-15 | 2021-10-15 | Method for automatically identifying missing characters of Chinese text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111203237.6A CN113947070A (en) | 2021-10-15 | 2021-10-15 | Method for automatically identifying missing characters of Chinese text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113947070A (en) | 2022-01-18 |
Family
ID=79330736
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111203237.6A (Pending) CN113947070A (en) | 2021-10-15 | 2021-10-15 | Method for automatically identifying missing characters of Chinese text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113947070A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||