CN113947070A - Method for automatically identifying missing characters of Chinese text - Google Patents

Method for automatically identifying missing characters of Chinese text

Info

Publication number
CN113947070A
Authority
CN
China
Prior art keywords
chinese
short sentence
text
missing
chinese text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111203237.6A
Other languages
Chinese (zh)
Inventor
孟奥
王宁
张发雨
党章
吴兴龙
冯立二
杨正云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Future Networks Innovation Institute
Original Assignee
Jiangsu Future Networks Innovation Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-10-15
Filing date
2021-10-15
Publication date
2022-01-18
Application filed by Jiangsu Future Networks Innovation Institute
Priority to CN202111203237.6A
Publication of CN113947070A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method for automatically identifying missing characters in Chinese text, comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be checked is obtained by input or by active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is split at the punctuation marks "。", "？", and "！", with each mark retained at the end of the short sentence it terminates, and the resulting pieces form a list of Chinese short sentences; each short sentence in the list is then processed in turn, using the BERT model of pycorrector to predict possible missing characters and to obtain the missing character, its position, and related information, and the results are collated and output. The invention can locate, in advance, wrongly written and missing characters that may exist in a Chinese text, greatly reducing labor cost. The invention can be applied to all kinds of Chinese text and has broad application prospects.

Description

Method for automatically identifying missing characters of Chinese text
Technical Field
The invention relates to the field of text recognition, in particular to a method for automatically recognizing missing characters of a Chinese text.
Background
At present, reasonably good detection schemes exist for wrongly written characters and sensitive words in Chinese text. Wrongly written characters can be detected and identified with the kenlm statistical language model tool, a transformer model, a conv_seq2seq model, a BERT model, improved models based on BERT, and the like. Sensitive words can be recorded in a sensitive-word database and then detected by matching against it. Although precision and recall vary between these approaches, their results give people a useful reference and greatly reduce manual workload.
For wrongly-written-character recognition, pycorrector is a Chinese text error correction tool. It locates wrongly written characters with a language model and corrects them using pinyin phonetic-similarity features, stroke and Wubi edit-distance features, and language-model perplexity features. It integrates the models mentioned above and provides a quick way to use each of them; for example, pycorrector integrates BERT to detect and identify wrongly written characters. However, this approach only identifies wrongly written characters well; if characters are missing from the Chinese text, they are not identified well. Building on the way pycorrector and BERT handle wrongly written characters, the present method adds some design of its own to detect and locate characters that may be missing from Chinese text.
Disclosure of Invention
The invention aims to provide a method for automatically identifying missing characters in Chinese text, so as to solve the problem described in the Background.
To achieve this purpose, the invention adopts the following technical scheme:
A method for automatically identifying missing characters in Chinese text, comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be checked is obtained by input or by active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is split at the punctuation marks "。", "？", and "！", with each mark retained at the end of the short sentence it terminates, and the resulting pieces form a list of Chinese short sentences; each short sentence in the list is then processed in turn, using the BERT model of pycorrector to predict possible missing characters and to obtain the missing character, its position, and related information, and the results are collated and output.
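The preprocessing and splitting described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `to_utf8_text` and `split_short_sentences` are hypothetical, and the sketch assumes the three delimiters are the full-width marks "。", "？", and "！".

```python
import re

# Assumption: the delimiters are the full-width marks 。？！.
# Each match is either "text up to and including one delimiter"
# or a trailing run of text that has no closing delimiter.
_SHORT_SENTENCE = re.compile(r"[^。？！]*[。？！]|[^。？！]+$")


def to_utf8_text(raw):
    """Return the input as str, decoding bytes as UTF-8 (hypothetical helper)."""
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return raw


def split_short_sentences(text):
    """Split text into short sentences, keeping each delimiter at the
    end of the short sentence it terminates."""
    return [m.group(0) for m in _SHORT_SENTENCE.finditer(text)]


if __name__ == "__main__":
    sample = to_utf8_text("今天情很好。你好吗？好的！结尾".encode("utf-8"))
    for short_sentence in split_short_sentences(sample):
        print(short_sentence)
```

Because the delimiters are retained, joining the short sentences reproduces the original text, which keeps character positions in the short sentences easy to map back to positions in the full text.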
The specific processing flow for a Chinese short sentence comprises the following steps:
S1: take a Chinese short sentence;
S2: insert a [MASK] token in front of the character at index n of the short sentence, where n runs in turn from 0 to the length of the short sentence minus 1;
for example, take the short sentence 今天情很好 (the sentence 今天心情很好, "I am in a good mood today", with the character 心 missing);
when n is 0, the masked result is: [MASK]今天情很好
when n is 2, the masked result is: 今天[MASK]情很好
S3: use the BERT model to predict a suitable character for the [MASK] position in the masked short sentence, taking only the top-1 prediction; an example of the top-1 prediction information is:
{
'sequence': '[CLS] 我 今 天 情 很 好 [SEP]',
'score': 0.33023348450660706,
'token': 2769,
'token_str': '我'
};
S4: extract the information in the top-1 prediction and judge whether a character may be missing at the current [MASK] position of the short sentence according to the following conditions:
(1) the value of 'score' in the prediction is greater than 0.90 (the score lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the short sentence; the threshold on 'score' can be adjusted as needed);
(2) the character predicted in 'token_str' differs from the characters immediately before and after the current [MASK] position in the short sentence;
if both conditions are satisfied, the character given by 'token_str' is considered possibly missing at the current [MASK] position; the prediction is retained and collated as follows:
{
'context': '今天情很好',
'correct_content': '今天心情很好',
'correct_word': '心',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}
The fields have the following meanings: 'context': the original Chinese short sentence; 'correct_content': the short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score computed after the missing character is added; 'pos': the position at which the missing character is inserted in the short sentence; 'text_pos': the position at which the missing character is inserted in the Chinese text;
S5: increment the index n by 1 and repeat steps S2 to S4 until n equals the length of the short sentence minus 1;
S6: output all retained and collated prediction results.
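Steps S1 to S6 can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the inventors' code: `mask_before`, `find_missing_chars`, and `demo_predict_top1` are hypothetical names, and the predictor is passed in as a function so the loop can be read and run without loading a model. The stand-in predictor simply replays the two example predictions from the text. The 'text_pos' field is omitted because it depends on where the short sentence sits in the full text.

```python
from typing import Callable, Dict, List

MASK = "[MASK]"


def mask_before(sentence: str, n: int) -> str:
    """S2: insert a [MASK] token in front of the character at index n."""
    return sentence[:n] + MASK + sentence[n:]


def find_missing_chars(
    sentence: str,
    predict_top1: Callable[[str], Dict],
    threshold: float = 0.90,
) -> List[Dict]:
    """Run S2 to S5 over one short sentence and return the retained records (S6)."""
    kept = []
    for n in range(len(sentence)):  # n = 0 .. len(sentence) - 1
        top1 = predict_top1(mask_before(sentence, n))  # S3: top-1 prediction only
        token, score = top1["token_str"], top1["score"]
        before = sentence[n - 1] if n > 0 else None  # character before the mask
        after = sentence[n]                          # character after the mask
        # S4: keep the prediction only if both conditions hold.
        if score > threshold and token != before and token != after:
            kept.append({
                "context": sentence,
                "correct_content": sentence[:n] + token + sentence[n:],
                "correct_word": token,
                "score": score,
                "pos": n,
            })
    return kept


def demo_predict_top1(masked: str) -> Dict:
    """Hypothetical stand-in for the BERT model: replays the example scores."""
    if masked == "今天[MASK]情很好":
        return {"token_str": "心", "score": 0.9414968490600586}
    return {"token_str": "我", "score": 0.33023348450660706}


if __name__ == "__main__":
    print(find_missing_chars("今天情很好", demo_predict_top1))
```

Injecting the predictor keeps the thresholding and neighbor-comparison logic separate from the model, so the 0.90 threshold can be tuned without touching the prediction code.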
The invention provides a method for recognizing missing characters in Chinese text. Manually checking Chinese text for wrong or missing characters is time-consuming and laborious; with the recognition method provided here, wrong and missing characters that may exist in the text can be located in advance, greatly reducing labor cost. The invention can be applied to all kinds of Chinese text and has broad application prospects.
Drawings
FIG. 1 is a flowchart illustrating an overall implementation of text missing word recognition in the present invention;
FIG. 2 is a flow chart of single Chinese phrase recognition in the method for missing word recognition of Chinese text according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work based on the embodiments of the present invention belong to the scope of the present invention.
Example 1
As shown in FIG. 1, a method for automatically identifying missing characters in Chinese text comprises an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence. The Chinese text to be checked is obtained by input or by active loading; the text is preprocessed and uniformly encoded as UTF-8; the encoded text is split at the punctuation marks "。", "？", and "！", with each mark retained at the end of the short sentence it terminates, and the resulting pieces form a list of Chinese short sentences; each short sentence in the list is then processed in turn, using the BERT model of pycorrector to predict possible missing characters and to obtain the missing character, its position, and related information, and the results are collated and output.
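The outer loop of the overall flow can be sketched as follows. The source does not state how the position in the full text is derived, so this sketch assumes it is the offset of the short sentence in the text plus the position inside the short sentence; `collect_results` and the `detect` callable are hypothetical names, and the short sentences are assumed to be contiguous pieces of the text with their punctuation retained.

```python
from typing import Callable, Dict, List


def collect_results(
    short_sentences: List[str],
    detect: Callable[[str], List[Dict]],
) -> List[Dict]:
    """Process each short sentence in turn and attach a full-text position.

    `detect` is any per-sentence detector that returns records containing
    a 'pos' field (the index inside the short sentence).
    """
    results = []
    offset = 0  # index of the current short sentence's first character in the text
    for sentence in short_sentences:
        for record in detect(sentence):
            # Assumption: text_pos = sentence offset + position in the sentence.
            results.append({**record, "text_pos": offset + record["pos"]})
        offset += len(sentence)
    return results


if __name__ == "__main__":
    # Hypothetical detector that flags the character 情 wherever it occurs.
    def flag_qing(sentence: str) -> List[Dict]:
        if "情" in sentence:
            return [{"context": sentence, "pos": sentence.index("情")}]
        return []

    print(collect_results(["你好。", "今天情很好。"], flag_qing))
```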
As shown in FIG. 2, the specific processing flow for a Chinese short sentence comprises the following steps:
S1: take a Chinese short sentence;
S2: insert a [MASK] token in front of the character at index n of the short sentence, where n runs in turn from 0 to the length of the short sentence minus 1;
for example, take the short sentence 今天情很好 (the sentence 今天心情很好, "I am in a good mood today", with the character 心 missing);
when n is 0, the masked result is: [MASK]今天情很好
when n is 2, the masked result is: 今天[MASK]情很好
S3: use the BERT model to predict a suitable character for the [MASK] position in the masked short sentence, taking only the top-1 prediction; an example of the top-1 prediction information is:
{
'sequence': '[CLS] 我 今 天 情 很 好 [SEP]',
'score': 0.33023348450660706,
'token': 2769,
'token_str': '我'
};
S4: extract the information in the top-1 prediction and judge whether a character may be missing at the current [MASK] position of the short sentence according to the following conditions:
(1) the value of 'score' in the prediction is greater than 0.90 (the score lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the short sentence; the threshold on 'score' can be adjusted as needed);
(2) the character predicted in 'token_str' differs from the characters immediately before and after the current [MASK] position in the short sentence;
if both conditions are satisfied, the character given by 'token_str' is considered possibly missing at the current [MASK] position; the prediction is retained and collated as follows:
{
'context': '今天情很好',
'correct_content': '今天心情很好',
'correct_word': '心',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}
The fields have the following meanings: 'context': the original Chinese short sentence; 'correct_content': the short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score computed after the missing character is added; 'pos': the position at which the missing character is inserted in the short sentence; 'text_pos': the position at which the missing character is inserted in the Chinese text;
S5: increment the index n by 1 and repeat steps S2 to S4 until n equals the length of the short sentence minus 1;
S6: output all retained and collated prediction results.
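The prediction fields shown in S3 ('sequence', 'score', 'token', 'token_str') match the output of the fill-mask pipeline in the Hugging Face transformers library, so one way to obtain a top-1 prediction might look like the sketch below. This is an assumption for illustration: the source only says that the BERT model of pycorrector is used, the model name `bert-base-chinese` is a guess, and `first_prediction` and `make_predict_top1` are hypothetical helper names. The import is deferred so that the model is only loaded when a predictor is actually created.

```python
from typing import Any, Callable, Dict


def first_prediction(output: Any) -> Dict[str, Any]:
    """Return the single best candidate from fill-mask pipeline output.

    For one [MASK] the pipeline returns a list of candidate dicts; if the
    output is nested one level deeper, the first inner list is used.
    """
    best = output[0]
    if isinstance(best, list):
        best = best[0]
    return best


def make_predict_top1(
    model_name: str = "bert-base-chinese",  # assumed model, not named in the source
) -> Callable[[str], Dict[str, Any]]:
    """Build a function mapping a masked short sentence to its top-1 prediction."""
    from transformers import pipeline  # deferred: requires transformers and a model download

    fill_mask = pipeline("fill-mask", model=model_name)

    def predict_top1(masked_sentence: str) -> Dict[str, Any]:
        # top_k=1 asks the pipeline for the best candidate only.
        return first_prediction(fill_mask(masked_sentence, top_k=1))

    return predict_top1


if __name__ == "__main__":
    predict_top1 = make_predict_top1()
    print(predict_top1("今天[MASK]情很好"))
```

The scores produced by a particular model will generally differ from the example values in the text, so the 0.90 threshold may need to be re-tuned for whichever model is used.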
The foregoing shows and describes the general principles, essential features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate preferred embodiments of the invention and are not intended to limit it. Various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (2)

1. A method for automatically identifying missing characters in Chinese text, characterized by comprising an overall processing flow for the Chinese text and a specific processing flow for each Chinese short sentence, wherein the overall processing flow for the Chinese text comprises the following steps: obtaining the Chinese text to be checked by input or by active loading; preprocessing the Chinese text and uniformly encoding it as UTF-8; splitting the encoded Chinese text at the punctuation marks "。", "？", and "！", with each mark retained at the end of the short sentence it terminates, the resulting pieces forming a list of Chinese short sentences; and processing each short sentence in the list in turn, using the BERT model of pycorrector to predict possible missing characters and to obtain the missing character, its position, and related information, and collating and outputting the results.
2. The method for automatically identifying missing characters in Chinese text as claimed in claim 1, wherein the specific processing flow for a Chinese short sentence comprises the following steps:
S1: take a Chinese short sentence;
S2: insert a [MASK] token in front of the character at index n of the short sentence, where n runs in turn from 0 to the length of the short sentence minus 1;
S3: use the BERT model to predict a suitable character for the [MASK] position in the masked short sentence, taking only the top-1 prediction; an example of the top-1 prediction information is:
{
'sequence': '[CLS] 我 今 天 情 很 好 [SEP]',
'score': 0.33023348450660706,
'token': 2769,
'token_str': '我'
};
S4: extract the information in the top-1 prediction and judge whether a character may be missing at the current [MASK] position of the short sentence according to the following conditions:
(1) the value of 'score' in the prediction is greater than 0.90 (the score lies between 0 and 1; the larger the value, the better the predicted character fits at the [MASK] position of the short sentence; the threshold on 'score' can be adjusted as needed);
(2) the character predicted in 'token_str' differs from the characters immediately before and after the current [MASK] position in the short sentence;
if both conditions are satisfied, the character given by 'token_str' is considered possibly missing at the current [MASK] position; the prediction is retained and collated as follows:
{
'context': '今天情很好',
'correct_content': '今天心情很好',
'correct_word': '心',
'score': 0.9414968490600586,
'pos': 2,
'text_pos': 2
}
The fields have the following meanings: 'context': the original Chinese short sentence; 'correct_content': the short sentence after the predicted missing character is added; 'correct_word': the predicted missing character; 'score': the score computed after the missing character is added; 'pos': the position at which the missing character is inserted in the short sentence; 'text_pos': the position at which the missing character is inserted in the Chinese text;
S5: increment the index n by 1 and repeat steps S2 to S4 until n equals the length of the short sentence minus 1;
S6: output all retained and collated prediction results.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111203237.6A | 2021-10-15 | 2021-10-15 | Method for automatically identifying missing characters of Chinese text

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111203237.6A | 2021-10-15 | 2021-10-15 | Method for automatically identifying missing characters of Chinese text

Publications (1)

Publication Number | Publication Date
CN113947070A | 2022-01-18

Family

ID=79330736

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202111203237.6A | Pending | CN113947070A (en) | 2021-10-15 | 2021-10-15 | Method for automatically identifying missing characters of Chinese text

Country Status (1)

Country | Link
CN (1) | CN113947070A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination