CN114510926A - Text error correction method, text error correction device and electronic equipment - Google Patents

Text error correction method, text error correction device and electronic equipment

Info

Publication number
CN114510926A
CN114510926A
Authority
CN
China
Prior art keywords
target
replacement
characters
semantic
character
Prior art date
Legal status
Pending
Application number
CN202210134582.7A
Other languages
Chinese (zh)
Inventor
罗达雄
时从斌
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202210134582.7A priority Critical patent/CN114510926A/en
Publication of CN114510926A publication Critical patent/CN114510926A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a text error correction method, a text error correction device and electronic equipment, and belongs to the field of artificial intelligence. The text error correction method comprises the following steps: determining a target error position in a target text; processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character; generating a target replacement character set corresponding to the target character based on the semantic replacement character set, the phonetic replacement character set and the shape replacement character set; and correcting the target text based on the target replacement character set. The replacement characters in the semantic replacement character set are semantically similar to the target character, the replacement characters in the phonetic replacement character set have pinyin similar to that of the target character, and the replacement characters in the shape replacement character set have glyphs similar to that of the target character.

Description

Text error correction method, text error correction device and electronic equipment
Technical Field
The application belongs to the field of artificial intelligence, and particularly relates to a text error correction method, a text error correction device and electronic equipment.
Background
When a user inputs text by using an input device, the input device can correct the text to improve its accuracy, for example through pinyin-similarity-based error correction: after the user selects the proofreading mode and inputs the related text, if the text contains a pinyin-similar error, the input method displays a single correction result at the error position, so that the user can choose whether to correct the text.
However, the error correction provided by such input methods cannot reliably resolve errors: its correction capability is single and limited, and the accuracy of its correction results is low.
Disclosure of Invention
The embodiment of the application aims to provide a text error correction method, a text error correction device and electronic equipment, which can solve the problem that the accuracy of the existing error correction result is not high.
In a first aspect, an embodiment of the present application provides a text error correction method, where the method includes:
determining a target error position from the target text;
processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character;
generating a target replacement character set corresponding to the target character based on the semantic replacement character set, the phonetic replacement character set and the shape replacement character set;
correcting the target text based on the target replacement character set;
the alternative characters in the semantic alternative character set are characters with similar semantics to the target characters, the alternative characters in the phonetic alternative character set are characters with similar pinyin to the target characters, and the alternative characters in the shape similar alternative character set are characters with similar font to the target characters.
In a second aspect, an embodiment of the present application provides a text error correction apparatus, including:
the first determining module is used for determining a target error position from the target text;
the first processing module is used for processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character;
the second processing module is used for generating a target replacing character set corresponding to the target character based on the semantic replacing character set, the phonetic near replacing character set and the shape near replacing character set;
a third processing module, configured to correct the error of the target text based on the target replacement character set;
the alternative characters in the semantic alternative character set are characters with similar semantics to the target characters, the alternative characters in the phonetic alternative character set are characters with similar pinyin to the target characters, and the alternative characters in the shape similar alternative character set are characters with similar font to the target characters.
In a third aspect, embodiments of the present application provide an electronic device, which includes a processor and a memory, where the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product, stored on a storage medium, for execution by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, the target replacement character set corresponding to the target error position is generated from multiple dimensions, which can improve the accuracy, precision and comprehensiveness of the target replacement characters contained in the set. The target text is then corrected based on the target replacement character set: on the basis of providing more target replacement characters to the user, the target replacement characters with higher accuracy are displayed preferentially, which helps to improve error correction efficiency.
Drawings
FIG. 1 is a first schematic flowchart of a text error correction method provided in an embodiment of the present application;
FIG. 2 is a second schematic flowchart of a text error correction method provided in an embodiment of the present application;
FIG. 3 is a third schematic flowchart of a text error correction method provided in an embodiment of the present application;
FIG. 4 is a fourth schematic flowchart of a text error correction method provided in an embodiment of the present application;
FIG. 5 is a fifth schematic flowchart of a text error correction method provided in an embodiment of the present application;
FIG. 6 is a schematic interface diagram of a text error correction method provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a text error correction apparatus provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 9 is a hardware schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second" and the like in the description and in the claims of the present application are used to distinguish between similar elements, and do not necessarily describe a particular sequence or chronological order. It should be appreciated that data so used may be interchanged under appropriate circumstances, so that embodiments of the application can be practiced in sequences other than those illustrated or described herein. The terms "first", "second" and the like are used in a generic sense and do not limit the number of objects; for example, a first object can be one object or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates that the preceding and succeeding objects are in an "or" relationship.
In the related art, there are two ways to perform text error correction:
first, error correction based on the type of proximity. After the user inputs the related text, if the text is detected to have a near error, the input method prompts a replacing position and a replacing word to enable the user to select whether to correct the error, such as correcting the weather to be really good.
Second, error correction based on the shape-proximity (glyph-similar) type. After the user inputs the related text, if a glyph-similar error is detected, the input method prompts a replacement position and a replacement word so that the user can choose whether to correct the error, for example correcting the glyph-similar error "端" ("end") in a phrase meaning "turbulent water flow" to "湍" ("turbulent").
The above two text error correction methods have the following problems:
First, the current input method gives only a single correction result for each error; the number of candidates is small, and the user can only accept or reject the correction, so the applicable range is narrow and errors cannot always be effectively corrected.
Second, when the algorithm detects no error, the text input by the user is simply assumed to be correct, which gives the user little operational flexibility.
Third, whether the error correction is based on the pinyin-similar type or the glyph-similar type, the corresponding error correction capability is limited and covers few situations, which restricts the overall correction ability.
The text error correction method, the text error correction device, the electronic device, and the readable storage medium provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
The text error correction method may be applied to the terminal, and may be specifically executed by hardware or software in the terminal.
The terminal includes, but is not limited to, a mobile phone or other portable communication device such as a tablet computer having a display screen. It should also be understood that in some embodiments, the terminal may not be a portable communication device, but rather a desktop computer having a display screen.
In the following various embodiments, a terminal including a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and joystick.
In the text error correction method provided by the embodiment of the application, the execution body may be an electronic device, or a functional module or functional entity in the electronic device capable of implementing the method. The electronic device mentioned in the embodiments of the application includes, but is not limited to, a mobile phone, a tablet computer, a camera, a wearable device, and the like. The text error correction method provided by the embodiment of the application is described below with the electronic device as the execution body.
As shown in fig. 1, the text error correction method includes: step 110, step 120, step 130 and step 140.
Step 110, determining a target error position from a target text;
In this step, the target text is the text that is input by the user and needs to be corrected, such as a word, a phrase, a sentence or an article input by the user.
An error position is the position of an erroneous character in the target text, or a position in the target text whose error probability is greater than a preset value.
One or more erroneous characters may be present in the same target text, and thus one or more erroneous locations may also be present.
The target error position is a position corresponding to a character needing error correction in the one or more error positions.
It will be appreciated that in the case of normal operation of the input device, the user may enter target text through the input device, including but not limited to text entry or voice entry through an input method.
It should be noted that the determination of the target error position may be automatically determined by the input device, or may also be manually determined by the user, and the specific determination manner will be described in the following embodiments, which is not repeated herein.
In the actual implementation process, the one or more error positions existing in the same target text may be collected into a set, generating a candidate set of error positions, denoted Candidate.
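The construction of the Candidate set can be sketched as follows (a hypothetical illustration: `error_prob` stands in for whatever detection model or manual selection supplies per-position error probabilities, and the example text and scores are invented):

```python
# Hedged sketch: building the Candidate set of error positions.
# In practice the per-position error probabilities would come from a
# detection model; `error_prob` is a hypothetical stand-in.

def build_candidate(text, error_prob, threshold=0.5):
    """Return positions whose estimated error probability exceeds the threshold."""
    return [pos for pos, _ in enumerate(text) if error_prob(text, pos) > threshold]

# Toy scorer that flags the character "鱼" as likely erroneous.
toy_scores = lambda text, pos: 0.9 if text[pos] == "鱼" else 0.1

candidate = build_candidate("鱼杭区", toy_scores)  # position 0 is flagged
```

The set is then iterated over position by position, as in the steps that follow.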
Step 120, processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character, wherein the replacement characters in the semantic replacement character set are semantically similar to the target character, the replacement characters in the phonetic replacement character set have pinyin similar to that of the target character, and the replacement characters in the shape replacement character set have glyphs similar to that of the target character;
in this step, the replacement character is a character for replacing the target character at the target error position in the target text.
It is understood that semantically similar characters, i.e., characters expressing similar meanings or referring to similar entities, are generally used for error correction of common words or proper nouns, including but not limited to place names, person names, organization names, game names, movie names, etc., for example "鱼杭区" versus the place name "余杭区" (Yuhang District of Hangzhou).
Characters with similar pinyin, i.e., characters with the same or similar pronunciation, include characters with the same pinyin and the same tone, as well as characters with the same pinyin but different tones, such as "good" (hǎo) and "huge" (hào).
Characters with similar glyphs, namely characters whose shapes or strokes are similar, such as "端" ("end") and "湍" ("turbulent").
Each of the semantic replacement character set, the phonetic replacement character set and the shape replacement character set may include a plurality of replacement characters, for example, 30 or 50 replacement characters; and different character sets may include partially identical replacement characters or may include different replacement characters.
In the actual execution process, a plurality of initial semantic replacement characters corresponding to the target character can be generated based on the semantics at the target error position. For example, after the user inputs "鱼杭区", the terminal recognizes from the semantics that the position is probably part of a place name, generates the semantically similar replacement "余杭区" (Yuhang District), and adds it to the semantic replacement character set.
Based on the pinyin of the target character at the target error position, a plurality of pinyin-similar replacement characters are generated. Continuing the "鱼杭区" example: based on the pinyin "yu" of "鱼", other characters with the same or similar pronunciation are generated, such as "与", "域", "于", "余", "雨" and "预", and these characters are added to the pinyin-similar replacement character set.
Based on the glyph of the target character at the target error position, a plurality of glyph-similar replacement characters are generated. For example, based on the glyph of "鱼", other characters with similar shapes or strokes are generated, such as "龟" (tortoise) and other visually similar characters, and these are added to the glyph-similar replacement character set.
The implementation of step 120 is described below by way of specific embodiments.
In some embodiments, the semantic replacement character set comprises a first semantic replacement character set and a second semantic replacement character set, and step 120 may comprise:
generating a plurality of initial semantic replacing characters corresponding to the target error positions and generating probabilities corresponding to the initial semantic replacing characters based on the target word list;
determining N initial semantic replacement characters based on the generation probability, and sequencing the N initial semantic replacement characters based on the generation probability to generate a first semantic replacement character set;
replacing target characters corresponding to the target error positions with first semantic replacing characters in the first semantic replacing character set respectively;
generating a replacement score for the replaced text based on the conditional probability corresponding to the first semantic replacement character;
and sequencing the first semantic replacement characters in the first semantic replacement character set based on the replacement scores to generate a second semantic replacement character set.
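The two-stage ranking above (top-N by generation probability, then re-ranking by replacement score) can be sketched as follows; the candidate characters and all scores are hypothetical:

```python
def two_stage_rank(gen_probs, replacement_score, n=3):
    """gen_probs: {char: generation probability}.
    Returns (first_set, second_set) as ordered lists of characters."""
    # Stage 1: top-N by generation probability (first semantic replacement set).
    first = sorted(gen_probs, key=gen_probs.get, reverse=True)[:n]
    # Stage 2: re-rank those N by replacement score (second semantic replacement set).
    second = sorted(first, key=replacement_score, reverse=True)
    return first, second

# Hypothetical candidates for the erroneous "鱼" and invented scores:
probs = {"余": 0.6, "于": 0.3, "渔": 0.05, "玉": 0.05}
score = {"余": 0.9, "于": 0.4, "渔": 0.7}.get
first, second = two_stage_rank(probs, score, n=3)
# first  → ["余", "于", "渔"] (by generation probability)
# second → ["余", "渔", "于"] (re-ranked by replacement score)
```

Note that, as the text below explains, the two sets contain the same characters but may order them differently.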
In this embodiment, N is a positive integer, and the value of N may be customized by a user, such as setting to 30 or 50, and the application is not limited thereto.
The initial semantic replacement character is any one semantic replacement character generated based on the target vocabulary.
The first semantic replacement character set is determined based on the generation probabilities of the initial semantic replacement characters, and its characters can be the initial semantic replacement characters ordered by generation probability from high to low; the second semantic replacement character set is determined based on the replacement scores of the first semantic replacement characters, and its characters can be the first semantic replacement characters ordered by replacement score from high to low.
The generation probability is the probability of generating a given initial semantic replacement character at the target error position when using the target vocabulary; for example, when replacing the "鱼" in "鱼杭区" based on the target vocabulary, the probability of generating "余" is a and the probability of generating another candidate is b, with a > b.
It should be noted that the target vocabulary in this embodiment is a common vocabulary, and certainly, in other embodiments, the target vocabulary may also be used for customization, which is not limited in this application.
In some embodiments, a Bidirectional Encoder Representations from Transformers (Bert) model may be employed to generate the recall score for the initial semantic replacement character, and the recall score is taken as the generation probability.
The replacement score is used to characterize the accuracy of the resulting new word in the entire target text after the first semantic replacement character replaces the target character.
In some embodiments, a perplexity score (ppl) may be employed to represent the replacement score.
It will be appreciated that ppl measures how well a language model predicts a string from the conditional probability of each character. For a string S of length n, the formula may be:

ppl(S) = p(w1, w2, …, wn)^(−1/n) = ( ∏_{i=1}^{n} p(wi | w1, …, w(i−1)) )^(−1/n)

where p(wi | w1, …, w(i−1)) is the conditional probability of the ith character and wi denotes the ith character in the string S. For example, after the erroneous character in the target text is replaced by "余", the probability that the resulting text "welcome to 余杭区 (Yuhang District)" appears is estimated, yielding a ppl-based score c; after it is replaced by another candidate, the corresponding score is d, and c > d.
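As an illustration of the formula above, the perplexity of a string can be computed directly from its per-character conditional probabilities (a minimal sketch; the probability values are invented for illustration):

```python
import math

def ppl(cond_probs):
    """Perplexity from per-character conditional probabilities:
    ppl(S) = (prod_i p(w_i | w_1..w_{i-1}))^(-1/n)."""
    n = len(cond_probs)
    log_sum = sum(math.log(p) for p in cond_probs)
    return math.exp(-log_sum / n)

# A fluent replacement yields higher conditional probabilities at each
# position, hence a lower perplexity, than a disfluent one.
fluent = ppl([0.5, 0.4, 0.6])
disfluent = ppl([0.5, 0.05, 0.6])
assert fluent < disfluent
```

Working in log space, as above, avoids numeric underflow for long strings.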
In the actual execution process, after the plurality of initial semantic replacement characters and the generation probability and replacement score of each have been obtained, the initial semantic replacement characters can be sorted by generation probability from high to low to obtain the top-N characters, yielding the first semantic replacement set; these N characters are then reordered by replacement score to obtain the second semantic replacement set.
For example, the initial semantic replacement characters whose generation probabilities rank in the top 50 are selected to generate the first semantic replacement character set; the initial semantic replacement characters in this set are the first semantic replacement characters.
The first semantic replacement characters are then reordered based on their replacement scores to generate the second semantic replacement character set; the characters in this set are the second semantic replacement characters.
It can be understood that, for the same initial semantic replacement character, the generation probability and the replacement score corresponding to the same initial semantic replacement character may not be the same, and for any two initial semantic replacement characters, it is possible that the generation probability corresponding to one of the initial semantic replacement characters is higher than the generation probability corresponding to the other initial semantic replacement character, but the replacement score corresponding to the one of the initial semantic replacement characters is lower than the replacement score corresponding to the other initial semantic replacement character.
Then, the initial semantic replacement characters in the first semantic replacement character set and the second semantic replacement character set are the same, but the ordering of the same initial semantic replacement character in the first semantic replacement character set and the second semantic replacement character set may not be completely the same.
For example, as shown in fig. 3, in the actual execution process, for each error position in Candidate, the corresponding characters may be replaced in turn, from front to back, with the [MASK] identifier; a Bert model is then used to obtain the probability of generating every vocabulary character at the target error position, yielding a plurality of initial semantic replacement characters and their generation probabilities; finally, the initial semantic replacement characters with the highest generation probabilities are selected as the semantically similar first semantic replacement characters, e.g. the top-50 characters. The specific flow is as follows.
1) For each pos belonging to Candidate, replace the character at pos with [MASK], and record the new text as NewErrorSentence, where pos represents the position of the character in the target text and NewErrorSentence is the new text obtained after the character at each error position is replaced with the [MASK] identifier;
2) modeling the error position marked with the [ MASK ] identification through Bert to obtain a semantic representation h, wherein h is a set of all semantic information possibly generated at the error position in the target text;
3) processing h with a Classifier layer to obtain the probability Vocab Pro of the semantically similar initial semantic replacement characters corresponding to the target error position over the target vocabulary, where Vocab Pro is the generation probability (i.e., the recall score) of the initial semantic replacement characters generated from the target vocabulary.
4) Take the Top-50 entries with the highest probabilities in Vocab Pro as the candidate set of semantically similar first semantic replacement characters at the target error position, recorded as:
wordCandidate=[(Bert1,BertScore1),…,(Bert50,BertScore50)]
where wordCandidate is the first semantic replacement character set corresponding to the target character, Bert1 is the 1st first semantic replacement character corresponding to the target character, Bert50 is the 50th first semantic replacement character corresponding to the target character, BertScore1 is the generation probability of the 1st first semantic replacement character, and BertScore50 is the generation probability of the 50th first semantic replacement character.
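Steps 1) to 4) can be sketched as follows (a minimal illustration: `vocab_dist` is a hypothetical stand-in for the Bert model plus classification layer, and the toy vocabulary probabilities are invented):

```python
MASK = "[MASK]"

def mask_position(text, pos):
    """Step 1: replace the character at the error position with [MASK]."""
    return text[:pos] + MASK + text[pos + 1:]

def word_candidate(text, pos, vocab_dist, k=50):
    """Steps 2-4: score the vocabulary at the masked position and keep Top-k
    (char, probability) pairs. `vocab_dist` stands in for Bert + Classifier."""
    masked = mask_position(text, pos)
    probs = vocab_dist(masked)  # Vocab Pro: {char: generation probability}
    return sorted(probs.items(), key=lambda kv: -kv[1])[:k]

# Hypothetical distribution for the masked "鱼杭区" example:
dist = lambda masked: {"余": 0.52, "于": 0.21, "雨": 0.09, "鱼": 0.02}
cands = word_candidate("鱼杭区", 0, dist, k=3)
# → [("余", 0.52), ("于", 0.21), ("雨", 0.09)]
```

The resulting list of (character, score) pairs matches the wordCandidate notation above.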
In this embodiment, the generating probability corresponding to the initial semantic replacement character, that is, the recall score, is calculated by the BERT recall function, which has higher accuracy.
In some embodiments, the first semantic replacement characters in the first semantic replacement character set may be sorted from high to low according to the generation probability, and recorded as a bert channel.
FIG. 4 shows an error correction example for a target text meaning "the weather is really good today" that contains an erroneous final character, where the top few first semantic replacement characters in the bert channel include "good", "bar", and the like.
After obtaining wordCandidate, for each first semantic replacement character in wordCandidate, a language model is used to calculate the ppl score after that character replaces the target character, and wordCandidate is reordered based on the ppl scores to obtain pplCandidate, the second semantic replacement character set. The specific flow is as follows.
1) The formula:

ppl(S) = ( ∏_{i=1}^{n} p(wi | w1, …, w(i−1)) )^(−1/n)

can be used to calculate the ppl score after each first semantic replacement character in wordCandidate replaces the target character at the target error position, where ppl is the replacement score (i.e., the semantic score), p(wi | w1, …, w(i−1)) is the ngram conditional probability, wi is the character at the ith position in the target text, and n is the length of the target text.
Thus, pplCandidate is obtained, written as:
pplCandidate=[(ppl1,pplScore1),…,(ppl50,pplScore50)]
where pplCandidate is the second semantic replacement character set corresponding to the target character, ppl1 is the 1st second semantic replacement character corresponding to the target character, ppl50 is the 50th second semantic replacement character corresponding to the target character, pplScore1 is the replacement score of the 1st second semantic replacement character, and pplScore50 is the replacement score of the 50th second semantic replacement character.
In some embodiments, the second semantic replacement characters in the second semantic replacement character set may also be sorted from high to low by the replacement score, denoted as a ppl channel.
With continued reference to FIG. 4, wherein the top few second semantic replacement characters in the ppl channel are: "good" and "hot", etc.
In some embodiments, the nearness score may be represented using an edit distance, with a smaller edit distance representing a higher nearness score.
Based on the pinyin of the target character at the target error position, such as hao, the edit distance (i.e., the degree of phonetic nearness) to the pronunciation of each character in the target vocabulary is calculated in turn, and the first 50 characters with the smallest edit distance are obtained and recorded as yinjinCandidate:
yinjinCandidate=[(PY1,PyScore1),…,(PY50,PyScore50)]
wherein yinjinCandidate is the phonetic-near replacement character set, PY1 is the first phonetic-near replacement character corresponding to the target character, PY50 is the 50th phonetic-near replacement character corresponding to the target character, PyScore1 is the edit distance corresponding to the first phonetic-near replacement character, and PyScore50 is the edit distance corresponding to the 50th phonetic-near replacement character.
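A minimal sketch of this step, assuming a dict-based vocabulary that maps each character to its pinyin (the vocabulary contents below are illustrative only); the same edit-distance routine also applies to the stroke sequences used by the shape-near channel:

```python
def edit_distance(a, b):
    # Levenshtein distance between two strings (pinyin or stroke sequences).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def top_k_nearest(target_pinyin, vocab_pinyin, k=50):
    # Select the k characters whose pronunciation is closest to the target's.
    ranked = sorted(vocab_pinyin.items(),
                    key=lambda kv: edit_distance(target_pinyin, kv[1]))
    return [(ch, edit_distance(target_pinyin, py)) for ch, py in ranked[:k]]
```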
In some embodiments, the phonetic-near replacement characters in the phonetic-near replacement character set may be sorted from high to low by the phonetic-near score, and recorded as the PY channel.
With continued reference to FIG. 4, the top-ranked phonetic-near replacement characters in the PY channel are "good", "bad", and the like.
In some embodiments, the shape-near score may be represented using an edit distance, with a smaller edit distance representing a higher shape-near score.
Based on the glyph of the target character at the target error position, for example the stroke sequence 4413121251, where each digit represents a stroke type, the edit distance to the glyph of each character in the target vocabulary is calculated in turn, and the first 50 characters with the smallest edit distance are obtained and recorded as xingjinCandidate:
xingjinCandidate=[(ZX1,ZxScore1),…,(ZX50,ZxScore50)]
wherein xingjinCandidate is the shape-near replacement character set, ZX1 is the first shape-near replacement character corresponding to the target character, ZX50 is the 50th shape-near replacement character corresponding to the target character, ZxScore1 is the edit distance corresponding to the first shape-near replacement character, and ZxScore50 is the edit distance corresponding to the 50th shape-near replacement character.
In some embodiments, the shape near replacement characters in the shape near replacement character set may be sorted from high to low by shape near score, and recorded as a ZX channel.
With continued reference to FIG. 4, the top-ranked shape-near replacement characters in the ZX channel are "poor", "high", and the like.
In this step, on the one hand, the semantic replacement characters are ranked and selected in terms of both generation probability and replacement score, which better supports error correction of entity words. On the other hand, generating a replacement character set for each of multiple dimensions, such as semantics (entities), phonetic nearness, and shape nearness, avoids the incompleteness that results from a single calculation dimension, and improves the accuracy and comprehensiveness of the target replacement character set finally generated in the subsequent process.
And step 130, generating a target replacement character set corresponding to the target character based on the semantic replacement character set, the phonetic replacement character set and the shape replacement character set.
In this step, the target replacement character set is the replacement character set that is finally generated and displayed.
Respectively selecting one or more replacing characters from the semantic replacing character set, the phonetic replacing character set and the shape replacing character set, generating a final replacing character set based on all the extracted replacing characters, and outputting the final replacing character set as an error correcting candidate set for error correction.
It can be understood that, in the case that the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set, one or more replacement characters are respectively selected from the first semantic replacement character set, the second semantic replacement character set, the phonetic-near replacement character set, and the shape-near replacement character set, a final replacement character set is generated based on all the extracted replacement characters, and the final replacement character set is output as an error correction candidate set for error correction.
In the actual execution process, the terminal outputs and displays the target replacement character set for the user to select, the user clicks the target replacement character in the target replacement character set, the terminal responds to the input of the user, determines the target replacement character as a final replacement character, replaces the target character at the target position in the target text with the target replacement character, and updates the target text, so that text error correction is completed.
A specific implementation of step 130 is described below.
In some embodiments, step 130 may include:
extracting target semantic replacing characters from the semantic replacing character set, extracting target phonetic replacing characters from the phonetic replacing character set, and extracting target shape replacing characters from the shape replacing character set;
generating a target feature fusion vector based on target replacement characters, wherein the target replacement characters comprise target semantic replacement characters, target phonetic replacement characters and target shape replacement characters;
and generating a target replacement character set corresponding to the target character based on the target feature fusion vector.
In this embodiment, the semantic replacement character set may be any one or more of the first semantic replacement character set and the second semantic replacement character set.
Under the condition that the semantic replacement character set is a first semantic replacement character set, the target semantic replacement character is a target first semantic replacement character; under the condition that the semantic replacement character set is a second semantic replacement character set, the target semantic replacement character is a target second semantic replacement character; in the case where the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set, the target semantic replacement character includes a target first semantic replacement character and a target second semantic replacement character.
The target replacing characters comprise replacing characters extracted from a semantic replacing character set, a phonetic near replacing character set and a shape near replacing character set respectively; that is, the target replacement characters include a target semantic replacement character, a target phonetic replacement character, and a target shape replacement character.
Generating a target feature fusion vector based on the target replacement character may be expressed as generating a target feature fusion vector based on the target semantic replacement character, the target phonetic replacement character, and the target shape replacement character.
The following describes this embodiment by taking an example in which the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set.
In some embodiments, step 130 may include:
extracting a target first semantic replacement character from the first semantic replacement character set, extracting a target second semantic replacement character from the second semantic replacement character set, extracting a target phonetic near replacement character from the phonetic near replacement character set, and extracting a target shape near replacement character from the shape near replacement character set;
generating a target feature fusion vector based on the target first semantic replacing character, the target second semantic replacing character, the target phonetic near replacing character and the target shape near replacing character;
and generating a target replacement character set corresponding to the target character based on the target feature fusion vector.
In this embodiment, the target first semantic replacement character is at least one first semantic replacement character extracted from the first semantic replacement character set and having the highest generation probability, the target second semantic replacement character is at least one second semantic replacement character extracted from the second semantic replacement character set and having the highest replacement score, the target phonetic-near replacement character is at least one phonetic-near replacement character extracted from the phonetic-near replacement character set and having the highest phonetic-near score, and the target shape-near replacement character is at least one shape-near replacement character extracted from the shape-near replacement character set and having the highest shape-near score.
It is to be understood that one or more top-ranked replacement characters may be extracted from each of the replacement character sets, such as 2 or 3 top-ranked replacement characters from each of the replacement character sets, respectively.
This embodiment will be described below by taking as an example the case where 2 top-ranked alternative characters are extracted from each alternative character set.
For example, the top-2 replacement characters in each of the Bert channel, the ppl channel, the PY channel, and the ZX channel are selected to form the final mixed candidate set, obtaining:
Top8=[Bert1,Bert2,ppl1,ppl2,PY1,PY2,ZX1,ZX2]
wherein Top8 is the final mixed candidate set, Bert1 and Bert2 are the two first semantic replacement characters with the top-2 generation probabilities in the Bert channel, ppl1 and ppl2 are the two second semantic replacement characters with the top-2 replacement scores in the ppl channel, PY1 and PY2 are the two phonetic-near replacement characters with the top-2 phonetic-near scores in the PY channel, and ZX1 and ZX2 are the two shape-near replacement characters with the top-2 shape-near scores in the ZX channel.
With continued reference to fig. 4, the final hybrid candidate set includes: "good", "bar", "good", "hot", "good", "vinasse", "poor" and "high".
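The mixing step can be sketched as follows, under the assumption that each channel is available as a best-first list of (character, score) pairs; the per-channel count of 2 matches the Top8 example above:

```python
def build_mixed_candidates(bert, ppl, py, zx, per_channel=2):
    # Take the top-ranked replacement characters from each of the four
    # channels and concatenate them into the final mixed candidate set.
    top = []
    for channel in (bert, ppl, py, zx):
        top.extend(ch for ch, _ in channel[:per_channel])
    return top  # e.g. [Bert1, Bert2, ppl1, ppl2, PY1, PY2, ZX1, ZX2]
```

Note that duplicates across channels (e.g. "good" surfacing in several channels) are kept here, exactly as in the FIG. 4 example.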
The replacement characters in the final mixed candidate set are fused and otherwise processed to generate the target feature fusion vector. The target feature fusion vector is a vector representing all the replacement characters in the final mixed candidate set and the numerical values corresponding to those replacement characters.
In the step, one or more replacing characters are respectively selected from replacing character sets corresponding to multiple dimensions such as semantics, phonetic approximation, shape approximation and the like to generate the target feature fusion vector, so that the comprehensiveness is higher, and the replacing characters contained in the target replacing character set generated based on the target feature fusion vector are more comprehensive and more accurate.
In some embodiments, generating the target feature fusion vector based on the target replacement character includes:
acquiring a target semantic score of a target semantic replacing character, a target phonetic near score of the target phonetic near replacing character and a target shape near score of the target shape near replacing character;
and generating a target feature fusion vector based on the target semantic replacing characters, the target semantic scores, the target phonetic approximation replacing characters, the target phonetic approximation scores, the target shape approximation replacing characters and the target shape approximation scores.
In this embodiment, the target replacement characters include a target semantic replacement character, a target phonetic replacement character, and a target shape replacement character.
It is to be appreciated that the target semantic replacement character can include at least one of a target first semantic replacement character and a target second semantic replacement character.
Under the condition that the semantic replacement character set is a first semantic replacement character set, the target semantic replacement character is a target first semantic replacement character; under the condition that the semantic replacement character set is a second semantic replacement character set, the target semantic replacement character is a target second semantic replacement character; in the case where the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set, the target semantic replacement character includes a target first semantic replacement character and a target second semantic replacement character.
The target semantic score includes at least one of a target generation probability and a target replacement score, and the target semantic score has a correspondence with the target semantic replacement character.
For example, in the case where the target semantic replacement character is the target first semantic replacement character, the target semantic score is the target generation probability; and under the condition that the target semantic replacing character is a target second semantic replacing character, the target semantic score is a target replacing score.
This embodiment will be described below by taking an example in which the target semantic replacement character includes a target first semantic replacement character and a target second semantic replacement character.
In some embodiments, generating the target feature fusion vector based on the target replacement character may include:
acquiring a target generation probability of the target first semantic replacement character, a target replacement score of the target second semantic replacement character, a target phonetic-near score of the target phonetic-near replacement character, and a target shape-near score of the target shape-near replacement character;
and generating a target feature fusion vector based on the target first semantic replacing character, the target generating probability, the target second semantic replacing character, the target replacing score, the target phonetic approximation replacing character, the target phonetic approximation score, the target shape approximation replacing character and the target shape approximation score.
In this embodiment, the target generation probability is the generation probability of the target first semantic replacement character, the target replacement score is the replacement score of the target second semantic replacement character, the target phonetic-near score is the phonetic-near score of the target phonetic-near replacement character, and the target shape-near score is the shape-near score of the target shape-near replacement character.
This embodiment will be described by taking the above Top8 set as an example.
After Top8 = [Bert1, Bert2, ppl1, ppl2, PY1, PY2, ZX1, ZX2] is obtained, the id of each character in Top8 in the target vocabulary vocab is found; as shown in fig. 4, the id of "good" in vocab is 27 and the id of "bar" in vocab is 50. The sorting position of each replacement character is then obtained in increasing order of id, which may specifically be obtained by the formula:
index=getSortIndex([getVocabIndex(w) for w in Top8], word), word∈Top8

wherein word is a replacement character in Top8, w represents any replacement character in Top8, the getVocabIndex function obtains the id of w in the target vocabulary, the getSortIndex function obtains the sorting position of word among all the replacement characters in Top8, and index represents that sorting position.
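A sketch of this id alignment, modeling getVocabIndex as a dict lookup and getSortIndex as a rank computation (the dict-based vocabulary is an assumption for illustration):

```python
def sort_indices(top8, vocab_ids):
    # For each replacement character, compute its rank when the characters in
    # Top8 are ordered by increasing vocabulary id.
    order = sorted(range(len(top8)), key=lambda i: vocab_ids[top8[i]])
    index = [0] * len(top8)
    for rank, i in enumerate(order):
        index[i] = rank  # position of top8[i] in the id-sorted sequence
    return index
```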
As shown in FIG. 4, the replacement characters "good", "hot", "bad", "high", and the like are sorted in increasing order of their ids, resulting in the ID Sort sequence.
For the correct character, a one-hot vector with a 1 at its sorting position index is declared as the prediction target of the model, which can be specifically represented by the formula:
label=[0,…,1index,…0]
where label is a vector of a target length, for example a vector of length 8, and index represents the sorting position of the correct character in the final mixed candidate set; the value at the index position is 1 and the remaining positions are 0.
As shown in fig. 4, a "good" label is: [1, 0, 0, …, 0 ].
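The one-hot prediction target can be sketched as follows (the default length of 8 is taken from the Top8 example):

```python
def one_hot_label(index, length=8):
    # One-hot prediction target: 1 at the correct character's sort position.
    label = [0] * length
    label[index] = 1
    return label
```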
In this step, a small-vocabulary classification model based on word ID alignment is adopted to score the replacement characters in the mixed replacement character set, which yields high accuracy.
Then, for the replacement characters in Top8 and their scores in the four channels, the target fusion feature vector is constructed by the formula:

feature=[Bert1, Bert2, ppl1, ppl2, PY1, PY2, ZX1, ZX2, BertScore1, BertScore2, pplScore1, pplScore2, PyScore1, PyScore2, ZxScore1, ZxScore2, Bert1==ppl1, Bert1==PY1, Bert1==ZX1, PY1==ppl1, ZX1==ppl1, ZX1==PY1, Max(aisle) for aisle in (wordCandidate, pplCandidate, yinjinCandidate, xingjinCandidate), Min(aisle) for aisle in (wordCandidate, pplCandidate, yinjinCandidate, xingjinCandidate)]

wherein feature is the target fusion feature vector, BertScore is the generation probability corresponding to a first semantic replacement character in the Bert channel, pplScore is the replacement score corresponding to a second semantic replacement character in the ppl channel, PyScore is the phonetic-near score corresponding to a phonetic-near replacement character in the PY channel, and ZxScore is the shape-near score corresponding to a shape-near replacement character in the ZX channel.
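A hedged sketch of assembling such a feature vector; the exact field ordering and the channel data layout (lists of (character, score) pairs) are assumptions made for illustration:

```python
def build_feature(top8_ids, scores, channels):
    # top8_ids: vocab ids of [Bert1, Bert2, ppl1, ppl2, PY1, PY2, ZX1, ZX2]
    # scores:   the eight corresponding channel scores
    # channels: the four full candidate lists as (character, score) pairs
    bert1, ppl1, py1, zx1 = top8_ids[0], top8_ids[2], top8_ids[4], top8_ids[6]
    agreement = [int(bert1 == ppl1), int(bert1 == py1), int(bert1 == zx1),
                 int(py1 == ppl1), int(zx1 == ppl1), int(zx1 == py1)]
    # Max/Min score of every channel, appended as summary statistics.
    extremes = [f(s for _, s in ch) for ch in channels for f in (max, min)]
    return top8_ids + scores + agreement + extremes
```

The six agreement flags capture whether different channels propose the same top character, which is a strong signal that the candidate is correct.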
After the target feature fusion vector is obtained, a target replacement character set corresponding to the target character can be generated based on the target feature fusion vector.
The target replacement character set comprises replacement characters corresponding to the target feature fusion vectors, and all the replacement characters are sorted from high to low according to the final target scores.
For example, based on the final mixed candidate set ["good", "bar", "good", "hot", "good", "vinasse", "poor", "high"], the target replacement character set is finally generated as ["good", "bar", "hot", "vinasse", "poor", "high"].
In this embodiment, one or more replacement characters are respectively selected from replacement character sets corresponding to multiple dimensions, such as semantics, phonetic approximation, shape approximation, and the like, to generate a final mixed candidate set, and a target feature fusion vector is generated based on the generation probability, the replacement score, the phonetic approximation score, and the shape approximation score corresponding to the replacement characters in the mixed candidate set, so that the comprehensiveness and accuracy of the result can be significantly improved.
In some embodiments, generating a target replacement character set corresponding to the target character based on the target feature fusion vector may include:
generating a target score corresponding to the target replacement character based on the target feature fusion vector;
and sequencing the target replacement characters based on the target scores to generate a target replacement character set corresponding to the target characters.
In this embodiment, the target score is used to characterize the probability that the target replacement character is the correct word.
It should be noted that the target score is different from the target semantic score, the target phonetic-near score, and the target shape-near score: the target score is generated by fusing evaluation indexes of multiple dimensions such as semantics, pronunciation, and glyph.
It can be understood that one or more replacing characters are extracted from the semantic replacing character set, the phonetic replacing character set and the shape replacing character set respectively, the replacing characters finally form a new set, namely a final mixed candidate set, and the replacing characters in the final mixed candidate set are target replacing characters.
Each target replacement character in the final mixed candidate set corresponds to a target score.
And sequencing all the target replacement characters in the final mixed candidate set based on the numerical value of the target score, namely sequencing the target semantic replacement characters, the target phonetic replacement characters and the target shape replacement characters based on the target score, so as to generate a target replacement character set corresponding to the target characters.
In the actual execution process, after the final mixed candidate set is obtained, the probability that each target replacement character in the final mixed candidate set is a correct character is respectively calculated based on each target replacement character in the final mixed candidate set, and the target replacement characters in the final mixed candidate set are sequenced according to the probability, so that a target replacement character set is generated.
For example, for the final mixed candidate set ["good", "bar", "good", "hot", "good", "vinasse", "poor", "high"], the target score of "good" is calculated as p1, the target score of "bar" as p2, the target score of "hot" as p3, the target score of "vinasse" as p4, the target score of "poor" as p5, and the target score of "high" as p6, where p1 > p2 > p3 > p4 > p5 > p6.
The replacement characters "good", "bar", "hot", "vinasse", "poor", and "high" are then reordered according to their target scores, and the target replacement character set is generated as ["good", "bar", "hot", "vinasse", "poor", "high"].
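The reordering step can be sketched as follows; deduplication is an assumption made here, since the same character can surface in several channels of the mixed candidate set:

```python
def rank_candidates(candidates, score_fn):
    # Order the mixed candidate set by target score, best first,
    # keeping only the first occurrence of each character.
    seen, ordered = set(), []
    for ch in sorted(candidates, key=score_fn, reverse=True):
        if ch not in seen:
            seen.add(ch)
            ordered.append(ch)
    return ordered
```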
After the target replacement character set is generated, the target replacement character set can be displayed for the user to select, and the user selects one of the target replacement character sets as an error correction result at the target position in the target text.
In the embodiment, the target feature fusion vectors are scored to generate the target scores of all the replacement characters in the mixed candidate set corresponding to the target feature fusion vectors, and the replacement characters are sorted based on the target scores to preferentially display the replacement characters with higher accuracy to the user, so that the time for the user to search for the correct replacement characters is reduced, and the error correction efficiency is improved.
In some embodiments, a machine learning model may be employed to generate a target score for each replacement character in the mixed candidate set corresponding to the target feature fusion vector.
This embodiment will be specifically explained below.
For example, an eXtreme Gradient Boosting (XGBoost) model may be employed as the model for generating the target score.
In the actual execution process, feature is used as the input vector of the XGBoost model, the input vector is input into the XGBoost model, and the XGBoost model outputs the target score. The XGBoost model is obtained by training with sample fusion feature vectors as samples and the sample scores corresponding to those sample fusion feature vectors as sample labels.
In the actual training process, a (feature, label) can be used as training data to train the XGBoost model, where the feature in the (feature, label) is a sample fusion feature vector, and the label is a sample label.
In some embodiments, training data can be constructed in a data enhancement manner for model training, so as to enhance the robustness of the model. This mainly comprises two steps: selecting an error-prone position, and replacing it with an error-prone sample.
For the selection of the error-prone position, the following formula is adopted:

s = s_origin - s_top1

wherein s is the score corresponding to the error-prone position and characterizes the probability of an error at that position, s_origin is the Bert score corresponding to the original character at the current position, and s_top1 is the score corresponding to the character with the highest Bert score at the current position.
It will be appreciated that a smaller score s indicates that the current location is more likely to be replaced. For each sample, the position most likely to be replaced, i.e., the position corresponding to the smallest s, is selected as the "error prone position".
For the replacement with an "error-prone sample", the formula:

w = argmax_{w∈D}(s_w)

can be used, wherein w is the confusion character with the highest Bert score at the current position, D is the confusion set of the replaced position, argmax selects from the set D the confusion character with the highest Bert score as the substitute character, and s_w is the Bert score when the confusion character at the current position is w. That is, the character with the highest Bert score at the "error-prone position", i.e., the character most likely to be confused with the original, replaces the original character.
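The two augmentation steps can be sketched as follows, with an assumed score layout: bert_scores[i] maps each candidate character at position i to its Bert score, and confusion_sets[i] is the confusion set for position i:

```python
def error_prone_position(original, bert_scores):
    # Pick the position minimizing s = s_origin - s_top1
    # (the smaller s is, the more likely the position is to be replaced).
    def s(i):
        return bert_scores[i][original[i]] - max(bert_scores[i].values())
    return min(range(len(original)), key=s)

def corrupt(original, bert_scores, confusion_sets):
    # Replace the error-prone position with its highest-scoring confusion
    # character, producing an augmented (erroneous) training sample.
    i = error_prone_position(original, bert_scores)
    w = max(confusion_sets[i], key=lambda c: bert_scores[i].get(c, float("-inf")))
    return original[:i] + [w] + original[i + 1:]
```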
An enhanced data set can be constructed through the two steps to train the whole model, so that the accuracy of model output is improved.
It should be noted that the target feature fusion vector obtained in each practical application process can be used as a training sample in a subsequent model training process.
In this embodiment, the XGBoost model is adopted to score the target feature fusion vector; it has a strong learning capability, and as the training samples gradually accumulate, the final output becomes more and more accurate.
And 140, correcting the target text based on the target replacing character set.
In this step, after the target replacement character set is generated, it can be displayed for the user to select; the user selects one character from the target replacement character set as the error correction result for the target position in the target text, and the terminal replaces the target character in the target text with the selected target replacement character to complete error correction of the target text.

According to the text error correction method provided by the embodiment of the application, generating the target replacement character set corresponding to the target error position from multiple dimensions improves the accuracy, precision, and comprehensiveness of the target replacement characters it contains. Correcting the target text based on the target replacement character set, and preferentially displaying the target replacement characters with higher accuracy to the user while providing more target replacement characters, helps to improve error correction efficiency.
It is noted that in some embodiments, as shown in fig. 2, before step 110, the method may further include:
receiving a second input of the user;
in response to the second input, an error correction mode is determined.
In this embodiment, the error correction mode includes an automatic error correction mode and a manual error correction mode.
The second input is used to determine an error correction mode.
The second input may be a touch input, a physical key input, a voice input, or a character input of the same kinds as the first input, which is described in the following embodiments and is not repeated here.
In an actual implementation process, the second input may further include a first sub-input and a second sub-input, where the first sub-input is used for displaying the error correction interface, and the second sub-input is used for determining the error correction mode.
For example, the user clicks an error correction target control in the input method for entering an "error correction" interface to achieve the first sub-input. And the terminal responds to the first sub-input, displays an error correction interface and enters an error correction word output mode.
The user clicks a mode selection control for selecting the error correction mode on the error correction interface as shown in fig. 6 to implement the second sub-input, thereby entering the corresponding error correction mode.
If the user clicks the automatic error correction mode control on the error correction interface shown in fig. 6, the terminal determines the error correction mode as the automatic error correction mode; or the user clicks a manual error correction mode control on the error correction interface as shown in fig. 6, the terminal determines the error correction mode as the manual error correction mode.
In the embodiment, by providing multiple error correction modes such as automatic error correction or user-defined error correction and supporting the user to select the error correction mode by himself, the user can select the optimal error correction mode based on the actual situation, thereby obviously improving the flexibility and universality of error correction.
With continued reference to fig. 2, a specific description will be given below of the implementation of step 110 in the embodiment of the present application, from two different implementation perspectives.
Automatically determining the error position of the target
In this embodiment, step 110 may include:
segmenting a target text to generate a word vector set;
calculating the error probability of each character in the word vector set;
and determining the position corresponding to the character as a target error position under the condition that the error probability is greater than the target threshold value.
In this embodiment, the word vector set is a set of vectors, denoted as EerrorSentence, that includes each word or character in the target text.
In actual implementation, the target text may be segmented by words. For example, the target text "today weather is really expensive" may be segmented into word vectors including "today", "weather", "really", and "expensive".
Then EerrorSentence is modeled by using a MacBert language model to obtain a semantic representation H = [H1, …, Hn], where H is the set of semantic representations corresponding to the word vectors obtained by segmenting the target text, and n is the number of those word vectors.
As shown in fig. 5, each position of H is processed by a Multilayer Perceptron (MLP) to obtain a probability Error of whether the character at that position is wrong, where Error ∈ R2; Error[0] represents the probability that the character at the current position is correct (the correct probability), and Error[1] represents the probability that the character at the current position is wrong (the error probability).
When Error[1] > Error[0], that is, when the error probability is greater than the target threshold, the position corresponding to the character is determined as a target error position, and the target error position is added to the error position set, recorded as Candidate.
Wherein the target threshold is not less than 50%.
For example, the terminal respectively calculates the error probabilities of the characters in the word vectors "today", "weather", "really", and "expensive", and determines the position of "expensive" in the target text "today weather is really expensive" as the target error position when the calculated error probability corresponding to "expensive" exceeds the correct probability.
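The thresholding step (Error[1] > Error[0], i.e. a threshold of at least 50%) can be sketched as follows, assuming the MLP head emits a (correct_prob, error_prob) pair per position:

```python
def detect_error_positions(error_probs):
    # Mark every position whose error probability exceeds its correct
    # probability as a target error position (the Candidate set).
    return [i for i, (p_ok, p_err) in enumerate(error_probs) if p_err > p_ok]
```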
In this embodiment, the target error position is detected and determined through the MacBert model, with high accuracy.
Second, the target error position is determined manually by the user
In this embodiment, step 110 may include:
receiving a first input of a target text by a user;
in response to the first input, a target error location is determined.
In this embodiment, the first input is used to determine a target error location.
Wherein, the first input may be at least one of the following modes:
first, the first input may be a touch operation, including but not limited to a click operation, a slide operation, a press operation, and the like.
In this embodiment, the receiving of the first input by the user may be receiving a touch operation of the user on a display area of a display screen of the terminal.
For example, in the interface state of displaying the target text, a target control corresponding to the position of each character in the target text is displayed on the current interface, and the first input can be realized by touching the target control; alternatively, the first input may be set as multiple consecutive taps or a long-press operation on a target position of the display area within a target time interval.
Second, the first input may be a physical key input.
In this embodiment, the body of the terminal is provided with physical keys, or is connected to input devices such as a mouse or a keyboard, for receiving the first input of the user; the first input may be a press of a corresponding physical key, or a combined operation of pressing a plurality of physical keys simultaneously.
Third, the first input may be a voice input.
In this embodiment, upon receiving a voice input such as "the word 'expensive' is wrong", the terminal may determine the position of "expensive" as the target error position.
Of course, in other embodiments, the first input may also be in other forms, including but not limited to character input, and the like, which may be determined according to actual needs, and this is not limited in this application.
In the actual implementation process, after inputting the target text "today the weather is really expensive", the user can realize the first input by long-pressing the position of "expensive" in the target text in the error correction edit box shown in fig. 6.
In response to the first input, the terminal determines the position corresponding to "expensive" as the target error position, thereby realizing a user-defined error position.
The text error correction method provided by the embodiments of the application offers two modes: automatic determination of the target error position, and manual determination by the user. Supporting both automatic and user-defined selection of the error position allows the user to choose the most suitable error correction mode for the actual situation, and avoids the case where the target text contains an error that the terminal fails to recognize automatically and the user has no way to correct it, thereby significantly improving the flexibility and universality of error correction.
In the text error correction method provided by the embodiments of the application, the execution subject may be a text error correction device. In the embodiments of the present application, a text error correction device executing the text error correction method is taken as an example to describe the text error correction device provided by the embodiments of the present application.
The embodiment of the application also provides a text error correction device.
As shown in fig. 7, the text correction apparatus includes: a first determination module 710, a first processing module 720, a second processing module 730, and a third processing module 740.
A first determining module 710, configured to determine a target error location from the target text;
the first processing module 720 is configured to process the target character at the target error position, and generate a semantic replacement character set, a phonetic replacement character set, and a shape replacement character set corresponding to the target character;
the second processing module 730 is configured to generate a target replacement character set corresponding to a target character based on the semantic replacement character set, the phonetic-approximation replacement character set, and the shape-approximation replacement character set;
a third processing module 740, configured to correct errors of the target text based on the target replacement character set;
the replaced characters in the semantic replaced character set are characters with similar semantics with the target characters, the replaced characters in the phonetic close replaced character set are characters with similar pinyin with the target characters, and the replaced characters in the shape close replaced character set are characters with similar font with the target characters.
According to the text error correction device provided by the embodiments of the application, generating the target replacement character set corresponding to the target error position from multiple dimensions can improve the accuracy, precision and comprehensiveness of the target replacement characters contained in the target replacement character set; the target text is corrected based on the target replacement character set, and target replacement characters with higher accuracy are preferentially displayed to the user while more target replacement characters are provided, which is beneficial to improving error correction efficiency.
In some embodiments, the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set, and the first processing module 720 is further configured to:
generating, based on the target word list, a plurality of initial semantic replacement characters corresponding to the target error position and the generation probabilities corresponding to the initial semantic replacement characters;
determining N initial semantic replacement characters based on the generation probability, and sequencing the N initial semantic replacement characters based on the generation probability to generate a first semantic replacement character set;
replacing target characters corresponding to the target error positions with first semantic replacing characters in the first semantic replacing character set respectively;
generating a post-replacement replacement score based on the conditional probability corresponding to the first semantic replacement character;
and sequencing the first semantic replacement characters in the first semantic replacement character set based on the replacement scores to generate a second semantic replacement character set.
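The two-stage ranking described above (keep the top-N candidates by generation probability, then re-rank them by a conditional-probability-based replacement score) can be sketched as follows. The candidate words, probabilities, and the `cond_prob_fn` stand-in are all invented for illustration and are not taken from the patent.

```python
def build_semantic_candidates(gen_probs, cond_prob_fn, n=5):
    """Two-stage semantic candidate generation.

    gen_probs: dict mapping candidate -> generation probability
               (e.g. a masked-LM distribution over the target word list).
    cond_prob_fn: callable scoring the sentence after substituting a
                  candidate (stand-in for the conditional probability).
    Returns (first_set, second_set): the top-n candidates sorted by
    generation probability, then re-ranked by replacement score.
    """
    # Stage 1: keep the N most probable candidates (first semantic set)
    first_set = sorted(gen_probs, key=gen_probs.get, reverse=True)[:n]
    # Stage 2: substitute each candidate at the target error position and
    # re-rank by the replacement score (second semantic set)
    scores = {c: cond_prob_fn(c) for c in first_set}
    second_set = sorted(first_set, key=scores.get, reverse=True)
    return first_set, second_set

# Hypothetical numbers: the masked-LM distribution over candidates...
gen = {"good": 0.40, "hot": 0.25, "cold": 0.15, "dry": 0.12, "expensive": 0.08}
# ...and a conditional model that prefers "hot" once substituted in context
cond = {"good": 0.3, "hot": 0.9, "cold": 0.2, "dry": 0.1, "expensive": 0.05}
first, second = build_semantic_candidates(gen, cond.get, n=3)
# first  -> ["good", "hot", "cold"]   (sorted by generation probability)
# second -> ["hot", "good", "cold"]   (re-ranked by replacement score)
```

The second stage is what lets a candidate that merely looks probable in isolation be demoted below one that actually fits the whole sentence.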
In some embodiments, the second processing module 730 may further be configured to:
extracting target semantic replacing characters from the semantic replacing character set, extracting target phonetic replacing characters from the phonetic replacing character set, and extracting target shape replacing characters from the shape replacing character set;
generating a target feature fusion vector based on target replacement characters, wherein the target replacement characters comprise target semantic replacement characters, target phonetic replacement characters and target shape replacement characters;
and generating a target replacement character set corresponding to the target character based on the target feature fusion vector.
In some embodiments, the second processing module 730 may further be configured to:
acquiring a target semantic score of a target semantic replacing character, a target phonetic near score of the target phonetic near replacing character and a target shape near score of the target shape near replacing character;
and generating a target feature fusion vector based on the target semantic replacing characters, the target semantic scores, the target phonetic approximation replacing characters, the target phonetic approximation scores, the target shape approximation replacing characters and the target shape approximation scores.
In some embodiments, the second processing module 730 may further be configured to:
generating a target score corresponding to the target replacement character based on the target feature fusion vector;
and sequencing the target replacement characters based on the target scores to generate a target replacement character set corresponding to the target characters.
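The fusion-and-ranking steps handled by the second processing module can be sketched as follows: each candidate drawn from the semantic, phonetic-near and shape-near sets contributes its character embedding and score to a feature fusion vector, from which a target score is produced and used to sort the final set. The scoring layer here is a simple stand-in for whatever learned model produces the target scores, and all candidate characters and numbers are hypothetical.

```python
import numpy as np

def fuse_and_rank(candidates, embed, d=4):
    """candidates: (char, source, score) triples pooled from the semantic,
    phonetic-near and shape-near replacement sets; embed: char -> d-dim vector.
    Builds a feature fusion vector [embedding ; score] per candidate and ranks
    the candidates by a mock scoring head (a learned layer in practice)."""
    w = np.ones(d + 1)  # stand-in for learned scoring weights
    scored = []
    for char, source, score in candidates:
        fused = np.concatenate([embed(char), [score]])   # target feature fusion vector
        scored.append((char, source, float(fused @ w)))  # target score
    # Higher target score first -> the target replacement character set
    return sorted(scored, key=lambda t: t[2], reverse=True)

# Hypothetical candidates; zero embeddings, so the ranking follows the scores
cands = [("good", "semantic", 0.9), ("ghost", "phonetic", 0.4), ("goad", "shape", 0.6)]
ranked = fuse_and_rank(cands, lambda c: np.zeros(4))
# ranked chars -> ["good", "goad", "ghost"]
```

Fusing the per-dimension score into the vector is what allows a single head to trade off semantic, phonetic and glyph evidence rather than ranking each set separately.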
In some embodiments, the apparatus may further comprise:
the fourth processing module is used for segmenting the target text to generate a word vector set;
the fifth processing module is used for calculating the error probability of each character in the word vector set;
the first determining module 710 is further configured to determine a position corresponding to the character as a target error position if the error probability is greater than the target threshold.
In some embodiments, the apparatus may further comprise:
the first receiving module is used for receiving a first input of a target text by a user;
the first determining module 710 is further configured to determine a target error location in response to the first input.
The text error correction device in the embodiments of the present application may be an electronic device, or may be a component in an electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. The electronic device may be, for example, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a Mobile Internet Device (MID), an Augmented Reality (AR)/Virtual Reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and may also be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine, a self-service machine, and the like; the embodiments of the present application are not particularly limited in this respect.
The text correction device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android (Android) operating system, an IOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.
The text error correction device provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 6, and is not described here again to avoid repetition.
Optionally, as shown in fig. 8, an electronic device 800 is further provided in this embodiment of the present application, and includes a processor 801, a memory 802, and a program or an instruction stored in the memory 802 and executable on the processor 801, where the program or the instruction is executed by the processor 801 to implement each process of the text error correction method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 9 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 900 includes, but is not limited to: a radio frequency unit 901, a network module 902, an audio output unit 903, an input unit 904, a sensor 905, a display unit 906, a user input unit 907, an interface unit 908, a memory 909, and a processor 910.
Those skilled in the art will appreciate that the electronic device 900 may further include a power source (e.g., a battery) for supplying power to the various components; the power source may be logically connected to the processor 910 through a power management system, so as to manage charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 9 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than those shown, combine some components, or arrange the components differently; the description is not repeated here.
Wherein the processor 910 is configured to:
determining a target error position from the target text;
processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character;
generating a target replacement character set corresponding to the target character based on the semantic replacement character set, the phonetic replacement character set and the shape replacement character set;
correcting the target text based on the target replacement character set;
the replaced characters in the semantic replaced character set are characters with similar semantics with the target characters, the replaced characters in the phonetic close replaced character set are characters with similar pinyin with the target characters, and the replaced characters in the shape close replaced character set are characters with similar font with the target characters.
According to the electronic device provided by the embodiments of the application, generating the target replacement character set corresponding to the target error position from multiple dimensions can improve the accuracy, precision and comprehensiveness of the target replacement characters contained in the target replacement character set; the target text is corrected based on the target replacement character set, and target replacement characters with higher accuracy are preferentially displayed to the user while more target replacement characters are provided, which is beneficial to improving error correction efficiency.
Optionally, the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set, and processor 910 is further configured to:
generating, based on the target word list, a plurality of initial semantic replacement characters corresponding to the target error position and the generation probabilities corresponding to the initial semantic replacement characters;
determining N initial semantic replacement characters based on the generation probability, sequencing the N initial semantic replacement characters based on the generation probability, and generating a first semantic replacement character set;
replacing target characters corresponding to the target error positions with first semantic replacing characters in the first semantic replacing character set respectively;
generating a post-replacement replacement score based on the conditional probability corresponding to the first semantic replacement character;
and sequencing the first semantic replacement characters in the first semantic replacement character set based on the replacement scores to generate a second semantic replacement character set.
Optionally, the processor 910 may further be configured to:
extracting target semantic replacing characters from the semantic replacing character set, extracting target phonetic replacing characters from the phonetic replacing character set, and extracting target shape replacing characters from the shape replacing character set;
generating a target feature fusion vector based on target replacement characters, wherein the target replacement characters comprise target semantic replacement characters, target phonetic replacement characters and target shape replacement characters;
and generating a target replacement character set corresponding to the target character based on the target feature fusion vector.
Optionally, the processor 910 may further be configured to:
acquiring a target semantic score of a target semantic replacing character, a target phonetic near score of the target phonetic near replacing character and a target shape near score of the target shape near replacing character;
and generating a target feature fusion vector based on the target semantic replacing characters, the target semantic scores, the target phonetic approximation replacing characters, the target phonetic approximation scores, the target shape approximation replacing characters and the target shape approximation scores.
Optionally, the processor 910 may further be configured to:
generating a target score corresponding to the target replacement character based on the target feature fusion vector;
and sequencing the target replacement characters based on the target scores to generate a target replacement character set corresponding to the target characters.
Optionally, the processor 910 may further be configured to:
segmenting a target text to generate a word vector set;
calculating the error probability of each character in the word vector set;
and determining the position corresponding to the character as a target error position under the condition that the error probability is greater than the target threshold value.
Alternatively,
a user input unit 907 for receiving a first input of a target text by a user;
the processor 910 may also be configured to determine a target error location in response to the first input.
It should be understood that, in the embodiments of the present application, the input unit 904 may include a Graphics Processing Unit (GPU) 9041 and a microphone 9042; the graphics processing unit 9041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 906 may include a display panel 9061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 907 includes at least one of a touch panel 9071 and other input devices 9072. The touch panel 9071, also referred to as a touch screen, may include two parts: a touch detection device and a touch controller. Other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail here.
The memory 909 may be used to store software programs as well as various data. The memory 909 may mainly include a first storage area storing programs or instructions and a second storage area storing data, wherein the first storage area may store an operating system and the application programs or instructions (such as a sound playing function, an image playing function, and the like) required for at least one function. Further, the memory 909 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), a Static RAM (SRAM), a Dynamic RAM (DRAM), a Synchronous DRAM (SDRAM), a Double Data Rate SDRAM (DDR SDRAM), an Enhanced SDRAM (ESDRAM), a Synchronous Link DRAM (SLDRAM), or a Direct Rambus RAM (DRRAM). The memory 909 in the embodiments of the present application includes, but is not limited to, these and any other suitable types of memory.
Processor 910 may include one or more processing units; optionally, the processor 910 integrates an application processor, which primarily handles operations involving the operating system, user interface, and applications, and a modem processor, which primarily handles wireless communication signals, such as a baseband processor. It is to be appreciated that the modem processor described above may not be integrated into processor 910.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the text error correction method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a computer read only memory ROM, a random access memory RAM, a magnetic or optical disk, and the like.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the text error correction method embodiment, and can achieve the same technical effect, and the details are not repeated here to avoid repetition.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as system-on-chip, system-on-chip or system-on-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A text error correction method, comprising:
determining a target error position from the target text;
processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character;
generating a target replacement character set corresponding to the target character based on the semantic replacement character set, the phonetic replacement character set and the shape replacement character set;
correcting the target text based on the target replacement character set;
the alternative characters in the semantic alternative character set are characters with similar semantics to the target characters, the alternative characters in the phonetic alternative character set are characters with similar pinyin to the target characters, and the alternative characters in the shape similar alternative character set are characters with similar font to the target characters.
2. The text error correction method according to claim 1, wherein the semantic replacement character set includes a first semantic replacement character set and a second semantic replacement character set, and the processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set, and a shape-similar replacement character set corresponding to the target character comprises:
generating a plurality of initial semantic replacing characters corresponding to the target error position and a generating probability corresponding to the initial semantic replacing characters based on a target word list;
determining N initial semantic replacement characters based on the generation probability, and sequencing the N initial semantic replacement characters based on the generation probability to generate the first semantic replacement character set;
replacing the target characters corresponding to the target error positions with first semantic replacing characters in the first semantic replacing character set respectively;
generating a replaced replacement score based on the conditional probability corresponding to the first semantic replacement character;
and sequencing the first semantic replacement characters in the first semantic replacement character set based on the replacement scores to generate the second semantic replacement character set.
3. The text error correction method according to claim 1, wherein generating a target replacement character set corresponding to the target character based on the semantic replacement character set, the phonetic replacement character set and the shape-approximate replacement character set comprises:
extracting target semantic replacing characters from the semantic replacing character set, extracting target phonetic replacing characters from the phonetic replacing character set, and extracting target shape near replacing characters from the shape near replacing character set;
generating a target feature fusion vector based on target replacement characters, wherein the target replacement characters comprise the target semantic replacement characters, the target phonetic replacement characters and the target shape replacement characters;
and generating a target replacement character set corresponding to the target character based on the target feature fusion vector.
4. The text error correction method of claim 3 wherein generating the target feature fusion vector based on the target replacement character comprises:
acquiring a target semantic score of the target semantic replacement character, a target phonetic score of the target phonetic replacement character and a target shape score of the target shape near replacement character;
generating the target feature fusion vector based on the target semantic replacement character, the target semantic score, the target phonetic near replacement character, the target phonetic near score, the target shape near replacement character and the target shape near score.
5. The method for correcting text errors according to claim 3, wherein the generating a target replacement character set corresponding to the target character based on the target feature fusion vector comprises:
generating a target score corresponding to the target replacement character based on the target feature fusion vector;
and sequencing the target replacement characters based on the target scores to generate a target replacement character set corresponding to the target characters.
6. The text error correction method of any one of claims 1-5, wherein the determining the target error location from the target text comprises:
segmenting the target text to generate a word vector set;
calculating an error probability for each character in the set of word vectors;
determining the position corresponding to the character as the target error position under the condition that the error probability is greater than a target threshold value;
or,
receiving a first input of the target text by a user;
in response to the first input, determining the target error location.
7. A text correction apparatus, comprising:
the first determining module is used for determining a target error position from the target text;
the first processing module is used for processing the target character at the target error position to generate a semantic replacement character set, a phonetic replacement character set and a shape replacement character set corresponding to the target character;
the second processing module is used for generating a target replacing character set corresponding to the target character based on the semantic replacing character set, the phonetic near replacing character set and the shape near replacing character set;
a third processing module, configured to correct the error of the target text based on the target replacement character set;
the alternative characters in the semantic alternative character set are characters with similar semantics to the target characters, the alternative characters in the phonetic alternative character set are characters with similar pinyin to the target characters, and the alternative characters in the shape similar alternative character set are characters with similar font to the target characters.
8. The text correction device of claim 7, wherein the semantic replacement character set comprises a first semantic replacement character set and a second semantic replacement character set, and the first processing module is further configured to:
generating a plurality of initial semantic replacing characters corresponding to the target error position and a generating probability corresponding to the initial semantic replacing characters based on a target word list;
determining N initial semantic replacement characters based on the generation probability, and sequencing the N initial semantic replacement characters based on the generation probability to generate the first semantic replacement character set;
replacing the target characters corresponding to the target error positions with first semantic replacing characters in the first semantic replacing character set respectively;
generating a replaced replacement score based on the conditional probability corresponding to the first semantic replacement character;
and sequencing the first semantic replacement characters in the first semantic replacement character set based on the replacement scores to generate the second semantic replacement character set.
9. The text error correction device of claim 7, wherein the second processing module is further configured to:
extract a target semantic replacement character from the semantic replacement character set, a target phonetic replacement character from the phonetic replacement character set, and a target shape-similar replacement character from the shape-similar replacement character set;
generate a target feature fusion vector based on target replacement characters, wherein the target replacement characters comprise the target semantic replacement character, the target phonetic replacement character, and the target shape-similar replacement character; and
generate a target replacement character set corresponding to the target character based on the target feature fusion vector.
10. The text error correction device of claim 9, wherein the second processing module is further configured to:
obtain a target semantic score of the target semantic replacement character, a target phonetic score of the target phonetic replacement character, and a target shape-similarity score of the target shape-similar replacement character; and
generate the target feature fusion vector based on the target semantic replacement character, the target semantic score, the target phonetic replacement character, the target phonetic score, the target shape-similar replacement character, and the target shape-similarity score.
11. The text error correction device of claim 9, wherein the second processing module is further configured to:
generate, based on the target feature fusion vector, a target score corresponding to each target replacement character, namely a target score corresponding to the target semantic replacement character, a target score corresponding to the target phonetic replacement character, and a target score corresponding to the target shape-similar replacement character; and
rank the target replacement characters by target score to generate the target replacement character set corresponding to the target character.
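Claims 9 to 11 describe fusing the three candidate types, together with their scores, into one feature vector and ranking the candidates by scores derived from it. A minimal sketch, assuming toy one-hot character embeddings and a plain dot-product scorer in place of whatever learned model the application contemplates:

```python
def embed(ch, vocab):
    """Hypothetical character embedding: one-hot over a small vocabulary."""
    return [1.0 if v == ch else 0.0 for v in vocab]

def rank_candidates(candidates, vocab, weights):
    """candidates: list of (char, score) pairs for the semantic, phonetic,
    and shape-similar candidates; weights: hypothetical learned scorer."""
    # Claims 9-10: concatenate each candidate's embedding with its score to
    # form its slice of the target feature fusion vector.
    fusion = [embed(ch, vocab) + [score] for ch, score in candidates]
    # Claim 11: derive one target score per candidate from the fusion vector
    # (here a dot product standing in for a learned scoring layer).
    target_scores = [sum(f * w for f, w in zip(feat, weights))
                     for feat in fusion]
    # Rank candidates by target score to form the target replacement set.
    order = sorted(range(len(candidates)), key=lambda i: -target_scores[i])
    return [candidates[i][0] for i in order]
```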
12. The text error correction device according to any one of claims 7 to 11, further comprising:
a fourth processing module configured to segment the target text into words to generate a word vector set;
a fifth processing module configured to calculate an error probability for each character in the word vector set; and
the first determining module being further configured to, when the error probability is greater than a target threshold, determine the position corresponding to the character as the target error position;
or, further comprising:
a first receiving module configured to receive a first input on the target text from a user; and
the first determining module being further configured to determine the target error position in response to the first input.
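The automatic branch of claim 12 amounts to flagging every character whose error probability exceeds a threshold. A sketch, with the per-character probabilities supplied by a hypothetical detection model run over the segmented text:

```python
def find_error_positions(error_probs, threshold):
    """error_probs: per-character error probabilities for the target text
    (e.g. output of a detection model over the word vector set).
    Returns the positions to treat as target error positions."""
    return [i for i, p in enumerate(error_probs) if p > threshold]
```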
13. An electronic device, comprising a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the text error correction method according to any one of claims 1 to 6.
14. A readable storage medium, having stored thereon a program or instructions which, when executed by a processor, implement the text error correction method according to any one of claims 1 to 6.
CN202210134582.7A 2022-02-14 2022-02-14 Text error correction method, text error correction device and electronic equipment Pending CN114510926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210134582.7A CN114510926A (en) 2022-02-14 2022-02-14 Text error correction method, text error correction device and electronic equipment


Publications (1)

Publication Number Publication Date
CN114510926A true CN114510926A (en) 2022-05-17

Family

ID=81552197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210134582.7A Pending CN114510926A (en) 2022-02-14 2022-02-14 Text error correction method, text error correction device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114510926A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination