CN110929502A - Text error detection method and device - Google Patents


Info

Publication number
CN110929502A
Authority
CN
China
Prior art keywords
probability
characters
suspected
text
vocabulary
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811006028.0A
Other languages
Chinese (zh)
Other versions
CN110929502B (en)
Inventor
张占秋
李帅
王伟玮
王杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201811006028.0A
Publication of CN110929502A
Application granted
Publication of CN110929502B
Current legal status: Active

Abstract

The embodiments of the application provide a text error detection method and device. A corpus storing correct text is used to preliminarily screen suspected erroneous characters and suspected erroneous words from a text to be detected. Target suspected words of higher accuracy are then screened from the words to which the suspected erroneous characters belong. Finally, the target erroneous characters are obtained from the target suspected words based on the probability that each target suspected word appears at its current position in the text to be detected. Following the preliminary screening with intersection selection and probability screening effectively improves the accuracy of text error detection. In addition, by expanding or updating the corpus, the method and device can detect errors in various types of text and are therefore highly adaptable.

Description

Text error detection method and device
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a text error detection method and apparatus.
Background
With the development of science and technology, intelligent service scenarios require operations such as semantic understanding and intention classification on the dialog text of a user or customer service agent, followed by corresponding operations based on the resulting semantics or intentions. At present, text obtained by manual writing, input-method entry, or speech recognition contains wrongly written characters. These make semantic understanding and intention classification much harder, seriously reduce their accuracy, and degrade the quality of the intelligent service.
Some text error detection methods exist in the prior art, but they suffer from low error detection accuracy or poor applicability. For example, some methods are applicable only to certain texts and have very low error detection accuracy on others.
Disclosure of Invention
In view of the above, the present application aims to provide a text error detection method and apparatus to improve the error detection accuracy and adaptability of texts.
In a first aspect, an embodiment of the present application provides a text error detection method, including:
based on the corpus in which the correct text is stored, screening suspected wrong words and suspected wrong characters from the text to be detected;
obtaining, from the text to be detected, the vocabulary to which each suspected erroneous character belongs, and screening, from the vocabulary to which the suspected erroneous characters belong, the vocabulary that also belongs to the suspected erroneous vocabulary to obtain a target suspected vocabulary;
and screening target error characters from the target suspected vocabulary based on the probability of each target suspected vocabulary appearing at the current position of the text to be detected.
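As an illustration only, the second step can be sketched in Python as follows. The sketch assumes that the text has already been segmented into a list of words, that suspected erroneous characters are given as 0-based character offsets, and that suspected erroneous words are given as word indices. It also reads the second step as an intersection, as the abstract suggests. The function name `target_suspect_words` and these data representations are hypothetical and are not defined by the application.

```python
def target_suspect_words(words, suspect_char_positions, suspect_word_indices):
    """Return indices of words that contain a suspected character AND are
    themselves suspected words (hypothetical representation, see lead-in)."""
    # Map every character offset in the joined text to the index of its word.
    char_to_word = []
    for index, word in enumerate(words):
        char_to_word.extend([index] * len(word))

    # Words to which the suspected erroneous characters belong.
    owners = {char_to_word[pos] for pos in suspect_char_positions}

    # Keep only the owners that were also flagged at the word level.
    return sorted(owners & set(suspect_word_indices))


words = ["我", "想", "打", "车", "去", "机厂"]
print(target_suspect_words(words, {4, 5, 6}, {5}))
```

In this toy input, characters at offsets 4 to 6 are suspected, which touches words 4 and 5, but only word 5 was also flagged at the word level, so it alone is kept.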
With reference to the first aspect, this embodiment provides a first possible implementation manner of the first aspect, where the screening target wrong characters from the target suspected vocabulary includes:
screening target wrong words from the target suspected words based on the probability of each target suspected word appearing at the current position of the text to be detected;
and screening characters belonging to the suspected error characters from all characters of the target error vocabulary to obtain the target error characters.
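The last step of this implementation manner can be sketched as below, under the same hypothetical representation (a list of segmented words, word indices for the target erroneous words, and 0-based character offsets for the suspected erroneous characters). The name `target_error_chars` is an assumption made for illustration.

```python
def target_error_chars(words, error_word_indices, suspect_char_positions):
    """Return (offset, character) pairs for characters of the target erroneous
    words that were also flagged as suspected erroneous characters."""
    # Starting character offset of each word in the joined text.
    starts = []
    offset = 0
    for word in words:
        starts.append(offset)
        offset += len(word)

    suspects = set(suspect_char_positions)
    result = []
    for i in error_word_indices:
        for pos in range(starts[i], starts[i] + len(words[i])):
            if pos in suspects:
                result.append((pos, words[i][pos - starts[i]]))
    return result


print(target_error_chars(["我", "想", "去", "机厂"], [3], {2, 4}))
```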
With reference to the first possible implementation manner of the first aspect, this example provides a second possible implementation manner of the first aspect, where the screening a target wrong vocabulary from the target suspected vocabulary includes:
determining the probability of each target suspected vocabulary appearing at the current position of the text to be detected to obtain a first probability of each target suspected vocabulary;
and screening the target suspected vocabulary with the first probability smaller than a first preset value to obtain the target error vocabulary.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, wherein determining the first probability includes:
acquiring a previous word of the target suspected word in the text to be detected to obtain a first word;
acquiring a latter vocabulary of the target suspected vocabulary in the text to be detected to obtain a second vocabulary;
and determining the probability of the common occurrence of the target suspected vocabulary, the first vocabulary and the second vocabulary according to the probability of the common occurrence of every two vocabularies in the corpus to obtain the first probability.
With reference to the third possible implementation manner of the first aspect, the present application provides a fourth possible implementation manner of the first aspect, where a probability that every two words in the corpus occur together is specifically a probability that one of the words appears after another word;
determining the first probability includes:
and determining the probability that the target suspected vocabulary appears behind the first vocabulary and the second vocabulary appears behind the target suspected vocabulary according to the probability that every two vocabularies in the corpus appear together.
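A minimal sketch of how the first probability and the first preset value might be used is given below. The application only states that the first probability is determined from pairwise probabilities; multiplying the two ordered pair probabilities, the `floor` fallback for unseen pairs, and the names `first_probability`, `target_error_words` and `bigram_prob` are all assumptions made for illustration.

```python
def first_probability(prev_word, word, next_word, bigram_prob, floor=1e-8):
    """Probability that `word` follows `prev_word` and is followed by `next_word`.

    `bigram_prob[(a, b)]` is taken to be the probability that b appears after a
    in the corpus. The product and the `floor` fallback are assumptions.
    """
    p_after_prev = bigram_prob.get((prev_word, word), floor)
    p_before_next = bigram_prob.get((word, next_word), floor)
    return p_after_prev * p_before_next


def target_error_words(words, candidate_indices, bigram_prob, first_threshold):
    """Keep candidates whose first probability is below the first preset value."""
    errors = []
    for i in candidate_indices:
        # Assumes every candidate has a neighbor on both sides, for example
        # because start and end markers were added during preprocessing.
        p = first_probability(words[i - 1], words[i], words[i + 1], bigram_prob)
        if p < first_threshold:
            errors.append(i)
    return errors


bigram_prob = {("<", "去"): 0.5, ("去", "机场"): 0.25, ("机场", ">"): 0.5}
print(target_error_words(["<", "去", "机厂", ">"], [2], bigram_prob, 1e-6))
```

Because a word next to an erroneous word also sees an unlikely pair, the threshold is applied here only to the target suspected words selected in the earlier steps.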
With reference to the first aspect, the first possible implementation manner of the first aspect, the second possible implementation manner of the first aspect, the third possible implementation manner of the first aspect, or the fourth possible implementation manner of the first aspect, an example of the present application provides a fifth possible implementation manner of the first aspect, where the screening for suspected erroneous vocabulary and suspected erroneous characters from a text to be detected includes:
acquiring the corpus and a text to be detected;
and screening suspected wrong vocabularies and suspected wrong characters from the text to be detected based on the probability of the common occurrence of every two characters and the probability of the common occurrence of every two vocabularies in the corpus.
With reference to the fifth possible implementation manner of the first aspect, an embodiment of the present application provides a sixth possible implementation manner of the first aspect, where the screening suspected erroneous characters from the text to be detected includes:
acquiring M characters of the text to be detected starting from the Nth character to obtain at least one first set; wherein N is greater than or equal to 1 and less than or equal to L-M-1, and L represents the number of characters in the text to be detected;
determining the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability;
and screening the suspected wrong characters from the first set based on the second probability.
With reference to the sixth possible implementation manner of the first aspect, this application example provides a seventh possible implementation manner of the first aspect, where the screening the suspected erroneous characters from the first set includes:
and screening a first set with the second probability smaller than a second preset value, and acquiring all characters from the screened first set to obtain the suspected wrong characters.
With reference to the sixth possible implementation manner of the first aspect, an embodiment of the present application provides an eighth possible implementation manner of the first aspect, where a probability that every two characters in the corpus occur together is specifically a probability that one of the characters occurs after another character;
determining the second probability includes:
and determining the probability of the co-occurrence of the M characters in the first set according to a first predetermined sequence according to the probability of the co-occurrence of every two characters in the corpus to obtain the second probability, wherein the first predetermined sequence is used for representing the sequence of the M characters in the first set in the text to be detected.
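The character-level screening of the sixth to eighth implementation manners can be sketched as a sliding window. This is a hedged illustration: the second probability is computed here as the product of the ordered pair probabilities of adjacent characters, unseen pairs fall back to a small `floor` value, and every full window is visited using 0-based offsets rather than reproducing the stated bound on N. These choices, together with the names `suspect_char_positions` and `char_bigram_prob`, are assumptions.

```python
def suspect_char_positions(text, char_bigram_prob, m=3, threshold=1e-4, floor=1e-8):
    """Return 0-based offsets of all characters in windows of `m` characters
    whose second probability is below the second preset value `threshold`."""
    suspects = set()
    for n in range(len(text) - m + 1):
        window = text[n:n + m]
        # Second probability: product over adjacent pairs, in text order.
        p = 1.0
        for a, b in zip(window, window[1:]):
            p *= char_bigram_prob.get((a, b), floor)
        if p < threshold:
            # All characters of a low-probability window become suspects.
            suspects.update(range(n, n + m))
    return sorted(suspects)


probs = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(suspect_char_positions("abcd", probs, m=2, threshold=0.01))
```

The same loop applies unchanged to a list of words, with word-pair probabilities in place of character-pair probabilities, which corresponds to the word-level screening that uses the third probability.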
With reference to the fifth possible implementation manner of the first aspect, an example of the present application provides a ninth possible implementation manner of the first aspect, where the screening suspected wrong words from the text to be detected includes:
obtaining Q vocabularies of the text to be detected starting from the P-th vocabulary to obtain at least one second set; wherein P is greater than or equal to 1 and less than or equal to K-Q-1, and K represents the number of words in the text to be detected;
determining the probability of the common occurrence of Q words in the second set according to the probability of the common occurrence of every two words in the corpus to obtain a third probability;
and screening the suspected wrong vocabulary from the second set based on the third probability.
With reference to the ninth possible implementation manner of the first aspect, this example provides a tenth possible implementation manner of the first aspect, where the screening the suspected erroneous vocabulary from the second set includes:
and screening a second set with the third probability smaller than a third preset value, and acquiring all words from the screened second set to obtain the suspected wrong words.
With reference to the ninth possible implementation manner of the first aspect, the present application provides an eleventh possible implementation manner of the first aspect, where a probability that every two words in the corpus occur together is specifically a probability that one of the words occurs after another of the words;
determining the third probability comprises:
and determining, according to the probability that every two words in the corpus occur together, the probability that the Q words in the second set occur together in a second predetermined sequence, to obtain the third probability, wherein the second predetermined sequence is the sequence of the Q words in the second set in the text to be detected.
With reference to the fifth possible implementation manner of the first aspect, an embodiment of the present application provides a twelfth possible implementation manner of the first aspect, where the method further includes:
preprocessing all texts in the corpus;
determining the probability of common occurrence of every two characters in the preprocessed corpus;
determining the probability of the common occurrence of every two vocabularies in the preprocessed corpus;
and screening suspected wrong vocabularies and suspected wrong characters from the text to be detected based on the probability of common occurrence of every two characters and the probability of common occurrence of every two vocabularies in the preprocessed corpus.
With reference to the twelfth possible implementation manner of the first aspect, this embodiment provides a thirteenth possible implementation manner of the first aspect, where the determining a probability that every two characters in the preprocessed corpus occur together includes:
and determining the frequency of co-occurrence of every two characters in the preprocessed corpus, and determining the probability of co-occurrence of every two characters according to the obtained frequency.
With reference to the thirteenth possible implementation manner of the first aspect, this application example provides a fourteenth possible implementation manner of the first aspect, where the frequency of co-occurrence of every two characters is specifically the frequency with which one of the characters occurs after the other character.
With reference to the thirteenth possible implementation manner of the first aspect, an embodiment of the present application provides a fifteenth possible implementation manner of the first aspect, where the method further includes:
and adding the co-occurrence frequency of every two characters in the preprocessed corpus to a fourth preset value to obtain the updated co-occurrence frequency of every two characters.
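The frequency counting and the addition of a preset value can be sketched as follows. The application does not state how the updated frequencies are converted into probabilities; the sketch assumes the common add-k normalization, in which the updated count of an ordered pair is divided by the total updated count of all pairs that start with the same first item. The name `bigram_probabilities` and the parameter `k` (standing in for the fourth preset value) are hypothetical.

```python
from collections import Counter


def bigram_probabilities(texts, k=1.0):
    """Build a function prob(a, b) estimating how likely b occurs right after a.

    `texts` may be strings (character pairs) or lists of words (word pairs).
    `k` plays the role of the preset value added to every pair frequency.
    """
    pair_counts = Counter()   # frequency of b occurring right after a
    first_counts = Counter()  # how often a occurs as the first item of a pair
    vocab = set()
    for text in texts:
        vocab.update(text)
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    size = len(vocab)

    def prob(a, b):
        # Updated frequency divided by the updated total for `a` (assumption).
        return (pair_counts[(a, b)] + k) / (first_counts[a] + k * size)

    return prob


prob = bigram_probabilities(["ab", "ab", "ac"], k=1.0)
print(prob("a", "b"), prob("a", "a"))
```

Adding the preset value keeps pairs that never occur in the corpus from receiving a probability of exactly zero, so a single unseen pair does not make a whole window probability vanish.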
With reference to the twelfth possible implementation manner of the first aspect, this embodiment provides a sixteenth possible implementation manner of the first aspect, where the determining a probability that every two words in the preprocessed corpus occur together includes:
and determining the frequency of co-occurrence of every two words in the preprocessed corpus, and determining the probability of co-occurrence of every two words according to the obtained frequency.
With reference to the sixteenth possible implementation manner of the first aspect, the present application provides a seventeenth possible implementation manner of the first aspect, where the frequency of co-occurrence of every two vocabularies is specifically the frequency with which one of the vocabularies occurs after the other vocabulary.
With reference to the sixteenth possible implementation manner of the first aspect, an embodiment of the present application provides an eighteenth possible implementation manner of the first aspect, where the method further includes:
and adding the co-occurrence frequency of every two words in the preprocessed corpus to a fifth preset value to obtain the updated co-occurrence frequency of every two words.
With reference to the twelfth possible implementation manner of the first aspect, this application provides a nineteenth possible implementation manner of the first aspect, where the corpus includes at least one text;
the preprocessing of all texts in the corpus comprises:
adding a first predetermined character before a first character of each text;
adding a second predetermined character after the last character of each text;
replacing all characters except the Chinese character in each text with a third preset character;
replacing a consecutive plurality of third predetermined characters with one of said third predetermined characters.
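The four preprocessing operations above can be sketched with a single regular expression. The marker characters `<`, `>` and `#` stand in for the first, second and third predetermined characters and are hypothetical, as is the name `preprocess`. The sketch treats only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese characters, and it applies the replacement before adding the boundary markers so that the markers themselves are not replaced; both are assumptions.

```python
import re

# A run of one or more characters outside the basic CJK Unified Ideographs block.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text, start="<", end=">", other="#"):
    """Replace each run of non-Chinese characters with one `other` character,
    then wrap the result in the `start` and `end` markers."""
    # Matching whole runs replaces and collapses in a single pass.
    body = NON_CHINESE_RUN.sub(other, text)
    return f"{start}{body}{end}"


print(preprocess("你好, world! 再见"))
```

The same function can be applied both to every text in the corpus and to the text to be detected, so that both are normalized in the same way.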
With reference to the nineteenth possible implementation manner of the first aspect, the present application provides a twentieth possible implementation manner of the first aspect, where the method further includes:
performing word segmentation on the preprocessed text to obtain a plurality of words of the corpus;
and determining the probability of the common occurrence of every two words in the preprocessed corpus based on the words in the corpus.
With reference to the nineteenth possible implementation manner of the first aspect or the twentieth possible implementation manner of the first aspect, an embodiment of the present application provides a twenty-first possible implementation manner of the first aspect, where the method further includes a step of preprocessing the text to be detected:
adding the first preset character before the first character of the text to be detected;
adding the second predetermined character after the last character of the text to be detected;
replacing all characters except the Chinese character in the text to be detected with a third preset character;
replacing a consecutive plurality of third predetermined characters with one of said third predetermined characters.
With reference to the twenty-first possible implementation manner of the first aspect, an embodiment of the present application provides a twenty-second possible implementation manner of the first aspect, where the method further includes:
performing word segmentation processing on the text to be detected to obtain a plurality of words of the text to be detected;
and screening the suspected wrong vocabulary based on the vocabulary of the text to be detected and the corpus.
In a second aspect, an embodiment of the present application provides a text error detection apparatus, including:
the first screening module is used for screening suspected wrong words and suspected wrong characters from the text to be detected based on the corpus in which the correct text is stored;
the second screening module is used for acquiring, from the text to be detected, the vocabulary to which each suspected erroneous character belongs, and screening, from the vocabulary to which the suspected erroneous characters belong, the vocabulary that also belongs to the suspected erroneous vocabulary to obtain a target suspected vocabulary;
and the third screening module is used for screening target wrong characters from the target suspected vocabulary based on the probability that each target suspected vocabulary appears at the current position of the text to be detected.
With reference to the second aspect, the present application provides a first possible implementation manner of the second aspect, where the third screening module includes:
the target wrong vocabulary screening submodule is used for screening target wrong vocabularies from the target suspected vocabularies based on the probability that each target suspected vocabulary appears at the current position of the text to be detected;
and the target error character screening submodule is used for screening characters belonging to the suspected error characters from all characters of the target error vocabulary to obtain the target error characters.
With reference to the first possible implementation manner of the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, where the target wrong vocabulary screening sub-module is specifically configured to determine a probability that each target suspected vocabulary appears at the current position of the text to be detected, obtain a first probability of each target suspected vocabulary, and screen a target suspected vocabulary having a first probability smaller than a first predetermined value, so as to obtain the target wrong vocabulary.
With reference to the second possible implementation manner of the second aspect, an embodiment of the present application provides a third possible implementation manner of the second aspect, where the target wrong vocabulary screening submodule, when determining the first probability, is further specifically configured to obtain a previous vocabulary of the target suspected vocabulary in the text to be detected, obtain the first vocabulary, obtain a next vocabulary of the target suspected vocabulary in the text to be detected, obtain the second vocabulary, and determine, according to a probability that every two vocabularies in the corpus occur together, a probability that the target suspected vocabulary, the first vocabulary, and the second vocabulary occur together, so as to obtain the first probability.
With reference to the third possible implementation manner of the second aspect, the present embodiment provides a fourth possible implementation manner of the second aspect, where a probability that every two words in the corpus occur together is specifically a probability that one of the words will appear after another word;
when determining the first probability, the target wrong vocabulary screening submodule is further specifically configured to determine, according to a probability that every two vocabularies in the corpus occur together, a probability that the target suspected vocabulary appears behind the first vocabulary and a probability that the second vocabulary appears behind the target suspected vocabulary.
With reference to the second aspect, the first possible implementation manner of the second aspect, the second possible implementation manner of the second aspect, the third possible implementation manner of the second aspect, or the fourth possible implementation manner of the second aspect, an embodiment of the present application provides a fifth possible implementation manner of the second aspect, where the first screening module includes:
the acquisition submodule is used for acquiring the corpus and the text to be detected;
and the suspected error screening submodule is used for screening suspected error vocabularies and suspected error characters from the text to be detected based on the probability that every two characters in the corpus appear together and the probability that every two vocabularies appear together.
With reference to the fifth possible implementation manner of the second aspect, an embodiment of the present application provides a sixth possible implementation manner of the second aspect, where the suspected error screening submodule is specifically configured to obtain M characters, starting from an nth character, of the text to be detected, and obtain at least one first set; determining the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability, and screening the suspected wrong characters from the first set based on the second probability; wherein N is greater than or equal to 1, and N is less than or equal to L-M-1, and L represents the number of characters of the text of the object to be detected.
With reference to the sixth possible implementation manner of the second aspect, an embodiment of the present application provides a seventh possible implementation manner of the second aspect, where the suspected error screening sub-module, when screening the suspected error characters from the first set, is further specifically configured to screen the first set with a second probability smaller than a second predetermined value, and obtain all characters from the first set obtained by screening, so as to obtain the suspected error characters.
With reference to the sixth possible implementation manner of the second aspect, the present embodiment provides an eighth possible implementation manner of the second aspect, where a probability that every two characters in the corpus occur together is specifically a probability that one of the characters occurs after another character;
when determining the second probability, the suspected error screening submodule is further specifically configured to determine, according to the probability that every two characters in the corpus occur together, the probability that the M characters in the first set occur together according to a first predetermined sequence, so as to obtain the second probability, where the first predetermined sequence is used to indicate the sequence of the M characters in the first set in the text to be detected.
With reference to the fifth possible implementation manner of the second aspect, an embodiment of the present application provides a ninth possible implementation manner of the second aspect, wherein the suspected error screening sub-module is further specifically configured to obtain Q vocabularies of the text to be detected, starting from the pth vocabulary, and obtain at least one second set; determining the probability of the common occurrence of Q words in the second set according to the probability of the common occurrence of every two words in the corpus to obtain a third probability; screening the suspected wrong vocabulary from the second set based on the third probability; wherein P is more than or equal to 1 and less than or equal to K-Q-1, and K represents the number of words in the text of the object to be detected.
With reference to the ninth possible implementation manner of the second aspect, an example of the present application provides a tenth possible implementation manner of the second aspect, where the suspected error screening sub-module, when screening the suspected erroneous vocabulary from the second set, is further specifically configured to screen the second set with a third probability smaller than a third predetermined value, and obtain all vocabularies from the second set obtained by screening, so as to obtain the suspected erroneous vocabulary.
With reference to the ninth possible implementation manner of the second aspect, the present embodiment provides an eleventh possible implementation manner of the second aspect, where a probability that every two words in the corpus occur together is specifically a probability that one of the words occurs after another of the words;
when determining the third probability, the suspected error screening submodule is further specifically configured to determine, according to a probability that every two words in the corpus occur together, a probability that the Q words in the second set occur together in a second predetermined sequence, so as to obtain the third probability, where the second predetermined sequence is the sequence of the Q words in the second set in the text to be detected.
With reference to the fifth possible implementation manner of the second aspect, this application provides a twelfth possible implementation manner of the second aspect, where the first screening module further includes:
the preprocessing submodule is used for preprocessing all texts in the corpus;
the first probability determination submodule is used for determining the probability of the common occurrence of every two characters in the preprocessed corpus;
the second probability determination submodule is used for determining the probability that every two words in the preprocessed corpus appear together;
the suspected error screening submodule is further used for screening suspected error vocabularies and suspected error characters from the text to be detected based on the probability that every two characters in the preprocessed corpus appear together and the probability that every two vocabularies appear together.
With reference to the twelfth possible implementation manner of the second aspect, this embodiment provides a thirteenth possible implementation manner of the second aspect, wherein the first probability determination sub-module is specifically configured to determine a frequency of co-occurrence of every two characters in the preprocessed corpus, and determine the probability of co-occurrence of every two characters according to the obtained frequency.
In combination with the thirteenth possible implementation manner of the second aspect, the present application provides a fourteenth possible implementation manner of the second aspect, where the frequency of the common occurrence of every two characters is specifically, the frequency of the occurrence of one of the characters after the occurrence of the other character.
With reference to the thirteenth possible implementation manner of the second aspect, this application provides a fifteenth possible implementation manner of the second aspect, wherein the first probability determination sub-module is further configured to add a fourth predetermined value to a frequency of co-occurrence of every two characters in the preprocessed corpus, so as to obtain an updated frequency of co-occurrence of every two characters.
With reference to the twelfth possible implementation manner of the second aspect, an embodiment of the present application provides a sixteenth possible implementation manner of the second aspect, wherein the second probability determination submodule is specifically configured to determine a frequency of co-occurrence of every two vocabularies in the preprocessed corpus, and determine the probability of co-occurrence of every two vocabularies according to the obtained frequency.
In combination with the sixteenth possible implementation manner of the second aspect, the present example provides a seventeenth possible implementation manner of the second aspect, wherein the frequency of occurrence of each two vocabularies is specifically, the frequency of occurrence of one vocabulary after the other vocabulary.
With reference to the sixteenth possible implementation manner of the second aspect, in an embodiment of the present application, there is provided an eighteenth possible implementation manner of the second aspect, wherein the second probability determination submodule is further configured to add a fifth predetermined value to a frequency of co-occurrence of every two vocabularies in the preprocessed corpus, so as to obtain an updated frequency of co-occurrence of every two vocabularies.
With reference to the twelfth possible implementation manner of the second aspect, the present application provides a nineteenth possible implementation manner of the second aspect, where the corpus includes at least one text;
the preprocessing submodule is specifically configured to add a first predetermined character before a first character of each text, add a second predetermined character after a last character of each text, replace all characters except a chinese character in each text with third predetermined characters, and replace a plurality of consecutive third predetermined characters with one third predetermined character.
With reference to the nineteenth possible implementation manner of the second aspect, the present application provides a twentieth possible implementation manner of the second aspect, wherein the preprocessing sub-module is further specifically configured to perform word segmentation on the preprocessed text to obtain a plurality of words in the corpus;
the second probability determination submodule is further used for determining the probability of the common occurrence of every two words in the preprocessed corpus based on the words in the corpus.
With reference to the nineteenth possible implementation manner of the second aspect or the twentieth possible implementation manner of the second aspect, an embodiment of the present application provides a twenty-first possible implementation manner of the second aspect, wherein the preprocessing sub-module is further configured to preprocess the text to be detected, and is specifically configured to add the first predetermined character before a first character of the text to be detected, add the second predetermined character after a last character of the text to be detected, replace all characters other than Chinese characters in the text to be detected with the third predetermined character, and replace a plurality of consecutive third predetermined characters with one third predetermined character.
With reference to the twenty-first possible implementation manner of the second aspect, an embodiment of the present application provides a twenty-second possible implementation manner of the second aspect, wherein the preprocessing sub-module is further configured to perform word segmentation processing on the text to be detected to obtain a plurality of words of the text to be detected;
the suspected error screening submodule is further specifically configured to screen the suspected error vocabulary based on the vocabulary of the text to be detected and the corpus.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, this application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
The text error detection method and the text error detection device provided by the embodiment of the application firstly utilize a corpus storing correct texts to preliminarily screen suspected wrong characters and suspected wrong vocabularies in texts to be detected, then screen out target suspected words with higher accuracy from the vocabularies to which the suspected wrong characters belong, and finally screen out the target suspected words based on the probability that each target suspected word appears at the current position of the texts to be detected to obtain the final target wrong characters. The embodiment of the application obtains the suspected wrong characters and the suspected wrong vocabularies through preliminary screening, further performs the processing of selecting intersection and probability screening, and can effectively improve the accuracy of text error detection. Meanwhile, the text error detection method and device provided by the embodiment of the application can detect various types of texts to be detected by expanding or updating the corpus, and have strong adaptability.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flow chart of a text error detection method according to an embodiment of the present application;
FIG. 2 is a flow chart of a text error detection method provided in the second embodiment of the present application;
FIG. 3 is a flow chart of a text error detection method provided in the third embodiment of the present application;
FIG. 4 is a flow chart of a text error detection method provided in the fourth embodiment of the present application;
FIG. 5 is a flow chart of a text error detection method provided in the fifth embodiment of the present application;
FIG. 6 is a schematic diagram of a sliding window in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a text error detection apparatus according to a seventh embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text error detection apparatus according to an eighth embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text error detection apparatus according to a ninth embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
To address the low accuracy and poor adaptability of text error detection in the prior art, the embodiments of the application disclose a text error detection method and a text error detection device. The method or the device is applicable to text error detection in a variety of application scenarios and has strong adaptability; for example, it can be used for text error detection in a bank customer service system, in a taxi taking system, and in an online shopping system. Meanwhile, the text error detection method and device provided by the embodiments of the application can accurately find the wrong characters in a text, so the error detection precision is high.
For the convenience of understanding the embodiments of the present application, a text error detection method disclosed in the embodiments of the present application will be described in detail first.
Example one
The embodiment provides a text error detection method, which utilizes a corpus in which correct texts are stored to detect and obtain error characters (namely target error characters) in a text to be detected. Specifically, as shown in fig. 1, the text error detection method of the present embodiment includes:
S110, based on the corpus in which the correct text is stored, suspected wrong words and suspected wrong characters are screened from the text to be detected.
Here, a plurality of correct texts are stored in the corpus in advance. The texts can be correct texts acquired in a specific scene, for example, the texts are correct texts acquired in a specific application scene of the logistics service conversation; the texts may also be correct texts obtained in a variety of different application scenarios, for example, the texts are correct texts obtained in different application scenarios such as logistics customer sessions, bank customer sessions, online shopping, and the like. The correct texts can be used for text error detection in a specific application scene, and text error detection in more application scenes can be realized by adding or updating texts in the corpus.
In addition, the texts stored in the corpus can be updated according to the change of the requirement of the actual application scene, so that the accuracy of suspected wrong words and suspected wrong characters obtained according to the corpus in a specific application scene is improved.
S120, obtaining the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening, from the obtained vocabulary, the vocabulary belonging to the suspected wrong vocabulary to obtain a target suspected vocabulary.
Here, based on the suspected wrong characters, new suspected wrong words are acquired from the text to be detected, and the intersection of these new suspected wrong words and the suspected wrong vocabulary obtained in S110 is then calculated to obtain the target suspected vocabulary.
S130, screening target error characters from the target suspected vocabulary based on the probability of each target suspected vocabulary appearing at the current position of the text to be detected.
Here, the further screening according to the probability that each target suspected vocabulary appears at the current position of the text to be detected is a judgment of whether it is reasonable for that target suspected vocabulary to appear at the current position. If it is judged unreasonable, the target suspected vocabulary is very likely to contain wrong characters, that is, the target wrong characters screened from the target suspected vocabulary are very likely to be real wrong characters. Screening by position probability in this step therefore further improves the precision of wrong-character detection.
In summary, the text error detection method of this embodiment does not simply and directly screen the wrong characters, but combines the screening of the suspected wrong vocabulary and the suspected wrong characters, and then, on the basis of the obtained suspected wrong vocabulary and the suspected wrong characters, adopts a processing method of calculating the intersection and screening the probability, and obtains the target wrong characters with high accuracy. Meanwhile, the text error detection method of the embodiment can detect various types of texts to be detected by expanding or updating the corpus, and has strong adaptability.
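The intersection performed in S120 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `target_suspected_words`, the representation of the segmented text as a list of strings, and the sample data are all hypothetical.

```python
def target_suspected_words(words, suspect_chars, suspect_words):
    """Sketch of S120 (hypothetical helper, not from the application).

    words         -- segmented text to be detected, in order (assumed input)
    suspect_chars -- set of suspected wrong characters from S110
    suspect_words -- set of suspected wrong words from S110
    """
    # Words to which at least one suspected wrong character belongs.
    containing = [w for w in words if any(c in suspect_chars for c in w)]
    # Keep only those that were also flagged at the word level (intersection).
    return [w for w in containing if w in suspect_words]


# Hypothetical example data.
words = ["我", "想", "查询", "帐单", "明细"]
suspect_chars = {"帐", "想"}
suspect_words = {"帐单", "明细"}
result = target_suspected_words(words, suspect_chars, suspect_words)
```

In this example only the word flagged by both the character-level and the word-level screening survives, which is the purpose of taking the intersection.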
Example two
The embodiment provides a text error detection method, and the method provides a specific implementation manner for screening target error characters from the target suspected vocabulary on the basis of the previous embodiment. As shown in fig. 2, the text error detection method in this embodiment includes the following steps:
S210, based on the corpus in which the correct text is stored, suspected wrong words and suspected wrong characters are screened from the text to be detected.
S220, obtaining the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening, from the obtained vocabulary, the vocabulary belonging to the suspected wrong vocabulary to obtain a target suspected vocabulary.
S230, screening target wrong words from the target suspected words based on the probability that each target suspected word appears at the current position of the text to be detected.
Here, the target error vocabulary is obtained by reasonably screening the target suspected vocabulary, and has higher accuracy.
Here, the target error vocabulary may be specifically screened by the following substeps:
S2301, determining the probability of each target suspected vocabulary appearing at the current position of the text to be detected, and obtaining a first probability of each target suspected vocabulary.
S2302, screening the target suspected vocabulary with the first probability smaller than the first preset value to obtain the target error vocabulary.
The first predetermined value can be set flexibly according to the requirements of the actual application scenario. It can be set to a larger value so that several errors are detected in each pass, or to a smaller value so that only one or two errors are detected in each pass, with error detection performed again after the obtained target error characters have been corrected. In practical applications, the characters around a target error character are generally assumed to be correct, so detecting several target error characters that lie close to one another would considerably affect subsequent error correction. In practice, therefore, the first predetermined value is generally set to a small value.
S240, screening characters belonging to the suspected error characters from all characters of the target error vocabulary to obtain the target error characters.
Here, the intersection is actually calculated between all the characters in the target error vocabulary and the suspected error characters obtained in step S210 to obtain the target error characters, and the intersection calculation operation further improves the accuracy of text error detection.
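Sub-step S2302 and step S240 can be sketched together as follows. This is only an illustrative sketch: the function name `target_error_chars`, the dictionary mapping each target suspected word to its first probability, and the threshold value are assumptions made for the example.

```python
def target_error_chars(first_prob, suspect_chars, threshold):
    """Sketch of S2302 + S240 (hypothetical helper).

    first_prob    -- dict: target suspected word -> first probability
    suspect_chars -- set of suspected wrong characters from S210
    threshold     -- the first predetermined value (assumed here)
    """
    # S2302: target suspected words whose first probability is below the threshold.
    error_words = [w for w, p in first_prob.items() if p < threshold]
    # S240: characters of those words that are also suspected wrong characters.
    return [c for w in error_words for c in w if c in suspect_chars]


# Hypothetical example data.
first_prob = {"帐单": 0.001, "查询": 0.2}
suspect_chars = {"帐", "询"}
chars = target_error_chars(first_prob, suspect_chars, threshold=0.01)
```

Here "询" is a suspected wrong character, but its word is not a target error word, so it is not reported.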
Further, in this embodiment, determining the probability that each target suspected vocabulary appears at the current position of the text to be detected, that is, determining the first probability, may be implemented by using the following steps:
S2303, obtaining the previous word of the target suspected word in the text to be detected, to obtain a first word.
S2304, obtaining the next word of the target suspected word in the text to be detected, to obtain a second word.
S2305, determining the probability of the common occurrence of the target suspected vocabulary, the first vocabulary and the second vocabulary according to the probability of the common occurrence of every two vocabularies in the corpus, and obtaining the first probability.
Here, the probability that every two words co-occur in the corpus is used to calculate the probability that a plurality of words co-occur. In the specific calculation, the probability that the plurality of words co-occur may be calculated without considering the order in which the words occur. Of course, it may also be calculated taking that order into account.
When the order in which the words occur is considered, the probability that every two words co-occur in the corpus is specifically the probability that one word appears after the other word. In this case, the first probability can be calculated by the following step: determining, according to the probability that every two vocabularies in the corpus co-occur, the probability that the target suspected vocabulary appears after the first vocabulary and the second vocabulary appears after the target suspected vocabulary.
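One possible reading of the order-aware calculation of the first probability is sketched below. The text does not state how the two pairwise probabilities are combined; multiplying them is an assumption made here, as are the function name `first_probability`, the dictionary `bigram_prob`, and the fallback value for unseen pairs.

```python
def first_probability(prev_word, target, next_word, bigram_prob, default=0.0):
    """Sketch of the first probability (hypothetical helper).

    bigram_prob[(a, b)] is assumed to hold the probability that word b
    appears directly after word a in the corpus.
    """
    # Probability that the target suspected word appears after the first word.
    p_target_after_prev = bigram_prob.get((prev_word, target), default)
    # Probability that the second word appears after the target suspected word.
    p_next_after_target = bigram_prob.get((target, next_word), default)
    # Assumption: the two pairwise probabilities are combined by multiplication.
    return p_target_after_prev * p_next_after_target


# Hypothetical pairwise probabilities.
bigram_prob = {("查询", "账单"): 0.5, ("账单", "明细"): 0.25}
p = first_probability("查询", "账单", "明细", bigram_prob)
```

A word that rarely follows its predecessor, or is rarely followed by its successor, receives a low first probability and can then be compared with the first predetermined value.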
Example three
The embodiment provides a text error detection method, and on the basis of any one of the above embodiments, the embodiment provides a specific implementation manner for screening suspected wrong words and suspected wrong characters from a text to be detected. As shown in fig. 3, the text error detection method of the present embodiment includes:
S310, acquiring the corpus and the text to be detected.
S320, based on the probability that every two characters in the corpus occur together and the probability that every two words occur together, screening suspected wrong words and suspected wrong characters from the text to be detected.
Here, the correct text is stored in the corpus, so that the probability that every two characters or every two words in the text to be detected occur together can be calculated by using the probability that every two characters and every two words in the corpus occur together, and then suspected wrong characters and suspected wrong words can be screened according to the probability that every two characters or every two words in the text to be detected occur together.
Here, specifically, the following sub-steps may be utilized to screen the text to be detected for suspected wrong characters:
S3201, obtaining M characters of the text to be detected starting from the N-th character, to obtain at least one first set; wherein N is greater than or equal to 1 and less than or equal to L-M-1, and L represents the number of characters of the text to be detected.
The value of M can be flexibly set according to the requirements of the actual application scenario, for example, M is set to 3, and then 3 characters continuously appearing in the text to be detected are obtained in this step to obtain the first set.
S3202, determining the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability.
The correct text is stored in the corpus, so the probability of the common occurrence of every two characters in the text to be detected can be calculated by utilizing the probability of the common occurrence of every two characters in the corpus, and suspected wrong characters can be screened according to the probability of the common occurrence of every two characters in the text to be detected.
In the specific calculation of the second probability, only the probability that the plurality of characters co-occur may be calculated regardless of the order in which the characters occur. Of course, the probability of the common occurrence of the plurality of characters may be calculated taking into account the order in which the characters occur together.
When the order in which the characters occur is considered, the probability that every two characters in the corpus co-occur is specifically the probability that one of the characters appears after the other character. The second probability may then be determined using the following step: determining, according to the probability that every two characters in the corpus co-occur, the probability that the M characters in the first set co-occur in a first predetermined order, to obtain the second probability, wherein the first predetermined order is the order of the M characters of the first set in the text to be detected.
S3203, based on the second probability, screening the suspected error characters from the first set.
Here, based on the probability of common occurrence of the characters in the first set (i.e. the second probability), the first set with the common occurrence probability smaller than a predetermined value may be screened, and then the suspected error character may be determined by using the screened first set. If the probability of the common occurrence of the characters in the first set is small and is less than a predetermined value, it indicates that the characters in the first set should not occur at the same time, but they occur in the current text to be detected at the same time, and it can preliminarily be determined that the characters are suspected to be wrong, i.e. the suspected wrong characters are determined.
In a specific implementation, the suspected wrong characters may be screened from the first set using the following step: screening the first sets whose second probability is smaller than a second predetermined value, and acquiring all characters from the screened first sets to obtain the suspected wrong characters. The second predetermined value can be set flexibly according to the requirements of the actual application scenario. It can be set to a larger value so that several errors are detected in each pass, or to a smaller value so that only one or two errors are detected in each pass, with error detection performed again after the obtained target error characters have been corrected.
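Sub-steps S3201 to S3203 can be sketched as a sliding window over the characters of the text to be detected. The sketch makes several assumptions that are not stated in the text: the second probability of a window is taken as the product of the order-aware probabilities of its adjacent character pairs, every window that fits in the text is examined, and the names `suspected_chars` and `pair_prob` as well as the sample values are hypothetical. Latin letters stand in for characters of the text.

```python
def suspected_chars(text, pair_prob, m=3, threshold=0.01, default=0.0):
    """Sketch of S3201-S3203 (hypothetical helper).

    pair_prob[(a, b)] is assumed to be the probability that character b
    appears directly after character a in the corpus.
    """
    suspects = set()
    # S3201: every window of m consecutive characters that fits in the text.
    for n in range(len(text) - m + 1):
        window = text[n:n + m]
        # S3202: second probability, assumed to be the product of the
        # probabilities of the adjacent pairs in window order.
        prob = 1.0
        for a, b in zip(window, window[1:]):
            prob *= pair_prob.get((a, b), default)
        # S3203: all characters of a low-probability window become suspects.
        if prob < threshold:
            suspects.update(window)
    return suspects


# Hypothetical pairwise probabilities; "c" followed by "d" is rare.
pair_prob = {("a", "b"): 0.5, ("b", "c"): 0.5, ("c", "d"): 0.001}
flagged = suspected_chars("abcd", pair_prob, m=3, threshold=0.01)
```

The window "abc" passes, while "bcd" falls below the threshold, so all three of its characters are flagged. This preliminary screening deliberately over-reports; the later intersection and position-probability steps narrow the result.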
Here, specifically, the following sub-steps may be utilized to screen the text to be detected for suspected error words:
S3204, obtaining Q vocabularies of the text to be detected starting from the P-th vocabulary, to obtain at least one second set; wherein P is greater than or equal to 1 and less than or equal to K-Q-1, and K represents the number of words in the text to be detected.
The value of Q can be flexibly set according to the requirements of the actual application scenario, for example, Q is set to 3, and then 3 words continuously appearing in the text to be detected are obtained in this step to obtain the second set.
S3205, determining the probability of the common occurrence of Q words in the second set according to the probability of the common occurrence of every two words in the corpus to obtain a third probability.
The correct text is stored in the corpus, so that the probability that every two words in the text to be detected commonly appear can be calculated by utilizing the probability that every two words in the corpus commonly appear, and suspected wrong words can be screened according to the probability that every two words in the text to be detected commonly appear.
In the specific calculation of the third probability, only the probability of the common occurrence of the plurality of words may be calculated regardless of the order of occurrence of the words. Of course, the probability of the common occurrence of a plurality of words may be calculated taking into account the order in which the words occur together.
When the order in which the words occur is considered, the probability that every two words in the corpus co-occur is specifically the probability that one word appears after the other word. The third probability may then be determined using the following step: determining, according to the probability that every two words in the corpus co-occur, the probability that the Q words in the second set co-occur in a second predetermined order, to obtain the third probability, wherein the second predetermined order is the order of the Q words of the second set in the text to be detected.
S3206, based on the third probability, the suspected wrong vocabulary is screened from the second set.
Here, based on the probability of the co-occurrence of the words in the second set (i.e. the third probability), the second set with the co-occurrence probability smaller than a predetermined value may be screened, and then the suspected error word may be determined by using the screened second set. If the probability of the common occurrence of the words in the second set is small and is less than a predetermined value, it indicates that the words in the second set should not occur at the same time, but they occur in the current text to be detected at the same time, and it can be preliminarily determined that the words are suspected to be wrong, i.e. the suspected wrong words are determined.
In a specific implementation, the suspected wrong vocabulary may be screened from the second set using the following step: screening the second sets whose third probability is smaller than a third predetermined value, and acquiring all words from the screened second sets to obtain the suspected wrong words. The third predetermined value can be set flexibly according to the requirements of the actual application scenario. It can be set to a larger value so that several errors are detected in each pass, or to a smaller value so that only one or two errors are detected in each pass, with error detection performed again after the obtained target error characters have been corrected.
S330, obtaining the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening, from the obtained vocabulary, the vocabulary belonging to the suspected wrong vocabulary to obtain a target suspected vocabulary.
S340, screening target error characters from the target suspected vocabulary based on the probability of each target suspected vocabulary appearing at the current position of the text to be detected.
Example four
The embodiment provides a text error detection method, and on the basis of the above embodiment, the embodiment provides a specific implementation manner that before suspected wrong words and suspected wrong characters are screened from a text to be detected, the text in a corpus is preprocessed, and the probability of common occurrence of every two characters and the probability of common occurrence of every two words in the corpus are determined. As shown in fig. 4, the text error detection method of the present embodiment includes:
S410, acquiring the corpus and the text to be detected.
S420, preprocessing all texts in the corpus.
Here, the preprocessing of each text in the corpus may be, but is not limited to, the following processing manner:
adding a first predetermined character before the first character of each text; adding a second predetermined character after the last character of each text; replacing all characters other than Chinese characters in each text with a third predetermined character; and replacing a plurality of consecutive third predetermined characters with one third predetermined character. The values of the first, second and third predetermined characters can be set flexibly according to actual needs; for example, the first predetermined character is set to the character "S", the second predetermined character to the character "E", and the third predetermined character to the character "P".
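The preprocessing described above can be sketched with a regular expression. The sketch rests on assumptions: "Chinese character" is taken to mean the basic CJK Unified Ideographs range U+4E00 to U+9FFF, the function name `preprocess` is hypothetical, and the start and end markers are attached after the replacement so that the markers themselves are not replaced by the third predetermined character.

```python
import re

# Runs of characters outside the basic CJK Unified Ideographs block
# (assumed definition of "not a Chinese character").
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text, start="S", end="E", filler="P"):
    """Sketch of the preprocessing step (hypothetical helper)."""
    # Replacing each whole run with a single filler covers both the
    # replacement and the collapsing of consecutive fillers.
    body = NON_CHINESE.sub(filler, text)
    # Add the first and second predetermined characters around the text.
    return start + body + end


cleaned = preprocess("你好, world! 查询123账单")
```

With the example values "S", "E" and "P" from the text, punctuation, Latin letters, digits and spaces between Chinese characters each collapse into a single "P".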
S430, determining the probability of the common occurrence of every two characters in the preprocessed corpus.
Here, the probability of every two characters co-occurring in the corpus can be determined using the following sub-steps: determining the number of times every two characters co-occur in the preprocessed corpus, and determining the co-occurrence frequency of every two characters according to the obtained number of times.
Specifically, the co-occurrence frequency of every two characters can be obtained by dividing the number of times the two characters co-occur by the number of times one of the characters occurs. Here, the number of times every two characters co-occur can be represented by a co-occurrence count matrix of the characters, in which the element in the i-th row and j-th column represents the number of times the i-th character and the j-th character co-occur. The number of occurrences of the j-th character can be obtained by summing the values in the j-th column of the matrix, and the frequency p(i|j) with which the i-th character and the j-th character co-occur can be obtained by dividing the value in the i-th row and j-th column by that number. The co-occurrence frequency of two characters can likewise be represented by a co-occurrence frequency matrix, in which the element in the p-th row and q-th column represents the co-occurrence frequency of the p-th character and the q-th character.
The co-occurrence frequency described above does not take the order of the two characters into account. The following describes how to determine the co-occurrence frequency of every two characters when their order is considered. Specifically, the co-occurrence frequency of every two characters is then the frequency with which one of the characters appears after the other character. In this case, the element in the i-th row and j-th column of the co-occurrence count matrix represents the number of times the i-th character appears after the j-th character. The number of occurrences of the j-th character can be obtained by summing the values in the j-th column of the matrix, and the frequency p(i|j) with which the i-th character appears after the j-th character can be obtained by dividing the value in the i-th row and j-th column by that number. The co-occurrence frequency of two characters can likewise be represented by a co-occurrence frequency matrix, in which the element in the p-th row and q-th column represents the co-occurrence frequency of the p-th character and the q-th character.
In addition, after the number of times every two characters co-occur is obtained, in order to avoid the situation in which the co-occurrence count of certain characters is 0, Laplace smoothing may be performed on the co-occurrence matrix, that is, a fourth predetermined value is added to the number of times every two characters co-occur in the preprocessed corpus, so as to obtain an updated co-occurrence count for every two characters. The fourth predetermined value can be set flexibly according to actual requirements; for example, the fourth predetermined value is set to 1.
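The order-aware co-occurrence matrix, the smoothing, and the frequency p(i|j) can be sketched as follows. The sketch assumes that "appears after" means "appears immediately after", that the smoothing value is added to the counts before they are normalised by the column sums, and that the texts have already been preprocessed; the function name `ordered_cooccurrence_freq` and the sample texts are hypothetical. Latin letters stand in for corpus characters.

```python
def ordered_cooccurrence_freq(texts, smoothing=1):
    """Sketch of S430 with smoothing (hypothetical helper).

    Returns (index, freq), where freq[i][j] approximates p(i|j): the
    frequency with which character i appears directly after character j.
    """
    chars = sorted({c for t in texts for c in t})
    index = {c: k for k, c in enumerate(chars)}
    size = len(chars)
    # counts[i][j]: times character i follows character j, initialised with
    # the smoothing value (the "fourth predetermined value", assumed 1).
    counts = [[smoothing] * size for _ in range(size)]
    for t in texts:
        for prev, cur in zip(t, t[1:]):
            counts[index[cur]][index[prev]] += 1
    # Column sums stand in for the number of occurrences of character j.
    col_sums = [sum(counts[i][j] for i in range(size)) for j in range(size)]
    freq = [[counts[i][j] / col_sums[j] for j in range(size)]
            for i in range(size)]
    return index, freq


# Two tiny, already preprocessed texts (hypothetical).
index, freq = ordered_cooccurrence_freq(["SabE", "SacE"])
p_a_after_S = freq[index["a"]][index["S"]]
```

Because of the smoothing, a pair that never occurs in the corpus still receives a small non-zero frequency, so a product of pairwise frequencies does not collapse to zero on a single unseen pair.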
And S440, determining the probability of the common occurrence of every two vocabularies in the preprocessed corpus.
Here, the probability of each two words co-occurring in the corpus can be determined using the following sub-steps: and determining the co-occurrence frequency of every two words in the preprocessed corpus, and determining the co-occurrence frequency of every two words according to the obtained frequency.
Specifically, the frequency of the common occurrence of each two words can be obtained by dividing the frequency of the common occurrence of each two words by the frequency of the common occurrence of one of the words. Here, the frequency of co-occurrence of every two vocabularies can be represented by a co-occurrence frequency matrix of the vocabularies, and the ith row and the jth column of the matrix represent the frequency of co-occurrence of the ith vocabulary and the jth vocabulary. The number of occurrences of the jth word is obtained by adding the values in the jth column of the matrix, and the frequency p (i | j) of the ith word and the jth word occurring together is obtained by dividing the number by the value in the ith row and jth column. The co-occurrence frequency of two vocabularies can also be represented by a co-occurrence frequency matrix, wherein the p-th row and the Q-th column of the matrix represent the co-occurrence frequency of the p-th vocabulary and the Q-th vocabulary.
The co-occurrence frequency described above does not take into account the order of the two words, and the order of the two words is now taken into account, and a description will be given of how to determine the co-occurrence frequency of every two words in a case where the order of every two words is taken into account. Specifically, the frequency of the common occurrence of each of the two vocabularies is the frequency of the occurrence of one vocabulary in the latter vocabulary. In this case, the i row and j column elements of the co-occurrence frequency matrix represent the frequency of occurrence of the i word after the j word. The number of occurrences of the jth word is obtained by adding the values in the jth column of the matrix, and the frequency p (i | j) of the ith word occurring after the jth word is obtained by dividing the number by the value in the ith row and jth column. The co-occurrence frequency of two vocabularies can also be represented by a co-occurrence frequency matrix, wherein the element of the p-th row and the Q-th column of the matrix represents the co-occurrence frequency of the p-th vocabulary and the Q-th vocabulary.
In addition, after the co-occurrence counts of the vocabularies are obtained, Laplace smoothing may be applied to the co-occurrence count matrix to avoid the situation in which the co-occurrence count of a pair of vocabularies is 0. That is, a fifth predetermined value is added to the co-occurrence count of every two vocabularies in the preprocessed corpus to obtain the updated co-occurrence count of every two vocabularies. The fifth predetermined value can be set flexibly according to actual requirements; for example, it may be set to 1.
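The smoothing and normalization described above can be sketched as follows. This is a minimal illustration, assuming the order-aware convention in which `counts[i][j]` is the number of times item i occurs after item j. The function name `smooth_and_normalize`, the parameter `alpha` (standing in for the fifth predetermined value) and the sample matrix are assumptions made for the example and do not come from the application.

```python
def smooth_and_normalize(counts, alpha=1):
    """Turn an ordered co-occurrence count matrix into frequencies p(i | j).

    counts[i][j]: number of times item i occurred after item j.
    alpha: hypothetical name for the value added by Laplace smoothing.
    """
    n = len(counts)
    # Add alpha to every cell so that no pair keeps a zero count.
    smoothed = [[counts[i][j] + alpha for j in range(n)] for i in range(n)]
    # The sum of column j is the (smoothed) number of occurrences of item j.
    col_sums = [sum(smoothed[i][j] for i in range(n)) for j in range(n)]
    # p(i | j) = smoothed count of "i after j" / occurrences of j.
    return [[smoothed[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Made-up counts for three items, for illustration only.
counts = [
    [0, 3, 1],
    [2, 0, 0],
    [1, 1, 0],
]
probs = smooth_and_normalize(counts)
```

With this convention every column of `probs` sums to 1, and no entry is 0 after smoothing.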
S450, based on the probability that every two characters in the preprocessed corpus occur together and the probability that every two words occur together, screening suspected wrong words and suspected wrong characters from the text to be detected.
S460, obtaining the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening the vocabulary belonging to the suspected wrong vocabulary from the obtained vocabulary to obtain a target suspected vocabulary.
S470, screening target error characters from the target suspected vocabulary based on the probability of each target suspected vocabulary appearing at the current position of the text to be detected.
Further, before the probability of the co-occurrence of every two vocabularies in the preprocessed corpus is determined, the following step is included: performing word segmentation on the preprocessed texts to obtain a plurality of vocabularies of the corpus. Based on the vocabularies of the corpus, the probability of the co-occurrence of every two vocabularies in the preprocessed corpus can then be determined.
Further, before step S450 is executed, the method may further include a step of preprocessing the text to be detected: adding the first predetermined character before the first character of the text to be detected; adding the second predetermined character after the last character of the text to be detected; replacing all characters other than Chinese characters in the text to be detected with the third predetermined character; and replacing a plurality of consecutive third predetermined characters with one third predetermined character.
Further, after the text to be detected is preprocessed and before step S450 is executed, the following step is included: performing word segmentation on the text to be detected to obtain a plurality of vocabularies of the text to be detected. The suspected wrong vocabulary is then screened based on the vocabularies of the text to be detected and the corpus.
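The preprocessing of a single text can be sketched as follows. This is an illustration under stated assumptions: the markers "S", "E" and "P" follow the later embodiments, "Chinese character" is approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF, and the function name `preprocess` and the sample string are hypothetical.

```python
import re

# Marker choices taken from the later embodiments; the names are hypothetical.
START, END, OTHER = "S", "E", "P"

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text):
    """Add start/end markers and collapse non-Chinese runs into one marker."""
    # Replacing a whole run at once both substitutes the third predetermined
    # character and merges consecutive occurrences of it.
    body = NON_CHINESE_RUN.sub(OTHER, text)
    # The markers are added after the substitution so they are not replaced.
    return START + body + END


# Made-up sample text, not taken from the application.
sample = preprocess("你好,,世界123!")
```

Substituting before adding the start and end markers gives the same result as the order listed above, and avoids having to protect the markers from replacement.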
EXAMPLE five
The present embodiment provides a text error detection method. As shown in fig. 5, the method of the present embodiment includes three parts. The first part preprocesses the original text data, namely the texts in the corpus and the text to be detected, and counts how often every two characters and every two vocabularies occur together in the corpus; it is the preparation stage before text error detection. The second part, the generation stage, screens out suspected wrong characters and suspected wrong vocabularies by processing characters and vocabularies in parallel. The third part, the screening stage, filters and combines the suspected wrong vocabularies generated earlier to obtain the final target error characters.
As shown in fig. 5, the text error detection method of the present embodiment includes:
First part, preparation stage. The inputs to this stage are a complete corpus and the text to be detected. The outputs of this stage are a co-occurrence frequency matrix of every two characters, a co-occurrence frequency matrix of every two vocabularies, and a preprocessed text to be detected that can be used for subsequent error detection.
Specifically, the first part comprises the following steps:
Step one, text preprocessing: add a start identifier "S" at the beginning of each text in the corpus and of the text to be detected, add an end identifier "E" at the end of each text, and replace punctuation marks, digits, letters, special characters and the like in the sentence with a special identifier "P". It should be noted that consecutive "P"s are merged into one. In the subsequent processing, "S", "E" and "P" are treated as three legitimate Chinese characters.
Step two, perform word segmentation on the corpus and the text to be detected with an existing word segmentation tool, such as jieba, to obtain the corresponding vocabulary lists.
Step three, traverse each text of the corpus, record all characters that appear, and then assign a unique number to each character to obtain a one-to-one mapping from characters to numbers.
Step four, traverse the vocabulary list generated from each text of the corpus, record all vocabularies that appear, and then assign a unique number to each vocabulary to obtain a one-to-one mapping from vocabularies to numbers.
Step five, traverse each text of the corpus again and count the number of times every two characters occur together to obtain a co-occurrence count matrix A of the characters, where the element in the ith row and jth column of A is the number of times the ith character occurs after the jth character.
Step six, traverse the vocabulary list generated from each text of the corpus again and count the number of times every two vocabularies occur together to obtain a co-occurrence count matrix B of the vocabularies, where the element in the ith row and jth column of B is the number of times the ith vocabulary occurs after the jth vocabulary.
Step seven, to avoid the situation in which the co-occurrence count of a pair of characters or vocabularies is 0, apply Laplace smoothing to the co-occurrence count matrices, that is, add 1 to each element of matrix A and matrix B.
Step eight, sum the values in the jth column of matrix A or B to obtain the number of occurrences of the jth character or vocabulary, and divide the value in the ith row and jth column by that number to obtain the frequency p(i|j) with which the ith character or vocabulary occurs after the jth character or vocabulary, thereby obtaining the co-occurrence frequency matrix PA of the characters and the co-occurrence frequency matrix PB of the vocabularies.
With the co-occurrence count matrices and the co-occurrence frequency matrices obtained in the preparation stage, the count or frequency with which one character or vocabulary occurs after another can be looked up quickly.
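Steps three to six of the preparation stage can be sketched as follows. This is a sketch under assumptions, not the embodiment's implementation: "occurs after" is taken to mean "occurs immediately after", numbering starts at 0 rather than 1, and the function names `build_index` and `count_ordered_pairs` and the two toy texts are hypothetical. The same functions accept strings (character level) or lists of vocabularies (vocabulary level).

```python
def build_index(sequences):
    """Assign a unique number to every distinct item (character or vocabulary)."""
    index = {}
    for seq in sequences:
        for item in seq:
            # The first time an item is seen it receives the next free number.
            index.setdefault(item, len(index))
    return index


def count_ordered_pairs(sequences, index):
    """Return counts with counts[i][j] = times item i occurs right after item j."""
    n = len(index)
    counts = [[0] * n for _ in range(n)]
    for seq in sequences:
        # Walk over adjacent (previous, next) pairs of the sequence.
        for prev, nxt in zip(seq, seq[1:]):
            counts[index[nxt]][index[prev]] += 1
    return counts


# Toy "preprocessed corpus" with Latin letters standing in for characters.
corpus = ["SabcE", "SabE"]
char_index = build_index(corpus)
A = count_ordered_pairs(corpus, char_index)
```

A dense matrix is used only to mirror the description; for a real vocabulary a sparse structure such as a dictionary keyed by pairs would probably be preferable.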
Second part, generation stage. The inputs to this stage are the co-occurrence frequency matrix of every two characters, the co-occurrence frequency matrix of every two vocabularies, and the text to be detected as processed in the preparation stage (including each vocabulary of the text to be detected). The outputs of this stage are the suspected wrong characters and suspected wrong vocabularies generated with characters as objects, and the suspected wrong vocabularies generated with vocabularies as objects.
Specifically, the second part comprises the following steps:
Step one, as shown in fig. 6, with characters as the unit, use a sliding window of size 3 and calculate the probability that all characters in each window occur together. Assuming that a, b and c are the three characters in a window and their numbers are i, j and k, respectively, the probability that all characters in the window occur together is p(j|i) p(k|j). Assuming that the length of the text is I, the co-occurrence probabilities of I-2 windows are obtained.
Step two, using the percentile method, find the subscripts of all windows whose probability is below a predetermined percentile value (i.e. the second predetermined value in the above embodiment); the characters in these windows are the initial suspected wrong characters.
Step three, in the vocabulary sequence of the text to be detected, find the vocabularies that contain the suspected wrong characters of step two, which gives the positions in the sentence of the characters that may be wrong. In this step, the suspected wrong vocabularies are obtained with characters as objects.
Step four, with vocabularies as the unit, use a sliding window of size 3 and calculate the probability that all vocabularies in each window occur together. Assuming that A, B and C are the three vocabularies in a window and their numbers are I, J and K, respectively, the probability that all vocabularies in the window occur together is p(J|I) p(K|J). Assuming that the number of vocabularies after word segmentation of the text to be detected is L, the co-occurrence probabilities of L-2 windows are obtained.
Step five, using the percentile method, find the subscripts of all windows whose probability is below a predetermined percentile value (i.e. the third predetermined value in the above embodiment); the vocabularies in these windows are suspected wrong vocabularies. In this step, the suspected wrong vocabularies are obtained with vocabularies as objects.
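The window scoring and percentile screening of steps one, two, four and five can be sketched as follows. The embodiment does not say how the percentile value is computed; linear interpolation between ranks is assumed here, so that the lowest-scoring window can fall strictly below the cut-off. The function names, the `prob(nxt, prev)` lookup and the toy frequency table are all hypothetical.

```python
def window_probabilities(seq, prob, size=3):
    """Score each sliding window as a product of p(next | previous).

    prob(nxt, prev) is a hypothetical lookup returning p(nxt | prev),
    for example backed by the matrix PA or PB.
    """
    scores = []
    for start in range(len(seq) - size + 1):
        window = seq[start:start + size]
        score = 1.0
        for prev, nxt in zip(window, window[1:]):
            score *= prob(nxt, prev)
        scores.append(score)
    return scores


def percentile(values, q):
    """q-th percentile (0..100), linear interpolation between ranks (assumed)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def suspect_windows(scores, q):
    """Subscripts of the windows whose score lies below the q-th percentile."""
    if not scores:
        return []
    cutoff = percentile(scores, q)
    return [i for i, s in enumerate(scores) if s < cutoff]


# Toy frequencies, made up for illustration.
table = {("b", "a"): 0.5, ("c", "b"): 0.4, ("d", "c"): 0.01, ("e", "d"): 0.5}
scores = window_probabilities("abcde", lambda nxt, prev: table[(nxt, prev)])
flagged = suspect_windows(scores, 5)
```

The same functions apply to a list of vocabularies, since only indexing and slicing of the sequence are used. For long texts, summing logarithms instead of multiplying frequencies would be a reasonable variation to avoid underflow.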
Third part, screening stage. The inputs to this stage are the suspected wrong vocabularies and suspected wrong characters generated with characters as objects, and the suspected wrong vocabularies generated with vocabularies as objects. The output of this stage is the final target error characters.
Specifically, the third part includes the following steps:
Step one, intersect the suspected wrong vocabularies generated with characters as objects and the suspected wrong vocabularies generated with vocabularies as objects, so as to filter out vocabularies that may have been flagged by mistake and obtain the target suspected vocabularies.
Step two, perform a rationality judgment on each target suspected vocabulary from step one. Assuming that a target suspected vocabulary is word, that its position in the sequence is index, and that the vocabulary sequence obtained by segmenting the text to be detected is words_list, the rationality judgment is
p(words_list[index+1] | word) * p(word | words_list[index-1]) < threshold
where threshold is a predetermined threshold (i.e., the first predetermined value in the above embodiment). A vocabulary for which the above inequality holds may be wrong and is a target wrong vocabulary.
Step three, screen out the suspected wrong characters that appear in the target wrong vocabularies obtained in step two to obtain the final target error characters.
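The three screening steps can be sketched as follows. This is a simplified illustration: vocabularies are compared by value rather than by position, the first and last items of the sequence (the "S" and "E" markers) are skipped because they lack a neighbour on one side, and the function name `screen`, the `prob(nxt, prev)` lookup and the toy data are hypothetical.

```python
def screen(words_list, char_suspect_words, word_suspect_words, suspect_chars,
           prob, threshold):
    """Return (target wrong vocabularies, target error characters).

    words_list: segmented text to be detected, including the S/E markers.
    prob(nxt, prev): hypothetical lookup returning p(nxt | prev).
    """
    # Step one: keep vocabularies flagged by both the character pass
    # and the vocabulary pass.
    candidates = set(char_suspect_words) & set(word_suspect_words)

    # Step two: rationality judgment against the neighbouring vocabularies.
    wrong_words = []
    for index, word in enumerate(words_list):
        if word not in candidates or index in (0, len(words_list) - 1):
            continue
        first_probability = (prob(words_list[index + 1], word)
                             * prob(word, words_list[index - 1]))
        if first_probability < threshold:
            wrong_words.append(word)

    # Step three: keep the suspected characters that occur in those words.
    wrong_chars = [c for w in wrong_words for c in w if c in suspect_chars]
    return wrong_words, wrong_chars


# Toy data with Latin letters standing in for characters.
table = {("ab", "S"): 0.01}
lookup = lambda nxt, prev: table.get((nxt, prev), 0.5)
result = screen(["S", "ab", "cd", "ef", "E"],
                ["ab", "cd"], ["S", "ab", "cd"], {"b", "c", "x"},
                lookup, threshold=0.01)
```

In this toy run only "ab" fails the rationality judgment, so "c" is dropped even though it was a suspected character, which is the narrowing effect the screening stage is meant to have.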
It is generally assumed that the characters surrounding an erroneous character are correct; if several target error characters at close positions are detected, subsequent correction is strongly affected. Therefore, in practice, a small percentile value (i.e., the first predetermined value, the second predetermined value, or the third predetermined value described above) is generally used.
The text error detection method of the present embodiment detects errors with a sliding window based on an n-gram language model. On top of error detection that uses only single characters, it adds error detection with vocabularies as objects, and it treats the characters or vocabularies whose co-occurrence probability is below a predetermined percentile value as suspected wrong characters and suspected wrong vocabularies, so as to avoid missing errors that may be present in the text. After the suspected wrong characters and suspected wrong vocabularies are obtained, a series of screening steps is applied to them to obtain the final target error characters, so that characters that were originally correct are, as far as possible, not detected as errors.
EXAMPLE six
This embodiment provides a specific implementation of the text error detection method. The corpus in this embodiment consists of the texts of three months of customer service dialogs. Assume that one of the texts is "Why did my service score decrease?" and that the text to be detected is "How come orders are not assigned at the time of the dilution prize?". The following describes in detail how detection is performed with the text error detection method of the present application.
A first part, the preparation phase, comprising:
Step one, add and replace identifiers in all customer service dialog texts and in the text to be detected. For example, "Why did my service score decrease?" becomes "S why did my service score decrease PE", and "How come orders are not assigned at the time of the dilution prize?" becomes "S how come orders are not assigned at the time of the dilution prize PE".
Step two, perform word segmentation on the texts in the corpus and on the text to be detected to obtain the corresponding vocabulary lists. For example, the vocabulary list of the first text above is ["S", "why", "my", "service score", "decreased", "P", "E"], and the vocabulary list of the text to be detected is ["S", "dilution", "prize", "time", "how", "not", "assign orders", "P", "E"].
Step three, traverse each text in the corpus, record all Chinese characters that appear, and then number each character uniquely, for example {"uniform": 1, "service": 2, "divide": 3, "S": 4, "P": 5, "is": 6, "multiply": 7, …}.
Step four, traverse the vocabulary list generated from each text in the corpus, record all vocabularies that appear, and then number each vocabulary uniquely, for example {"service score": 1, "why": 2, "S": 3, "E": 4, …}.
Step five, traverse each text in the corpus again and count the number of times every two characters occur together to obtain the co-occurrence count matrix A of the characters. For example, the element in row 1, column 2 of matrix A is the number of times character 1 ("uniform") occurs after character 2 ("service").
Step six, traverse the vocabulary list generated from each text in the corpus again and count the number of times every two vocabularies occur together to obtain the co-occurrence count matrix B of the vocabularies. For example, the element in row 1, column 2 of matrix B is the number of times vocabulary 1 ("service score") occurs after vocabulary 2 ("why").
Step seven, some elements of matrices A and B may be 0; for example, the word "stream" may never appear before the word "set", and the word "passenger" may never appear after the word "sentence". To avoid this problem, 1 is added to each element of matrix A and matrix B.
Step eight, sum the values of one column of the matrix to obtain the number of occurrences of the character or vocabulary corresponding to that column; for example, the sum of column 2 of matrix A is the total number of occurrences of character 2 ("service") in the corpus. Dividing a value in that column by this total gives the corresponding frequency; for example, dividing the value in row 1, column 2 by the sum of column 2 gives p("uniform" | "service"), the frequency with which "uniform" occurs after "service". In this way, the co-occurrence frequency matrix PA of the characters and the co-occurrence frequency matrix PB of the vocabularies are obtained.
A second part, the generation phase, comprising:
Step one, with characters as the unit, use a sliding window of size 3 over the text to be detected and calculate the probability that all characters in each window occur together. For the preprocessed text to be detected, the windows are, in order: "S" followed by the two characters of "dilution"; the two characters of "dilution" followed by "prize"; the second character of "dilution" (glossed here as "light") followed by "prize" and "of"; and so on. The probability of a window is the product of the pairwise frequencies; for the first window, with c1 and c2 denoting the first and second characters of "dilution", it is p(c1 | "S") p(c2 | c1).
Step two, using the percentile method with the percentile set to 5, find all windows whose probability is below the percentile value; here this is the window consisting of "light", "prize" and "of". The three characters "light", "prize" and "of" are therefore all regarded as suspected wrong characters.
Step three, in the vocabulary sequence of the text to be detected, find the vocabularies containing "light", "prize" and "of", obtaining "dilution", "prize" and "time", which serve as the suspected wrong vocabularies obtained with characters as the unit.
Step four, with vocabularies as the unit, use a sliding window of size 3 and calculate the probability that all vocabularies in each window occur together. The vocabulary co-occurrence probabilities are calculated for the sequence ["S", "dilution", "prize", "time", "how", "not", "assign orders", "P", "E"]. The windows are, in order, ["S", "dilution", "prize"], ["dilution", "prize", "time"], ["prize", "time", "how"], …; for example, p(["S", "dilution", "prize"]) = p("dilution" | "S") p("prize" | "dilution").
Step five, using the percentile method with the percentile set to 5, find all windows whose probability is below the percentile value, here ["S", "dilution", "prize"]; the vocabularies in this window serve as the suspected wrong vocabularies obtained with vocabularies as the unit.
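Step three of the generation stage needs a way to go from flagged character positions to the vocabularies that contain them. One possible sketch is shown below, assuming that concatenating the segmented vocabularies reproduces the preprocessed text exactly; the function name `words_containing` and the toy data are hypothetical.

```python
def words_containing(words, char_positions):
    """Return the vocabularies that cover any of the given character offsets.

    words: segmentation of the text, in order; their concatenation is
    assumed to equal the preprocessed text.
    char_positions: 0-based offsets of the suspected wrong characters.
    """
    hits = []
    offset = 0
    for word in words:
        # Offsets occupied by this vocabulary in the original text.
        span = range(offset, offset + len(word))
        if any(pos in span for pos in char_positions):
            hits.append(word)
        offset += len(word)
    return hits


# Toy text "SabcdeE": a flagged size-3 window starting at offset 2
# covers offsets 2, 3 and 4, i.e. the characters "b", "c" and "d".
words = ["S", "ab", "c", "de", "E"]
start, size = 2, 3
hits = words_containing(words, set(range(start, start + size)))
```

Working with offsets rather than character values keeps a repeated character elsewhere in the sentence from pulling in an unrelated vocabulary.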
A third, screening phase, comprising:
Step one, intersect the suspected wrong vocabularies obtained in step three and in step five of the generation stage; that is, the intersection of {"dilution", "prize", "time"} and {"S", "dilution", "prize"} is {"dilution", "prize"}, and these serve as the target suspected vocabularies.
Step two, perform a rationality judgment on each target suspected vocabulary obtained. For example, to judge the rationality of "dilution", it is necessary to determine whether
p("prize" | "dilution") * p("dilution" | "S") < threshold
holds, where threshold is a predetermined threshold (i.e., the first predetermined value in the above embodiment). If it holds, the target suspected vocabulary is taken as a target wrong vocabulary.
Step three, compare all characters in the target wrong vocabularies with the three suspected wrong characters "light", "prize" and "of" obtained in step two of the generation stage, finally obtaining "light" and "prize", which serve as the target error characters.
The text error detection method of the present embodiment searches for suspected wrong vocabularies and suspected wrong characters with both characters and vocabularies as objects, and screens them with the percentile values (namely the first predetermined value, the second predetermined value and the third predetermined value). The range of suspected wrong vocabularies and suspected wrong characters is first widened and then narrowed by screening, which avoids both the defect that errors go undetected and the defect that originally correct characters are detected as wrong characters.
Based on the same technical concept, embodiments of the present application further provide a text error detection apparatus, an electronic device, a computer storage medium, and the like, which can be seen in the following embodiments.
EXAMPLE seven
The present embodiment provides a text error detection apparatus, as shown in fig. 7, the apparatus includes:
the first screening module 701 is used for screening suspected wrong words and suspected wrong characters from the text to be detected based on the corpus in which the correct text is stored;
a second screening module 702, configured to obtain, from the text to be detected, a vocabulary to which each suspected error character belongs, and screen, from the vocabularies to which the suspected error characters belong, vocabularies belonging to the suspected error vocabularies to obtain a target suspected vocabulary;
the third screening module 703 is configured to screen a target wrong character from the target suspected vocabulary based on a probability that each target suspected vocabulary appears at the current position of the text to be detected.
Further, as shown in fig. 7, the third screening module 703 includes:
the target wrong vocabulary screening submodule 7031 is configured to screen a target wrong vocabulary from the target suspected vocabularies based on the probability that each target suspected vocabulary appears at the current position of the text to be detected;
and a target error character screening submodule 7032, configured to screen characters belonging to the suspected error character from all characters of the target error vocabulary, so as to obtain the target error character.
The target wrong vocabulary screening submodule 7031 is specifically configured to determine a probability that each target suspected vocabulary appears at the current position of the text to be detected, obtain a first probability of each target suspected vocabulary, and screen a target suspected vocabulary having the first probability smaller than a first predetermined value, so as to obtain the target wrong vocabulary.
When determining the first probability, the target wrong vocabulary screening submodule 7031 is further configured to obtain a previous vocabulary of the target suspected vocabulary in the text to be detected, obtain a first vocabulary, obtain a next vocabulary of the target suspected vocabulary in the text to be detected, obtain a second vocabulary, and determine, according to a probability that every two vocabularies in the corpus occur together, a probability that the target suspected vocabulary, the first vocabulary, and the second vocabulary occur together, so as to obtain the first probability.
The probability that every two words in the corpus co-occur may be specifically a probability that one word will appear behind the other word. Then, when determining the first probability, the target wrong vocabulary screening submodule 7031 is further configured to determine, according to a probability that every two vocabularies in the corpus co-occur, a probability that the target suspected vocabulary appears behind the first vocabulary and a probability that the second vocabulary appears behind the target suspected vocabulary.
Example eight
This embodiment provides a text error detection apparatus, as shown in fig. 8, the apparatus includes:
the first screening module 801 is configured to screen suspected wrong words and suspected wrong characters from a text to be detected based on a corpus in which correct texts are stored;
the second screening module 802 is configured to obtain a vocabulary to which each suspected error character belongs from the text to be detected, and screen a vocabulary belonging to the suspected error vocabulary from the obtained vocabularies to obtain a target suspected vocabulary;
and the third screening module 803 is configured to screen a target wrong character from the target suspected vocabulary based on the probability that each target suspected vocabulary appears at the current position of the text to be detected.
Further, as shown in fig. 8, the first screening module 801 includes:
an obtaining submodule 8011, configured to obtain the corpus and the text to be detected;
the suspected error screening submodule 8012 is configured to screen the suspected error vocabulary and the suspected error character from the text to be detected based on the probability that every two characters in the corpus occur together and the probability that every two vocabularies occur together.
Further, the suspected error screening submodule 8012 is specifically configured to acquire M characters of the text to be detected, starting from the Nth character, to obtain at least one first set; determine the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability; and screen the suspected wrong characters from the first set based on the second probability; wherein N is greater than or equal to 1 and less than or equal to L-M+1, and L represents the number of characters of the text to be detected.
Further, when the suspected error character is screened from the first set, the suspected error screening sub-module 8012 is further configured to screen the first set with a second probability smaller than a second predetermined value, and obtain all characters from the first set obtained by screening, so as to obtain the suspected error character.
The probability that every two characters in the corpus occur together is specifically the probability that one of the characters occurs after the other character. At this time, when determining the second probability, the suspected error screening sub-module 8012 is further specifically configured to determine, according to the probability that every two characters in the corpus occur together, the probability that the M characters in the first set occur together according to a first predetermined sequence, so as to obtain the second probability, where the first predetermined sequence is used to indicate the sequence of the M characters in the first set in the text to be detected.
Further, the suspected error screening submodule 8012 is further specifically configured to acquire Q vocabularies of the text to be detected, starting from the Pth vocabulary, to obtain at least one second set; determine the probability of the co-occurrence of the Q vocabularies in the second set according to the probability of the co-occurrence of every two vocabularies in the corpus to obtain a third probability; and screen the suspected wrong vocabulary from the second set based on the third probability; wherein P is greater than or equal to 1 and less than or equal to K-Q+1, and K represents the number of vocabularies in the text to be detected.
Further, when screening the suspected error vocabulary from the second set, the suspected error screening sub-module 8012 is further specifically configured to screen the second sets whose third probability is smaller than a third predetermined value, and obtain all vocabularies from the second sets obtained by the screening, so as to obtain the suspected error vocabulary.
The probability that every two vocabularies in the corpus occur together is specifically the probability that one vocabulary appears after the other. In this case, when determining the third probability, the suspected error screening sub-module 8012 is further configured to determine, according to the probability that every two vocabularies in the corpus occur together, the probability that the Q vocabularies in the second set occur together in a second predetermined sequence, so as to obtain the third probability, where the second predetermined sequence is the sequence of the Q vocabularies of the second set in the text to be detected.
Example nine
This embodiment provides a text error detection apparatus, as shown in fig. 9, the apparatus includes:
the first screening module 901 is configured to screen suspected wrong words and suspected wrong characters from a text to be detected based on a corpus in which correct texts are stored;
the second screening module 902 is configured to obtain the vocabulary to which each suspected error character belongs from the text to be detected, and screen the vocabulary belonging to the suspected error vocabulary from the obtained vocabulary to obtain a target suspected vocabulary;
and a third screening module 903, configured to screen a target wrong character from the target suspected vocabulary based on a probability that each target suspected vocabulary appears at the current position of the text to be detected.
Further, as shown in fig. 9, the first screening module 901 includes:
an obtaining sub-module 9011, configured to obtain the corpus and the text to be detected;
a preprocessing submodule 9012, configured to perform preprocessing on all texts in the corpus;
a first probability determination submodule 9013, configured to determine a probability that every two characters in the preprocessed corpus occur together;
a second probability determination submodule 9014, configured to determine a probability that every two words in the preprocessed corpus occur together;
and the suspected error screening submodule 9015 is configured to screen the suspected error vocabulary and the suspected error characters from the text to be detected based on the probability that every two characters in the preprocessed corpus occur together and the probability that every two vocabularies occur together.
Further, the first probability determination sub-module 9013 is specifically configured to determine the number of times every two characters in the preprocessed corpus occur together, and to determine the frequency with which every two characters occur together from the obtained counts. The frequency with which every two characters occur together may specifically be the frequency with which one of the characters occurs after the other character.
Further, the first probability determination submodule 9013 is further configured to add a fourth predetermined value to the number of times every two characters in the preprocessed corpus occur together, to obtain an updated co-occurrence count of every two characters.
Further, the second probability determination submodule 9014 is specifically configured to determine the number of times every two vocabularies in the preprocessed corpus occur together, and to determine the frequency with which every two vocabularies occur together from the obtained counts. The frequency with which every two vocabularies occur together may specifically be the frequency with which one vocabulary occurs after the other vocabulary.
Further, the second probability determination submodule 9014 is further configured to add a fifth predetermined value to the number of times every two vocabularies in the preprocessed corpus occur together, to obtain an updated co-occurrence count of every two vocabularies.
Further, the corpus includes at least one text. The preprocessing submodule 9012 is specifically configured to add a first predetermined character before the first character of each text, add a second predetermined character after the last character of each text, replace all characters other than Chinese characters in each text with a third predetermined character, and replace a plurality of consecutive third predetermined characters with one third predetermined character.
The preprocessing submodule 9012 is further specifically configured to perform word segmentation on the preprocessed texts to obtain a plurality of vocabularies of the corpus; the second probability determination submodule 9014 is further configured to determine, based on the vocabularies of the corpus, the probability that every two vocabularies in the preprocessed corpus occur together.
Further, the preprocessing sub-module 9012 is further configured to preprocess the text to be detected. Specifically, it is configured to add the first predetermined character before the first character of the text to be detected, add the second predetermined character after the last character of the text to be detected, replace all characters other than Chinese characters in the text to be detected with the third predetermined character, and replace a plurality of consecutive third predetermined characters with one third predetermined character.
The preprocessing submodule 9012 is further configured to perform word segmentation on the text to be detected to obtain a plurality of vocabularies of the text to be detected; the suspected error screening submodule 9015 is further specifically configured to screen the suspected error vocabulary based on the vocabularies of the text to be detected and the corpus.
Example ten
The present embodiment discloses an electronic device, as shown in fig. 10, including: a processor 1001, a memory 1002, and a bus 1003, wherein the memory 1002 stores machine-readable instructions executable by the processor 1001, and wherein the processor 1001 and the memory 1002 communicate via the bus 1003 when the electronic device is operated.
The machine readable instructions, when executed by the processor 1001, perform the following text detection steps:
based on the corpus in which the correct text is stored, screening suspected wrong words and suspected wrong characters from the text to be detected;
acquiring the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening, from the acquired vocabulary, the vocabulary that also belongs to the suspected wrong vocabulary, to obtain a target suspected vocabulary;
and screening target error characters from the target suspected vocabulary based on the probability of each target suspected vocabulary appearing at the current position of the text to be detected.
In specific implementation, the screening, by the processor 1001, of the target error characters from the target suspected vocabulary specifically includes:
screening target wrong words from the target suspected words based on the probability of each target suspected word appearing at the current position of the text to be detected;
and screening characters belonging to the suspected error characters from all characters of the target error vocabulary to obtain the target error characters.
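The three screening stages above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes that the target suspected vocabulary is the set of words that both contain a suspected wrong character and were themselves flagged as suspected wrong vocabulary (the intersection step mentioned in the abstract). The names `word_spans`, `suspect_char_pos`, `suspect_word_idx`, `first_prob` and `threshold` are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # [start, end) character offsets of one segmented word


def target_error_chars(
    word_spans: List[Span],
    suspect_char_pos: Set[int],
    suspect_word_idx: Set[int],
    first_prob: Callable[[int], float],
    threshold: float,
) -> Set[int]:
    """Return positions of characters finally reported as wrong (illustrative)."""
    result: Set[int] = set()
    for i, (start, end) in enumerate(word_spans):
        # Suspected wrong characters that fall inside this word.
        hits = {p for p in range(start, end) if p in suspect_char_pos}
        # Keep only words that contain a suspected character AND are suspected words.
        if not hits or i not in suspect_word_idx:
            continue
        # Target wrong vocabulary: first probability below the first preset value.
        if first_prob(i) < threshold:
            # Target wrong characters: suspected characters of the target wrong word.
            result |= hits
    return result
```

Here `first_prob(i)` stands for whatever procedure yields the first probability of the i-th word at its current position.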
In an implementation, the screening, by the processor 1001, a target wrong vocabulary from the target suspected vocabulary specifically includes:
determining the probability of each target suspected vocabulary appearing at the current position of the text to be detected to obtain a first probability of each target suspected vocabulary;
and screening the target suspected vocabulary with the first probability smaller than a first preset value to obtain the target error vocabulary.
In specific implementation, the determining of the first probability by the processor 1001 includes:
acquiring a previous word of the target suspected word in the text to be detected to obtain a first word;
acquiring a latter vocabulary of the target suspected vocabulary in the text to be detected to obtain a second vocabulary;
and determining the probability of the common occurrence of the target suspected vocabulary, the first vocabulary and the second vocabulary according to the probability of the common occurrence of every two vocabularies in the corpus to obtain the first probability.
In specific implementation, the probability that every two words in the corpus occur together is specifically the probability that one word appears after the other word; in this case, the determining of the first probability by the processor 1001 includes:
and determining the probability that the target suspected vocabulary appears behind the first vocabulary and the second vocabulary appears behind the target suspected vocabulary according to the probability that every two vocabularies in the corpus appear together.
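As a hedged sketch of the first probability, the function below looks up two ordered word-pair probabilities: the target suspected word following the first word, and the second word following the target suspected word. Combining the two by multiplication is an assumption, since the text only states that the first probability is determined from the pairwise probabilities. The table `bigram_prob` and the fallback value `0.0` for unseen pairs are likewise hypothetical.

```python
from typing import Dict, Tuple


def first_probability(
    prev_word: str,
    target: str,
    next_word: str,
    bigram_prob: Dict[Tuple[str, str], float],
) -> float:
    """Probability of `target` at its position given its two neighbours (illustrative)."""
    # Probability that the target word appears after the first (previous) word.
    p_after_prev = bigram_prob.get((prev_word, target), 0.0)
    # Probability that the second (next) word appears after the target word.
    p_before_next = bigram_prob.get((target, next_word), 0.0)
    # Assumption: the two ordered-pair probabilities are combined as a product.
    return p_after_prev * p_before_next
```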
In specific implementation, the screening, by the processor 1001, of the suspected wrong vocabulary and the suspected wrong characters from the text to be detected specifically includes:
acquiring the corpus and a text to be detected;
and screening suspected wrong vocabularies and suspected wrong characters from the text to be detected based on the probability of the common occurrence of every two characters and the probability of the common occurrence of every two vocabularies in the corpus.
In specific implementation, the screening, by the processor 1001, of the suspected incorrect character from the text to be detected specifically includes:
acquiring M characters of the text to be detected, starting from the Nth character, to obtain at least one first set; wherein N is greater than or equal to 1 and less than or equal to L-M-1, and L represents the number of characters in the text to be detected;
determining the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability;
and screening the suspected wrong characters from the first set based on the second probability.
In specific implementation, the screening, by the processor 1001, the suspected erroneous character from the first set specifically includes:
and screening a first set with the second probability smaller than a second preset value, and acquiring all characters from the screened first set to obtain the suspected wrong characters.
In specific implementation, the probability that every two characters in the corpus occur together is specifically the probability that one of the characters occurs after the other character; in this case, the determining of the second probability by the processor 1001 includes:
and determining, according to the probability that every two characters in the corpus occur together, the probability that the M characters in the first set occur together in a first predetermined sequence, to obtain the second probability, wherein the first predetermined sequence represents the sequence of the M characters in the first set in the text to be detected.
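The window-based screening of suspected wrong characters can be illustrated as below. The sketch makes several assumptions that are not stated in the text: the second probability of a window is taken as the product of the ordered-pair probabilities of its adjacent characters, unseen pairs default to `0.0`, and the loop simply visits every start position at which a full window of M characters fits. The names `suspect_positions`, `bigram_prob` and `threshold` are hypothetical.

```python
from typing import Dict, Sequence, Set, Tuple


def suspect_positions(
    units: Sequence[str],
    m: int,
    bigram_prob: Dict[Tuple[str, str], float],
    threshold: float,
) -> Set[int]:
    """Return 0-based positions of units lying in any low-probability window."""
    suspects: Set[int] = set()
    for n in range(len(units) - m + 1):
        window = units[n:n + m]
        prob = 1.0
        # Multiply the probabilities of adjacent pairs, in text order.
        for a, b in zip(window, window[1:]):
            prob *= bigram_prob.get((a, b), 0.0)
        # A window below the preset value marks all of its units as suspected.
        if prob < threshold:
            suspects.update(range(n, n + m))
    return suspects
```

Because the function accepts any sequence of strings, the same sketch applies to a list of segmented words with a word-pair table, which corresponds to the Q-word sets and the third probability.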
In specific implementation, the screening, by the processor 1001, of the suspected incorrect vocabulary from the text to be detected specifically includes:
obtaining Q vocabularies of the text to be detected, starting from the P-th vocabulary, to obtain at least one second set; wherein P is greater than or equal to 1 and less than or equal to K-Q-1, and K represents the number of words in the text to be detected;
determining the probability of the common occurrence of Q words in the second set according to the probability of the common occurrence of every two words in the corpus to obtain a third probability;
and screening the suspected wrong vocabulary from the second set based on the third probability.
In specific implementation, the screening, by the processor 1001, of the suspected wrong vocabulary from the second set specifically includes:
and screening a second set with the third probability smaller than a third preset value, and acquiring all words from the screened second set to obtain the suspected wrong words.
In specific implementation, the probability that every two words in the corpus occur together is specifically the probability that one word appears after the other word; in this case, the determining of the third probability by the processor 1001 includes:
and determining, according to the probability that every two words in the corpus occur together, the probability that the Q words in the second set occur together in a second predetermined sequence, to obtain the third probability, wherein the second predetermined sequence is the sequence of the Q words in the second set in the text to be detected.
In particular implementation, the processor 1001 performs the following steps:
preprocessing all texts in the corpus;
determining the probability of common occurrence of every two characters in the preprocessed corpus;
determining the probability of the common occurrence of every two vocabularies in the preprocessed corpus;
and screening suspected wrong vocabularies and suspected wrong characters from the text to be detected based on the probability of common occurrence of every two characters and the probability of common occurrence of every two vocabularies in the preprocessed corpus.
In specific implementation, the determining, by the processor 1001, the probability that every two characters in the preprocessed corpus occur together specifically includes:
determining the co-occurrence frequency of every two characters in the preprocessed corpus, and determining the probability that every two characters occur together according to the obtained frequency.
In a specific implementation, the frequency of the common occurrence of every two characters is, specifically, the frequency of the occurrence of one character after the occurrence of the other character.
In particular implementation, the processor 1001 is further configured to:
and adding the co-occurrence frequency of every two characters in the preprocessed corpus to a fourth preset value to obtain the updated co-occurrence frequency of every two characters.
In specific implementation, the determining, by the processor 1001, the probability that every two words in the preprocessed corpus occur together specifically includes:
determining the co-occurrence frequency of every two words in the preprocessed corpus, and determining the probability that every two words occur together according to the obtained frequency.
In specific implementation, the frequency with which every two words occur together is specifically the frequency with which one word occurs after the other word.
In particular implementation, the processor 1001 is further configured to:
and adding the co-occurrence frequency of every two words in the preprocessed corpus to a fifth preset value to obtain the updated co-occurrence frequency of every two words.
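The frequency counting and the addition of a preset value can be sketched as follows, for either characters or words. Only the counting of ordered pairs and the addition of a constant to each frequency come from the text; the conversion to a probability by dividing by the smoothed count of the leading unit (additive smoothing) is an assumption, as are the names `bigram_probabilities` and `k`.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence


def bigram_probabilities(
    sequences: Iterable[Sequence[str]], k: float = 1.0
) -> Callable[[str, str], float]:
    """Build P(b follows a) from ordered-pair counts plus a preset value k > 0."""
    pair_counts: Counter = Counter()   # frequency of b occurring right after a
    first_counts: Counter = Counter()  # how often a occurs as the leading unit
    vocab = set()
    for seq in sequences:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    size = len(vocab)

    def prob(a: str, b: str) -> float:
        # Updated frequency = observed frequency + preset value k.
        # Assumption: normalise so the probabilities of all followers of a sum to 1.
        return (pair_counts[(a, b)] + k) / (first_counts[a] + k * size)

    return prob
```

With this smoothing, a pair never seen in the corpus still receives a small non-zero probability, so a single unseen pair does not force the probability of a whole window to zero.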
In particular implementations, the corpus includes at least one text;
the preprocessing of all texts in the corpus by the processor 1001 specifically includes:
adding a first predetermined character before a first character of each text;
adding a second predetermined character after the last character of each text;
replacing all characters other than Chinese characters in each text with a third predetermined character;
replacing a consecutive plurality of third predetermined characters with one of said third predetermined characters.
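A minimal sketch of these preprocessing steps is given below. The concrete marker characters `<`, `>` and `#` are hypothetical stand-ins for the first, second and third predetermined characters, and treating the range U+4E00 to U+9FFF as "Chinese characters" is an assumption. The sketch also replaces non-Chinese runs before adding the boundary markers, so that the markers themselves are not replaced; this ordering is a choice of the sketch.

```python
import re

# One or more consecutive characters outside the basic CJK Unified Ideographs block.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text: str, start: str = "<", end: str = ">", other: str = "#") -> str:
    """Normalise one text: collapse non-Chinese runs and add boundary markers."""
    # Replacing each whole run with a single marker covers both the replacement
    # step and the merging of consecutive third predetermined characters.
    body = NON_CHINESE_RUN.sub(other, text)
    # Add the first marker before the first character and the second after the last.
    return f"{start}{body}{end}"
```

For example, `preprocess("你好, world!再见")` yields `<你好#再见>`.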
In particular implementation, the processor 1001 is further configured to:
performing word segmentation on the preprocessed text to obtain a plurality of words of the corpus;
and determining the probability of the common occurrence of every two words in the preprocessed corpus based on the words in the corpus.
In specific implementation, the processor 1001 is further configured to perform preprocessing on the text to be detected:
adding the first predetermined character before the first character of the text to be detected;
adding the second predetermined character after the last character of the text to be detected;
replacing all characters other than Chinese characters in the text to be detected with the third predetermined character;
replacing a consecutive plurality of third predetermined characters with one of said third predetermined characters.
In particular implementation, the processor 1001 is further configured to:
performing word segmentation processing on the text to be detected to obtain a plurality of words of the text to be detected;
and screening the suspected wrong vocabulary based on the vocabulary of the text to be detected and the corpus.
Example eleven
The present embodiment discloses a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps in the text error detection method of the above-described embodiment.
The present application further provides a computer program product for performing text error detection, which includes a computer-readable storage medium storing a non-volatile program code executable by a processor, where instructions included in the program code may be used to execute the method described in the foregoing method embodiment, and specific implementation may refer to the method embodiment, which is not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (48)

1. A text error detection method, comprising:
based on the corpus in which the correct text is stored, screening suspected wrong words and suspected wrong characters from the text to be detected;
obtaining the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening, from the obtained vocabulary, the vocabulary that also belongs to the suspected wrong vocabulary, to obtain a target suspected vocabulary;
and screening target error characters from the target suspected vocabulary based on the probability of each target suspected vocabulary appearing at the current position of the text to be detected.
2. The method of claim 1, wherein the screening the target suspected vocabulary for the target wrong character comprises:
screening target wrong words from the target suspected words based on the probability of each target suspected word appearing at the current position of the text to be detected;
and screening characters belonging to the suspected error characters from all characters of the target error vocabulary to obtain the target error characters.
3. The method of claim 2, wherein the screening the target suspected vocabulary for the target wrong vocabulary comprises:
determining the probability of each target suspected vocabulary appearing at the current position of the text to be detected to obtain a first probability of each target suspected vocabulary;
and screening the target suspected vocabulary with the first probability smaller than a first preset value to obtain the target error vocabulary.
4. The method of claim 3, wherein determining the first probability comprises:
acquiring a previous word of the target suspected word in the text to be detected to obtain a first word;
acquiring a latter vocabulary of the target suspected vocabulary in the text to be detected to obtain a second vocabulary;
and determining the probability of the common occurrence of the target suspected vocabulary, the first vocabulary and the second vocabulary according to the probability of the common occurrence of every two vocabularies in the corpus to obtain the first probability.
5. The method according to claim 4, wherein the probability of each two words in the corpus occurring together is a probability that one word will appear after the other word;
determining the first probability comprises:
and determining the probability that the target suspected vocabulary appears behind the first vocabulary and the second vocabulary appears behind the target suspected vocabulary according to the probability that every two vocabularies in the corpus appear together.
6. The method according to any one of claims 1 to 5, wherein the screening of the text to be detected for suspected wrong words and characters comprises:
acquiring the corpus and a text to be detected;
and screening suspected wrong vocabularies and suspected wrong characters from the text to be detected based on the probability of the common occurrence of every two characters and the probability of the common occurrence of every two vocabularies in the corpus.
7. The method of claim 6, wherein the screening the text to be detected for suspected erroneous characters comprises:
acquiring M characters of the text to be detected, starting from the Nth character, to obtain at least one first set; wherein N is greater than or equal to 1 and less than or equal to L-M-1, and L represents the number of characters in the text to be detected;
determining the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability;
and screening the suspected wrong characters from the first set based on the second probability.
8. The method of claim 7, wherein the screening the first set for the suspected erroneous character comprises:
and screening a first set with the second probability smaller than a second preset value, and acquiring all characters from the screened first set to obtain the suspected wrong characters.
9. The method according to claim 7, wherein the probability that every two characters in the corpus occur together is, in particular, the probability that one character occurs after the other character;
determining the second probability comprises:
and determining the probability of the co-occurrence of the M characters in the first set according to a first predetermined sequence according to the probability of the co-occurrence of every two characters in the corpus to obtain the second probability, wherein the first predetermined sequence is used for representing the sequence of the M characters in the first set in the text to be detected.
10. The method of claim 6, wherein the screening of the text to be detected for suspected wrong vocabulary comprises:
obtaining Q vocabularies of the text to be detected, starting from the P-th vocabulary, to obtain at least one second set; wherein P is greater than or equal to 1 and less than or equal to K-Q-1, and K represents the number of words in the text to be detected;
determining the probability of the common occurrence of Q words in the second set according to the probability of the common occurrence of every two words in the corpus to obtain a third probability;
and screening the suspected wrong vocabulary from the second set based on the third probability.
11. The method of claim 10, wherein said screening said suspected erroneous vocabulary from said second set comprises:
and screening a second set with the third probability smaller than a third preset value, and acquiring all words from the screened second set to obtain the suspected wrong words.
12. The method according to claim 10, wherein the probability of each two words in the corpus occurring together is a probability that one word will appear after the other word;
determining the third probability comprises:
and determining, according to the probability that every two words in the corpus occur together, the probability that the Q words in the second set occur together in a second predetermined sequence, to obtain the third probability, wherein the second predetermined sequence is the sequence of the Q words in the second set in the text to be detected.
13. The method of claim 6, further comprising:
preprocessing all texts in the corpus;
determining the probability of common occurrence of every two characters in the preprocessed corpus;
determining the probability of the common occurrence of every two vocabularies in the preprocessed corpus;
and screening suspected wrong vocabularies and suspected wrong characters from the text to be detected based on the probability of common occurrence of every two characters and the probability of common occurrence of every two vocabularies in the preprocessed corpus.
14. The method of claim 13, wherein determining the probability of the co-occurrence of every two characters in the preprocessed corpus comprises:
determining the co-occurrence frequency of every two characters in the preprocessed corpus, and determining the probability that every two characters occur together according to the obtained frequency.
15. The method of claim 14, wherein the frequency with which each two characters occur together is in particular the frequency with which one character occurs after the other character.
16. The method of claim 14, further comprising:
and adding the co-occurrence frequency of every two characters in the preprocessed corpus to a fourth preset value to obtain the updated co-occurrence frequency of every two characters.
17. The method of claim 13, wherein determining the probability of co-occurrence of every two words in the preprocessed corpus comprises:
determining the co-occurrence frequency of every two words in the preprocessed corpus, and determining the probability that every two words occur together according to the obtained frequency.
18. The method of claim 17, wherein the frequency with which every two words occur together is specifically the frequency with which one word occurs after the other word.
19. The method of claim 17, further comprising:
and adding the co-occurrence frequency of every two words in the preprocessed corpus to a fifth preset value to obtain the updated co-occurrence frequency of every two words.
20. The method of claim 13, wherein the corpus comprises at least one text;
the preprocessing of all texts in the corpus comprises:
adding a first predetermined character before a first character of each text;
adding a second predetermined character after the last character of each text;
replacing all characters other than Chinese characters in each text with a third predetermined character;
replacing a consecutive plurality of third predetermined characters with one of said third predetermined characters.
21. The method of claim 20, further comprising:
performing word segmentation on the preprocessed text to obtain a plurality of words of the corpus;
and determining the probability of the common occurrence of every two words in the preprocessed corpus based on the words in the corpus.
22. The method according to claim 20 or 21, characterized in that it further comprises a step of preprocessing the text to be detected:
adding the first predetermined character before the first character of the text to be detected;
adding the second predetermined character after the last character of the text to be detected;
replacing all characters other than Chinese characters in the text to be detected with the third predetermined character;
replacing a consecutive plurality of third predetermined characters with one of said third predetermined characters.
23. The method of claim 22, further comprising:
performing word segmentation processing on the text to be detected to obtain a plurality of words of the text to be detected;
and screening the suspected wrong vocabulary based on the vocabulary of the text to be detected and the corpus.
24. A text error detection apparatus, comprising:
the first screening module is used for screening suspected wrong words and suspected wrong characters from the text to be detected based on the corpus in which the correct text is stored;
the second screening module is used for acquiring the vocabulary to which each suspected wrong character belongs from the text to be detected, and screening, from the acquired vocabulary, the vocabulary that also belongs to the suspected wrong vocabulary, to obtain a target suspected vocabulary;
and the third screening module is used for screening target wrong characters from the target suspected vocabulary based on the probability that each target suspected vocabulary appears at the current position of the text to be detected.
25. The apparatus of claim 24, wherein the third screening module comprises:
the target wrong vocabulary screening submodule is used for screening target wrong vocabularies from the target suspected vocabularies based on the probability that each target suspected vocabulary appears at the current position of the text to be detected;
and the target error character screening submodule is used for screening characters belonging to the suspected error characters from all characters of the target error vocabulary to obtain the target error characters.
26. The apparatus according to claim 25, wherein the target wrong vocabulary screening sub-module is specifically configured to determine a probability that each target suspected vocabulary appears at the current position of the text to be detected, obtain a first probability of each target suspected vocabulary, and screen the target suspected vocabulary having the first probability smaller than a first predetermined value, so as to obtain the target wrong vocabulary.
27. The apparatus according to claim 26, wherein the target wrong vocabulary screening sub-module, when determining the first probability, is further specifically configured to obtain the previous vocabulary of the target suspected vocabulary in the text to be detected as a first vocabulary, obtain the next vocabulary of the target suspected vocabulary in the text to be detected as a second vocabulary, and determine, according to the probability that every two vocabularies in the corpus occur together, the probability that the target suspected vocabulary, the first vocabulary, and the second vocabulary occur together, so as to obtain the first probability.
28. The apparatus according to claim 27, wherein the probability of each two words in the corpus occurring together is a probability that one word will appear after the other word;
when determining the first probability, the target wrong vocabulary screening submodule is further specifically configured to determine, according to a probability that every two vocabularies in the corpus occur together, a probability that the target suspected vocabulary appears behind the first vocabulary and a probability that the second vocabulary appears behind the target suspected vocabulary.
29. The apparatus of any one of claims 24 to 28, wherein the first screening module comprises:
the acquisition submodule is used for acquiring the corpus and the text to be detected;
and the suspected error screening submodule is used for screening suspected error vocabularies and suspected error characters from the text to be detected based on the probability that every two characters in the corpus appear together and the probability that every two vocabularies appear together.
30. The apparatus according to claim 29, wherein the suspected error screening submodule is specifically configured to obtain M characters of the text to be detected, starting from the Nth character, so as to obtain at least one first set; determine the probability of the co-occurrence of the M characters in the first set according to the probability of the co-occurrence of every two characters in the corpus to obtain a second probability; and screen the suspected wrong characters from the first set based on the second probability; wherein N is greater than or equal to 1 and less than or equal to L-M-1, and L represents the number of characters in the text to be detected.
31. The apparatus according to claim 30, wherein the suspected-error filtering sub-module is further configured to, when filtering the suspected-error characters from the first set, filter the first set with a second probability smaller than a second predetermined value, and obtain all the characters from the filtered first set to obtain the suspected-error characters.
32. The apparatus according to claim 30, wherein the probability that every two characters in the corpus occur together is, in particular, the probability that one character occurs after the other character;
when determining the second probability, the suspected error screening submodule is further specifically configured to determine, according to the probability that every two characters in the corpus occur together, the probability that the M characters in the first set occur together according to a first predetermined sequence, so as to obtain the second probability, where the first predetermined sequence is used to indicate the sequence of the M characters in the first set in the text to be detected.
33. The apparatus according to claim 29, wherein the suspected error screening sub-module is further configured to obtain Q vocabularies of the text to be detected, starting from the P-th vocabulary, to obtain at least one second set; determine the probability of the common occurrence of the Q words in the second set according to the probability of the common occurrence of every two words in the corpus to obtain a third probability; and screen the suspected wrong vocabulary from the second set based on the third probability; wherein P is greater than or equal to 1 and less than or equal to K-Q-1, and K represents the number of words in the text to be detected.
34. The apparatus according to claim 33, wherein the suspected-error filtering sub-module is further configured to filter a second set with a third probability smaller than a third predetermined value when the suspected-error words are filtered from the second set, and obtain all words from the filtered second set to obtain the suspected-error words.
35. The apparatus according to claim 33, wherein the probability of each two words in the corpus occurring together is a probability that one word will appear after the other word;
when determining the third probability, the suspected error screening submodule is further specifically configured to determine, according to the probability that every two words in the corpus occur together, the probability that the Q words in the second set occur together according to a second predetermined sequence, so as to obtain the third probability, where the second predetermined sequence is the sequence of the Q words in the second set in the text to be detected.
36. The apparatus of claim 29, wherein the first screening module further comprises:
the preprocessing submodule is used for preprocessing all texts in the corpus;
the first probability determination submodule is used for determining the probability of the common occurrence of every two characters in the preprocessed corpus;
the second probability determination submodule is used for determining the probability that every two words in the preprocessed corpus appear together;
the suspected error screening submodule is further used for screening suspected error vocabularies and suspected error characters from the text to be detected based on the probability that every two characters in the preprocessed corpus appear together and the probability that every two vocabularies appear together.
37. The apparatus of claim 36, wherein the first probability determination submodule is configured to determine the frequency with which every two characters occur together in the preprocessed corpus, and to determine the probability that every two characters occur together according to the determined frequency.
38. The apparatus of claim 37, wherein the frequency of occurrence of each two characters is specifically a frequency of occurrence of one character after the other character.
39. The apparatus of claim 37, wherein the first probability determination sub-module is further configured to add the frequency of occurrence of each two characters in the preprocessed corpus to a fourth predetermined value to obtain an updated frequency of occurrence of each two characters.
40. The apparatus according to claim 36, wherein the second probability determination submodule is configured to determine the frequency with which every two words occur together in the preprocessed corpus, and to determine the probability that every two words occur together according to the determined frequency.
41. The apparatus of claim 40, wherein the frequency with which every two words occur together is specifically the frequency with which one word occurs after the other word.
42. The apparatus according to claim 40, wherein the second probability determination sub-module is further configured to add a fifth predetermined value to the frequency of occurrence of each two words in the preprocessed corpus to obtain an updated frequency of occurrence of each two words.
43. The apparatus of claim 36, wherein the corpus comprises at least one text;
the preprocessing submodule is specifically configured to add a first predetermined character before the first character of each text, add a second predetermined character after the last character of each text, replace all characters other than Chinese characters in each text with third predetermined characters, and replace a plurality of consecutive third predetermined characters with one third predetermined character.
44. The apparatus according to claim 43, wherein the preprocessing sub-module is further configured to perform word segmentation on the preprocessed text to obtain a plurality of words in the corpus;
the second probability determination submodule is further used for determining the probability of the common occurrence of every two words in the preprocessed corpus based on the words in the corpus.
45. The apparatus according to claim 43 or 44, wherein the preprocessing submodule is further configured to preprocess the text to be detected, and is specifically configured to add the first predetermined character before the first character of the text to be detected, add the second predetermined character after the last character of the text to be detected, replace all characters other than Chinese characters in the text to be detected with third predetermined characters, and replace a plurality of consecutive third predetermined characters with one third predetermined character.
46. The apparatus according to claim 45, wherein the preprocessing submodule is further configured to perform word segmentation on the text to be detected to obtain a plurality of words of the text to be detected;
the suspected error screening submodule is further specifically configured to screen the suspected error vocabulary based on the vocabulary of the text to be detected and the corpus.
47. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the text error detection method of any of claims 1 to 23.
48. A computer-readable storage medium, having stored thereon a computer program for performing, when being executed by a processor, the steps of a text error detection method according to any one of claims 1 to 23.
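The character-level screening recited in claims 31, 32, 36 to 39 and 43 can be illustrated with a minimal Python sketch under stated assumptions. The function names (`preprocess`, `bigram_model`, `suspected_characters`), the placeholder symbols, the add-k normalisation, the use of a product of pairwise probabilities as the second probability, the window length and the threshold are all illustrative choices and are not taken from the claims.

```python
import re
from collections import Counter
from math import prod

# Hypothetical stand-ins for the first, second and third predetermined characters.
BOS, EOS, OTHER = "<s>", "</s>", "#"


def preprocess(text):
    """Collapse each run of non-Chinese characters into one placeholder
    and add the boundary symbols before and after the text."""
    collapsed = re.sub(r"[^\u4e00-\u9fff]+", OTHER, text)
    return [BOS, *collapsed, EOS]


def bigram_model(corpus, k=1.0):
    """Return prob(a, b): estimated probability that b occurs right after a.

    Every pair frequency is increased by k (assumed k > 0), which plays the
    role of the fourth predetermined value; the normalisation is an assumption.
    """
    pair_counts, first_counts, vocab = Counter(), Counter(), set()
    for text in corpus:
        tokens = preprocess(text)
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    size = max(len(vocab), 1)

    def prob(a, b):
        return (pair_counts[(a, b)] + k) / (first_counts[a] + k * size)

    return prob


def suspected_characters(text, prob, m=2, threshold=0.15):
    """Slide a window of m characters over the text and collect the characters
    of every window whose joint probability falls below the threshold."""
    tokens = preprocess(text)
    suspects = set()
    for start in range(len(tokens) - m + 1):
        window = tokens[start:start + m]
        # Assumed form of the second probability: product of pairwise terms.
        p = prod(prob(a, b) for a, b in zip(window, window[1:]))
        if p < threshold:
            # Skipping the helper symbols is a choice made for this sketch.
            suspects.update(c for c in window if c not in (BOS, EOS, OTHER))
    return suspects


corpus = ["今天天气很好", "今天天气不错", "明天天气很好"]
prob = bigram_model(corpus)
suspects = suspected_characters("今天夭气很好", prob)
```

With this toy corpus, only the two windows that contain the mistyped character 夭 fall below the threshold, so `suspects` holds 夭 together with its two neighbours. This matches the preliminary nature of the screening: the neighbouring characters are candidates only, and a real corpus would need a threshold tuned to its size.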
CN201811006028.0A 2018-08-30 2018-08-30 Text error detection method and device Active CN110929502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811006028.0A CN110929502B (en) 2018-08-30 2018-08-30 Text error detection method and device

Publications (2)

Publication Number Publication Date
CN110929502A (en) 2020-03-27
CN110929502B (en) 2023-08-25

Family

ID=69854939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811006028.0A Active CN110929502B (en) 2018-08-30 2018-08-30 Text error detection method and device

Country Status (1)

Country Link
CN (1) CN110929502B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859921A (en) * 2020-07-08 2020-10-30 金蝶软件(中国)有限公司 Text error correction method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065384A (en) * 2009-09-16 2011-03-31 Nippon Telegr & Teleph Corp <Ntt> Text analysis device, method, and program coping with wrong letter and omitted letter
CN106528532A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text error correction method and device and terminal
CN107729316A (en) * 2017-10-12 2018-02-23 福建富士通信息软件有限公司 The identification of wrong word and the method and device of error correction in the interactive question and answer text of Chinese

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859921A (en) * 2020-07-08 2020-10-30 金蝶软件(中国)有限公司 Text error correction method and device, computer equipment and storage medium
CN111859921B (en) * 2020-07-08 2024-03-08 金蝶软件(中国)有限公司 Text error correction method, apparatus, computer device and storage medium

Also Published As

Publication number Publication date
CN110929502B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN111079412B (en) Text error correction method and device
CN106528532B (en) Text error correction method, device and terminal
CN108287858B (en) Semantic extraction method and device for natural language
US9564127B2 (en) Speech recognition method and system based on user personalized information
CN105653517A (en) Recognition rate determining method and apparatus
CN111651978A (en) Entity-based lexical examination method and device, computer equipment and storage medium
CN108345468A (en) Programming language code duplicate checking method based on tree and sequence similarity
CN114861635B (en) Chinese spelling error correction method, device, equipment and storage medium
CN111046627B (en) Chinese character display method and system
CN110929502B (en) Text error detection method and device
CN111339756B (en) Text error detection method and device
CN101866336A (en) Methods, devices and systems for obtaining evaluation unit and establishing syntactic path dictionary
CN107092590A (en) A kind of sentence segmenting method and system
CN111259654B (en) Text error detection method and device
CN106776590A (en) A kind of method and system for obtaining entry translation
CN111291535A (en) Script processing method and device, electronic equipment and computer readable storage medium
CN109002454B (en) Method and electronic equipment for determining spelling partition of target word
CN113609864B (en) Text semantic recognition processing system and method based on industrial control system
CN113609279B (en) Material model extraction method and device and computer equipment
CN109511000A (en) Barrage classification determines method, apparatus, equipment and storage medium
CN115169328A (en) High-accuracy Chinese spelling check method, system and medium
CN115019295A (en) Model training method, text line determination method and text line determination device
CN114528824A (en) Text error correction method and device, electronic equipment and storage medium
CN113435217A (en) Language test processing method and device and electronic equipment
CN111985233A (en) Semantic association type Chinese proofreading database and Chinese proofreading system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant