CN111428474A - Language model-based error correction method, device, equipment and storage medium
- Publication number: CN111428474A (application number CN202010164817.8A)
- Authority: CN (China)
- Prior art keywords: keyword, word, word sequence, language model, error correction
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
Abstract
The invention relates to the technical field of artificial intelligence and discloses an error correction method based on a language model. Text data is segmented twice by the language model to obtain a first word sequence and a second word sequence; the segmentation probability of each keyword is calculated from the two word sequences, and the text data is re-segmented according to these probabilities to obtain a third word sequence. Keywords in the third word sequence are converted into pinyin features, candidate words are recalled from the pinyin features, and a keyword satisfying the conditions is selected from the candidate words to correct the corresponding keyword in the text data. The invention also provides an error correction device, equipment and a storage medium based on the language model. The scheme effectively improves the segmentation and identification of keywords and, by combining recall with probability, selects a correct keyword for replacement, thereby reducing the detection difficulty and improving the detection accuracy and efficiency.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an error correction method, device, equipment and storage medium based on a language model.
Background
In text error correction, the position of an error is usually detected first, candidate words for that position are then recalled, and finally the correction is performed.
At present there are many error detection methods; the commonly used ones include sequence labeling, dictionary matching, and thresholding the scores of a conventional language model. Sequence labeling requires a large amount of manually labeled corpora for supervised training; dictionary matching depends heavily on the quality of the dictionary and cannot reliably decide whether words absent from the dictionary are erroneous; and a suitable threshold for a conventional language model is difficult to determine. It can be seen that, in the prior art, the detection and correction of erroneous keywords is still relatively complicated and not very accurate.
Disclosure of Invention
The invention mainly aims to provide an error correction method, device, equipment and storage medium based on a language model, so as to solve the technical problems in the prior art that the detection process for text error correction is complex and the error correction accuracy is low.
To solve the above-mentioned problems, in a first aspect of the present invention, there is provided a language model-based error correction method, including: acquiring text data to be corrected, and performing word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence; calculating the probability of each keyword in the first word sequence and the second word sequence being segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence; calculating the word frequency of each keyword in the third word sequence, wherein the word frequency is obtained by querying statistics of a preset dictionary with the keyword; judging whether the word frequency reaches a preset word frequency threshold value; if not, converting the corresponding first keyword into pinyin characteristics through the language model; and recalling candidate words corresponding to the first keyword by using a preset recall model according to the pinyin characteristics, and selecting a correct word from the candidate words to correct the first keyword.
Optionally, in a feasible implementation manner of the first aspect of the present invention, the obtaining a first word sequence and a second word sequence by performing word segmentation processing on the text data according to a segmentation model in the language model includes: matching the field with the most continuous characters in the text data with a preset word segmentation table based on the left-to-right direction according to a forward maximum matching algorithm, and segmenting a keyword if the field exists in the word segmentation table in a matching manner until the text data is segmented completely to obtain the first word sequence; and matching the field with the most continuous characters in the text data with a preset word segmentation table based on a direction from right to left according to a reverse maximum matching algorithm, and segmenting a keyword if the field exists in the word segmentation table, until the text data is segmented completely, so as to obtain the second word sequence.
Optionally, in a feasible implementation manner of the first aspect of the present invention, the calculating a probability of occurrence of each keyword in the first word sequence and the second word sequence when segmenting words in the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence includes: determining a same second keyword and a different third keyword in the first word sequence and the second word sequence; respectively calculating the probability that third key words positioned at the left side and the right side of the second key words in the first word sequence and the second word sequence are segmented in the segmentation process of the language model on the historical text data, and calculating the total probability of all the key words at the left side and the right side; selecting word sequence segments with high total probability from the left side of second keywords in the first word sequence and the second word sequence, and selecting word sequence segments with high total probability from the right side of the second keywords in the first word sequence and the second word sequence; and forming a new word sequence based on the word sequence segments selected from the left side and the right side of the second keyword and the second keyword to obtain the third word sequence.
Optionally, in a possible implementation manner of the first aspect of the present invention, the preset dictionary at least includes a keyword and a homophone keyword of the keyword, and the calculating the word frequency of each keyword in the third word sequence includes: acquiring an index corresponding to each keyword in the third word sequence according to the corresponding relation between the keyword and the index; inquiring corresponding keywords in the preset dictionary and homophonic keywords corresponding to the keywords according to the index; and calculating the ratio of the keywords to the sum of the keywords and the homophonic keywords to obtain the word frequency of each keyword in the third word sequence.
Optionally, in a possible implementation manner of the first aspect of the present invention, the converting the corresponding first keyword into the pinyin feature through the language model includes: performing pronunciation training on the first keyword according to an acoustic model in the language model to obtain a pronunciation syllable of the first keyword; and converting the pronunciation syllables into corresponding pinyin characters according to a coding model in the language model, sequencing the pinyin characters according to the pronunciation sequence, and outputting corresponding pinyin characteristics.
Optionally, in a possible implementation manner of the first aspect of the present invention, the recalling, according to the pinyin feature, the candidate words corresponding to the first keyword using a preset recall model, and selecting a correct word from the candidate words to correct the error of the first keyword includes: according to the pinyin features, keywords with the same pinyin features are inquired from a homophone dictionary to form a homophone recall set; inquiring a corresponding inverted index according to each single character in the first keyword; according to the inverted index, corresponding keywords are inquired from a preset dictionary to form an inverted recall set; and calculating the intersection of the homophonic recall set and the inverted recall set to obtain an error correction keyword of the first keyword, and replacing the first keyword.
Optionally, in a possible implementation manner of the first aspect of the present invention, after the calculating an intersection between the homophonic recall set and the reverse-ranking recall set to obtain an error correction keyword of the first keyword, before replacing the first keyword with the error correction keyword, the method further includes: calculating the editing distance between the first keyword and each keyword in the intersection, wherein the editing distance refers to the minimum editing operation times required for converting the first keyword into the keywords in the intersection, and the keyword with the shortest editing distance in the intersection is the keyword with the highest similarity to the first keyword; and selecting the keyword with the minimum editing distance as an error correction keyword.
Further, in order to solve the above-mentioned problems, in a second aspect of the present invention, there is provided a language model-based error correction apparatus comprising: the first segmentation module is used for acquiring text data to be corrected and performing word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence; the second segmentation module is used for calculating the probability that each keyword in the first word sequence and the second word sequence is segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence; the word frequency counting module is used for calculating the word frequency of each keyword in the third word sequence, wherein the word frequency is obtained by the keyword inquiry of preset dictionary statistics; the judging module is used for judging whether the word frequency reaches a preset word frequency threshold value; the conversion module is used for converting the corresponding first keyword into pinyin characteristics through the language model when the word frequency does not reach a preset word frequency threshold; and the error correction module is used for recalling the candidate words corresponding to the first keyword by using a preset recall model according to the pinyin characteristics and selecting correct words from the candidate words to correct the first keyword.
Optionally, in a possible implementation manner of the second aspect of the present invention, the first segmentation module includes a forward segmentation unit and a reverse segmentation unit; the forward segmentation unit is used for matching the field with the maximum number of continuous characters in the text data with a preset word segmentation table based on the left-to-right direction according to a forward maximum matching algorithm, and segmenting a keyword if the field exists in the word segmentation table, until the text data is segmented completely, so as to obtain the first word sequence; and the reverse segmentation unit is used for matching the field with the most continuous characters in the text data with a preset word segmentation table based on a direction from right to left according to a reverse maximum matching algorithm, and segmenting a keyword if the field exists in the word segmentation table, until the text data is segmented completely, so as to obtain the second word sequence.
Optionally, in a possible implementation manner of the second aspect of the present invention, the second segmentation module includes a determination unit, a calculation unit, an extraction unit, and a combination unit; the determining unit is used for determining a second keyword and a third keyword which are the same and different in the first word sequence and the second word sequence; the calculation unit is used for calculating the probabilities of the third key words positioned at the left side and the right side of the second key words in the first word sequence and the second word sequence, which are segmented in the segmentation process of the language model on the historical text data, and calculating the total probabilities of all the key words at the left side and the right side; the extraction unit is used for selecting word sequence segments with high total probability from the left sides of second keywords in the first word sequence and the second word sequence and selecting word sequence segments with high total probability from the right sides of the second keywords in the first word sequence and the second word sequence; the combining unit is configured to obtain the third word sequence based on word sequence segments selected from the left and right sides of the second keyword and a new word sequence composed of the second keyword.
Optionally, in a possible implementation manner of the second aspect of the present invention, the word frequency statistics module includes a query unit and a word frequency calculation unit; the query unit is used for acquiring an index corresponding to each keyword in the third word sequence according to the corresponding relation between the keyword and the index; inquiring corresponding keywords in the preset dictionary and homophonic keywords corresponding to the keywords according to the index; the word frequency calculating unit is used for calculating the ratio of the keywords to the sum of the keywords and the homophonic keywords to obtain the word frequency of each keyword in the third word sequence.
Optionally, in a possible implementation manner of the second aspect of the present invention, the conversion module includes a pronunciation training unit and a coding unit; the pronunciation training unit is used for carrying out pronunciation training on the first keyword according to an acoustic model in the language model to obtain a pronunciation syllable of the first keyword; the coding unit is used for converting the pronunciation syllables into corresponding pinyin characters according to a coding model in the language model, sequencing the pinyin characters according to the pronunciation sequence and outputting corresponding pinyin characteristics.
Optionally, in a possible implementation manner of the second aspect of the present invention, the error correction module includes a homophonic query unit, an index query unit, and a recall unit; the homophonic query unit is used for querying key words with the same pinyin characteristics from a homophonic dictionary according to the pinyin characteristics to form a homophonic recall set; the index query unit is used for querying a corresponding inverted index according to each single character in the first key words; according to the inverted index, corresponding keywords are inquired from a preset dictionary to form an inverted recall set; the recall unit is used for calculating the intersection of the homophonic recall set and the inverted recall set, obtaining the error correction keyword of the first keyword, and replacing the first keyword.
Optionally, in a feasible implementation manner of the second aspect of the present invention, the error correction module further includes a distance calculation unit, configured to perform an edit distance calculation on the first keyword and each keyword in the intersection, where an edit distance refers to a minimum number of edit operations required to convert the first keyword into a keyword in the intersection, and a keyword in the intersection with a shortest edit distance is a keyword with a highest similarity to the first keyword; and selecting the keyword with the minimum editing distance as an error correction keyword.
Further, to solve the above-mentioned problems, in a third aspect of the present invention, there is provided an error correction apparatus based on a language model, the error correction apparatus including: a memory, a processor, and a computer readable program stored on the memory and executable on the processor, the computer readable program when executed by the processor implementing the language model based error correction method as recited in any of the above.
Further, to solve the above-mentioned problems, in a fourth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer-readable program which, when executed by one or more processors, implements the language model-based error correction method as described in any one of the above.
The invention provides an error correction method, device, equipment and storage medium based on a language model. The word frequency of each word in the corpus is calculated, and whether a word is erroneous is judged from its word frequency; the underlying principle is that correct expressions occur far more often than incorrect ones, so erroneous words in the corpus to be corrected can be identified by their low frequency. Erroneous words are then recalled and corrected, which reduces the detection difficulty and improves the detection accuracy and efficiency.
Drawings
Fig. 1 is a schematic structural diagram of a terminal provided in the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of an error correction method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of an error correction method according to the present invention;
FIG. 4 is a diagram of segmenting keywords with a moving box according to the present invention;
FIG. 5 is another diagram of segmenting keywords with a moving box according to the present invention;
FIG. 6 is a schematic diagram of forward and backward maximum matching algorithm word segmentation provided by the present invention;
FIG. 7 is a functional block diagram of an embodiment of an error correction apparatus according to the present invention;
FIG. 8 is a functional block diagram of another implementation of the error correction apparatus provided by the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The embodiment of the invention provides an error correction scheme based on a language model. The word frequency of each word in the corpus is calculated, and whether a word is erroneous is judged from that word frequency; the underlying principle is that correct expressions occur far more often than incorrect ones, so erroneous words in the corpus to be corrected can be identified by their low frequency. Erroneous words are then recalled and corrected, which reduces the detection difficulty and improves the detection accuracy and efficiency.
Fig. 2 is a flowchart of an error correction method according to an embodiment of the present invention. The method is mainly used to quickly identify and correct erroneous information in an input corpus, and it can also be applied to checking and correcting documents. The error correction method specifically includes the following steps:
201, acquiring text data to be corrected, and performing word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence;
in this step, the text data may be text information currently input by the user, or text information after voice conversion according to the input by the user, or even text information of some working templates related to insurance services.
In this embodiment, the language model is a model obtained by training a plurality of different algorithms, and the specific language model can realize the function of word segmentation and the extraction and conversion of pinyin features.
In practical application, the text data is segmented according to the segmentation rule of the segmentation model. For example, the text may be cut every two characters, and the cut may be marked by a space or by a slash. After the initial segmentation, the rule further judges whether each segment forms a word and whether it belongs to the conventional vocabulary of the service scenario; if not, adjacent segments are recombined and re-segmented. The rule may of course also segment in other patterns, such as 2-3-2.
In this embodiment, the segmentation model may also be a segmentation model obtained by training a bidirectional maximum matching algorithm, where the bidirectional maximum matching algorithm is a set of two algorithms, and mainly includes a forward maximum matching algorithm and a reverse maximum matching algorithm.
202, calculating the probability of each keyword in the first word sequence and the second word sequence being segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence;
In this step, after two word sequences with different segmentation results are obtained, the probability with which each keyword in the two word sequences may appear is calculated according to the segmentation probability formula for keywords, and the sum of the keyword probabilities in each word sequence is determined. Whether this sum is greater than the probability of a correct segmentation is then judged: if so, step 203 is executed to directly calculate the word frequency of the keywords in the word sequence; if not, the keywords of the text data are re-segmented. During re-segmentation, keywords whose individual probability is greater than the probability of a correct segmentation are retained, and the other keywords that do not satisfy the condition are recombined and segmented again.
In this embodiment, in the secondary word segmentation process, the re-segmented keywords are matched with the word list in the word segmentation process, and if the words are not satisfied, the segmentation length is extended.
203, calculating the word frequency of each keyword in the third word sequence;
In this embodiment, the word frequency is obtained by querying statistics of a preset dictionary with the keyword. The preset dictionary is a keyword dictionary learned in advance, for different service scenarios, from user corpora or human-robot conversations; alternatively, standard text data of different service types is stored in the dictionary, the number of occurrences of the keyword in the standard text data is counted, and the occurrence probability is calculated from that count and the total count over the standard text data. This occurrence probability is the word frequency.
204, judging whether the word frequency reaches a preset word frequency threshold value;
In this embodiment, which expressions count as errors may differ between service scenarios. For example, in an insurance service scenario, "safety good" is the correct expression and "safety symbol" is incorrect, even though "safety symbol" is common in ordinary usage. For this reason, whether the word frequency of "safety symbol" in the insurance scenario meets the requirement is judged; if it does not, the word is determined to be an erroneous keyword and error correction processing is required.
For example, the phrase "right crus vein embolism insurance" is segmented into "right crus / vein / embolism / insurance". Because one character of "embolism" is written with a wrong (homophonic) character, the resulting 2-gram words "vein embolism" and "embolism insurance" both have a corpus frequency of less than 10, whereas the correctly written "venous embolism" occurs 121 times, tens of times more often than the erroneous form. Thus, if the frequencies of the preceding and following 2-grams are both low, or a homophone candidate of the current 2-gram has a much higher frequency, the position can be considered to have a high probability of containing an error.
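As a concrete illustration of this 2-gram frequency check, the following sketch flags positions whose surrounding 2-grams all fall below the frequency threshold. The corpus counts, the threshold of 10 and the helper name are assumptions made for illustration and are not the patented implementation:

```python
from collections import Counter

# Assumed 2-gram counts learned from a domain corpus (illustrative numbers only;
# "embolism*" stands for the variant written with the wrong homophonic character).
bigram_freq = Counter({("venous", "embolism"): 121, ("venous", "embolism*"): 3})

FREQ_THRESHOLD = 10  # assumed word-frequency threshold

def suspicious_positions(words, freq=bigram_freq, threshold=FREQ_THRESHOLD):
    """Return indices of words whose left and right 2-grams are both below the threshold."""
    flagged = []
    for i, word in enumerate(words):
        left = freq[(words[i - 1], word)] if i > 0 else threshold
        right = freq[(word, words[i + 1])] if i + 1 < len(words) else threshold
        if left < threshold and right < threshold:
            flagged.append(i)
    return flagged

print(suspicious_positions(["right-crus", "venous", "embolism*", "insurance"]))  # -> [1, 2]
```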
205, if not, converting the corresponding first keyword into pinyin characteristics through the language model;
in this step, the conversion is performed by specifically querying a dictionary, or may be performed by using a coding model.
206, recalling the candidate words corresponding to the first keyword by using a preset recall model according to the pinyin characteristics, and selecting correct words from the candidate words to correct the first keyword.
In this step, the recall specifically includes a homophone recall and an inverted recall. The homophone recall is realized by searching for homophonic characters or words according to the pinyin sequence of the keyword. The recall as a whole consists of recalling candidate words and then selecting, from those candidates, the keyword that satisfies the requirements of the text data.
By implementing the method, the text data is segmented twice by the language model to obtain a first word sequence and a second word sequence; the segmentation probability of each keyword is calculated from the two word sequences and the text data is re-segmented according to these probabilities to obtain a third word sequence; keywords in the third word sequence are converted into pinyin features and candidate words are recalled, and a keyword satisfying the conditions is selected from the candidates to correct the corresponding keyword in the text data. This effectively improves the segmentation and identification of keywords and, by combining recall with probability, selects a correct keyword for replacement, which reduces the detection difficulty and improves the detection accuracy and efficiency.
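The flow of steps 201-206 can be outlined as the following sketch. The component interfaces (segmenter, lm, dictionary, recall_model) and the function name are assumptions introduced purely to show how the steps chain together; they are not the modules of the embodiment:

```python
def correct_text(text, segmenter, lm, dictionary, recall_model, freq_threshold):
    """Structural outline of steps 201-206 (component interfaces are assumed)."""
    seq_fwd, seq_bwd = segmenter.bimm(text)              # 201: forward + reverse segmentation
    third_seq = lm.resegment(text, seq_fwd, seq_bwd)     # 202: probability-based re-segmentation
    corrected = []
    for word in third_seq:
        tf = dictionary.word_frequency(word)              # 203: word frequency from the preset dictionary
        if tf >= freq_threshold:                          # 204: frequency reaches the threshold -> keep
            corrected.append(word)
            continue
        pinyin = lm.to_pinyin(word)                       # 205: pinyin feature conversion
        candidates = recall_model.recall(word, pinyin)    # 206: homophone + inverted-index recall
        corrected.append(recall_model.best(word, candidates) or word)
    return "".join(corrected)
```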
Fig. 3 is a second implementation flow of the error correction method according to the embodiment of the present invention, where the implementation mainly expands candidate words according to pinyin features of suspected erroneous objects, and then selects a closest candidate word from the candidate words to implement replacement processing for error correction, where the closest candidate word refers to a candidate word having a maximum probability, and the implementation steps are as follows:
301, acquiring text data to be corrected;
in the step, the method further comprises the steps of copying the text data to obtain two text data copies, and performing word segmentation operation twice respectively based on the text data copies.
302, matching the field with the most continuous characters in the text data with a preset word segmentation table based on the left-to-right direction according to a forward maximum matching algorithm;
303, if the field exists in the word segmentation table in a matching manner, segmenting a keyword until the text data segmentation is completed to obtain the first word sequence;
In this embodiment, in the word segmentation process of the forward maximum matching algorithm, the field length of a text frame is first determined, a field of that length is selected from the text data copy to form a keyword sequence, and the word segmentation table is queried to determine whether this keyword exists. If it does not exist, the field length of the text frame is adjusted and the text data copy is re-segmented based on the adjusted field length, until the whole text has been matched.
304, matching the field with the most continuous characters in the text data with a preset word segmentation table based on the direction from right to left according to a reverse maximum matching algorithm;
305, if the field exists in the word segmentation table in a matching way, segmenting a keyword until the text data segmentation is finished to obtain the second word sequence;
in this embodiment, the process of segmenting the text data copy by reverse matching is the same as the way of segmenting by forward matching, and is not repeated here.
In practical application, when the text data is segmented by the forward or reverse maximum matching algorithm, a text box, specifically a moving box, is set. Starting from the first character of the text data, the consecutive characters framed by the moving box are matched against a preset word list. If they match an entry in the word list, the framed characters are segmented off and the starting position of the moving box is moved to the character following them; if not, the moving box is extended by the length of one character and the newly framed consecutive characters are matched against the word list again.
For example, the moving box is first set to a size of two characters. For the text "did I catch a cold in these days and can a safety character?", the box starts from the character "I" and frames "I this"; the segmentation table is queried for whether the same keyword exists, and if not, the moving box is lengthened to three characters, and so on, until a framed keyword exists in the segmentation table, i.e. "I these few days". The moving box is then moved to the position after the segmented keyword (i.e., after the character "days"), its field length is restored, and the remaining text data is segmented in the same way, as shown in fig. 5.
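A minimal sketch of the forward and reverse maximum matching described above is given below, in the standard "longest window first" formulation; the example word table, the maximum window length and the sample sentence (a classic segmentation example, not taken from the patent) are assumptions for illustration only:

```python
def forward_max_match(text, word_table, max_len=4):
    """Forward maximum matching: try the longest window first, then shrink."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in word_table:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, word_table, max_len=4):
    """Reverse maximum matching: the same idea, scanning from the right end."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in word_table:
                words.insert(0, piece)
                j -= size
                break
    return words

word_table = {"研究", "研究生", "生命", "命", "的", "起源"}   # assumed segmentation table
print(forward_max_match("研究生命的起源", word_table))    # ['研究生', '命', '的', '起源']
print(backward_max_match("研究生命的起源", word_table))   # ['研究', '生命', '的', '起源']
```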
306, calculating the probability of each keyword in the first word sequence and the second word sequence being segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence;
in this step, the process of forming the third word sequence may be specifically implemented by the following steps:
determining a same second keyword and a different third keyword in the first word sequence and the second word sequence;
respectively calculating the probability that third key words positioned at the left side and the right side of the second key words in the first word sequence and the second word sequence are segmented in the segmentation process of the language model on the historical text data, and calculating the total probability of all the key words at the left side and the right side;
selecting word sequence segments with high total probability from the left side of second keywords in the first word sequence and the second word sequence, and selecting word sequence segments with high total probability from the right side of the second keywords in the first word sequence and the second word sequence;
and forming a new word sequence based on the word sequence segments selected from the left side and the right side of the second keyword and the second keyword to obtain the third word sequence.
In practical application, as shown in fig. 6, the bidirectional maximum matching algorithm is applied to a sentence such as "our soldiers want to eat". The forward maximum matching algorithm and the reverse maximum matching algorithm produce two slightly different segmentations. Based on the two results, the segments that are identical in both and the segments that differ are determined, along with the differing segments on the left and right of each identical segment. The total probability of the differing segments on each side in each word sequence, i.e. of the segment sequences labeled ①②③④ in the figure, is then calculated; the segment sequence with the higher probability is selected on each side, and the selected segment sequences are recombined into a new word sequence, i.e. the third word sequence. For example, if ① > ② and ④ > ③, then ① and ④ are selected and recombined into the new word sequence.
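The selection of the higher-probability fragments on either side of a shared keyword can be sketched as follows; the fragment contents and the per-word log-probabilities are invented for illustration and do not come from the patent:

```python
# Assumed per-word segmentation log-probabilities (illustrative values only).
word_logp = {"our": -1.5, "soldiers": -3.0, "soldier": -4.5, "s": -8.0,
             "want": -2.0, "to": -1.8, "eat": -2.2, "a-meal": -3.5, "meal": -2.4}

def total_logp(fragment, logp=word_logp, unk=-12.0):
    """Total log-probability of a word-sequence fragment."""
    return sum(logp.get(w, unk) for w in fragment)

def merge_around_shared(shared, left_fwd, left_bwd, right_fwd, right_bwd):
    """On each side of a shared keyword, keep the fragment (forward vs. reverse
    segmentation) whose total probability is higher, then recombine."""
    left = left_fwd if total_logp(left_fwd) >= total_logp(left_bwd) else left_bwd
    right = right_fwd if total_logp(right_fwd) >= total_logp(right_bwd) else right_bwd
    return left + [shared] + right

# Fragments ①/② lie to the left and ③/④ to the right of the shared word "want".
third_sequence = merge_around_shared(
    shared="want",
    left_fwd=["our", "soldiers"], left_bwd=["our", "soldier", "s"],
    right_fwd=["eat", "a-meal"],  right_bwd=["to", "eat", "meal"],
)
print(third_sequence)   # ['our', 'soldiers', 'want', 'eat', 'a-meal']
```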
In this embodiment, after the keywords with the higher segmentation probability have been determined, the word frequency of each keyword in the third word sequence is calculated, and whether a keyword is erroneous is judged from its word frequency.
307, calculating the word frequency of each keyword in the third word sequence;
in this embodiment, when the preset dictionary at least includes the keyword and the homophone keyword of the keyword, the steps are specifically implemented as follows:
acquiring an index corresponding to each keyword in the third word sequence according to the corresponding relation between the keyword and the index;
inquiring corresponding keywords in the preset dictionary and homophonic keywords corresponding to the keywords according to the index;
and calculating the ratio of the keywords to the sum of the keywords and the homophonic keywords to obtain the word frequency of each keyword in the third word sequence.
In practical application, the dictionary and the index are obtained by learning from historical dialogue corpora in advance: the corpora are segmented into words, a 2-gram frequency dictionary of the segmented corpora is counted, a homophone dictionary at the pinyin level of the 2-grams is built, and an inverted dictionary at the character level of the 2-gram words is built. For example, "right crus vein embolism insurance" is segmented into "right crus / vein / embolism / insurance"; the frequency of the 2-gram word "venous embolism" in the corpus is counted, e.g. 121; the pinyin of "venous embolism" is taken, i.e. "jing mai shuan sai"; the other 2-gram words in the corpus with the same pinyin as "venous embolism", such as "meridian embolism", are grouped with it; and an inverted index is then built for each character of "venous embolism", e.g. the character pronounced "shuan" maps to (venous embolism, fire hydrant, suppository, ...).
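A simplified sketch of building the 2-gram frequency dictionary, the pinyin-level homophone dictionary and the character-level inverted index, together with the word-frequency ratio of step 307, might look as follows; the hard-coded pinyin lookup stands in for a real pinyin conversion (for example a library such as pypinyin), and its two entries are assumed to correspond to the "venous embolism" / "meridian embolism" example above:

```python
from collections import Counter, defaultdict

# Hypothetical pinyin lookup (assumption): same pinyin, different characters.
PINYIN = {"静脉栓塞": "jing mai shuan sai", "经脉栓塞": "jing mai shuan sai"}

def build_dictionaries(segmented_corpus):
    """Build the 2-gram frequency dictionary, the pinyin-level homophone dictionary
    and the character-level inverted index from a word-segmented corpus (sketch)."""
    freq, homophones, inverted = Counter(), defaultdict(set), defaultdict(set)
    for words in segmented_corpus:
        for w1, w2 in zip(words, words[1:]):
            gram = w1 + w2
            freq[gram] += 1                                  # 2-gram frequency dictionary
            homophones[PINYIN.get(gram, gram)].add(gram)     # homophone dictionary (pinyin level)
            for ch in gram:
                inverted[ch].add(gram)                       # inverted index (character level)
    return freq, homophones, inverted

def word_frequency(gram, freq, homophones):
    """Word frequency of step 307: the keyword's count divided by the total count of
    the keyword plus its homophonic keywords."""
    group = homophones.get(PINYIN.get(gram, gram), {gram})
    total = sum(freq[g] for g in group)
    return freq[gram] / total if total else 0.0
```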
308, judging whether the word frequency reaches a preset word frequency threshold value;
309, if not, carrying out pronunciation training on a corresponding first keyword according to an acoustic model in the language model to obtain a pronunciation syllable of the first keyword;
310, converting the pronunciation syllables into corresponding pinyin characters according to a coding model in the language model, sequencing the pinyin characters according to the pronunciation sequence, and outputting corresponding pinyin characteristics;
In this embodiment, the acoustic model is obtained by learning the correspondence between Chinese keywords and pinyin. In practical application, the acoustic model and the Chinese keyword are used to query this correspondence and extract the corresponding pinyin features, homophones are queried based on the pinyin features, and finally the correct keyword is selected from the homophones for replacement and error correction.
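As a rough illustration of the pinyin-feature conversion, the sketch below uses the open-source pypinyin package in place of the acoustic and coding models of the language model; this substitution, and the space-joined feature format, are assumptions made only to show the shape of the output:

```python
from pypinyin import lazy_pinyin  # pip install pypinyin

def pinyin_features(keyword):
    """Convert a keyword into its pinyin feature: syllables in pronunciation order."""
    syllables = lazy_pinyin(keyword)   # e.g. ['jing', 'mai', 'shuan', 'sai']
    return " ".join(syllables)

print(pinyin_features("静脉栓塞"))      # -> "jing mai shuan sai"
```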
311, according to the pinyin features, searching keywords with the same pinyin features from a preset homophone dictionary to form a homophone recall set;
312, querying a corresponding preset inverted index according to each individual character in the first keyword;
313, according to the inverted index, querying the corresponding keywords from the preset inverted dictionary to form an inverted recall set;
and 314, calculating the intersection of the homophonic recall set and the inverted recall set to obtain an error correction keyword of the first keyword, and replacing the first keyword.
In this embodiment, when the corresponding homophonic keyword is queried, the retrieval may be specifically implemented by using a recall algorithm, and in practical application, the above homophonic dictionary and inverted dictionary are used to recall corresponding candidates:
Homophone recall: the wrong "embolism and guarantee" -> jing mai shuan sai -> the correct "venous embolism";
Inverted recall: the correct "venous embolism" is recalled using the intersection of the inverted indexes of the characters (vein, plug).
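The combined recall can be sketched as follows. The rule of requiring a candidate to share at least a given number of characters with the keyword, and the parameter min_shared, are assumptions introduced because a strict intersection over every character would exclude candidates that differ from the erroneous keyword in exactly the wrong character:

```python
from collections import Counter

def recall_candidates(keyword, pinyin_key, homophone_dict, inverted_index, min_shared=2):
    """Homophone recall by pinyin, inverted-index recall by the characters of the
    keyword, then the intersection of both candidate sets (simplified sketch)."""
    homophone_set = set(homophone_dict.get(pinyin_key, set()))

    # Inverted recall: count how many characters of the keyword each indexed word shares.
    shared = Counter()
    for ch in keyword:
        for word in inverted_index.get(ch, set()):
            shared[word] += 1
    inverted_set = {w for w, n in shared.items() if n >= min_shared}

    # Intersection of the two recall sets, excluding the (possibly wrong) keyword itself.
    return (homophone_set & inverted_set) - {keyword}
```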
In this embodiment, texts or sentences are checked and corrected in the manner described above. By comparing word frequencies, the characteristic that a correct expression occurs far more often than an incorrect one is fully exploited, which gives error detection a solid statistical basis and avoids the black-box algorithms of conventional methods such as sequence labeling models. No manually labeled data is required: it is only necessary to count a large corpus from the existing field, such as customer question corpora in life insurance, to obtain word-level n-gram frequency information, which also makes migration to other fields easy. Moreover, the method only needs some simple table look-up operations, which greatly improves efficiency.
In order to further improve the screening of the correct keywords, the embodiment further includes, before the error correction and replacement:
calculating the editing distance between the first keyword and each keyword in the intersection, wherein the editing distance refers to the minimum editing operation times required for converting the first keyword into the keywords in the intersection, and the keyword with the shortest editing distance in the intersection is the keyword with the highest similarity to the first keyword;
and selecting the keyword with the minimum editing distance as an error correction keyword.
The above editing operations may include character replacement, character insertion, character deletion, and the like. For example, transforming the erroneous phrase "statement cannot withstand" into the correct phrase "the light that life cannot endure" requires only three character replacement operations; therefore "the light that life cannot endure" is taken as the word with the highest similarity to "statement cannot withstand".
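The edit distance used here is the classic Levenshtein distance; a minimal dynamic-programming sketch, together with the selection of the closest candidate, is shown below (function names are illustrative):

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # delete ca
                                     dp[j - 1] + 1,       # insert cb
                                     prev + (ca != cb))   # substitute ca -> cb
    return dp[-1]

def best_correction(keyword, candidates):
    """Select the candidate with the smallest edit distance to the keyword."""
    return min(candidates, key=lambda c: edit_distance(keyword, c)) if candidates else None

print(edit_distance("kitten", "sitting"))   # -> 3
```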
In summary, the word frequency of each keyword is calculated and whether the keyword is erroneous is judged from that word frequency; if it is erroneous, candidates are recalled according to its pinyin for error correction, which reduces the detection difficulty and improves the detection accuracy and efficiency.
In order to solve the above problem, an embodiment of the present invention further provides an error correction apparatus based on a language model, as shown in fig. 7, the error correction apparatus includes:
the first segmentation module 701 is configured to obtain text data to be corrected, and perform word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence;
a second segmentation module 702, configured to calculate probabilities that keywords in the first word sequence and the second word sequence are segmented by the language model, and perform secondary word segmentation processing on the text data according to the probabilities to obtain a third word sequence;
a word frequency statistics module 703, configured to calculate a word frequency of each keyword in the third word sequence, where the word frequency is obtained by querying a preset dictionary through the keyword;
a determining module 704, configured to determine whether the word frequency reaches a preset word frequency threshold;
a conversion module 705, configured to convert the corresponding first keyword into a pinyin feature through the language model when the word frequency does not reach a preset word frequency threshold;
and the error correction module 706 is configured to recall the candidate words corresponding to the first keyword according to the pinyin features by using a preset recall model, and select a correct word from the candidate words to correct the first keyword.
The device realizes error correction processing of phrases or fields in a text. Text data is segmented twice by the language model to obtain a first word sequence and a second word sequence; the segmentation probability of each keyword is calculated from the two word sequences and the text data is re-segmented according to these probabilities to obtain a third word sequence; keywords in the third word sequence are converted into pinyin features and candidate words are recalled, and a keyword satisfying the conditions is selected from the candidates to correct the corresponding keyword in the text data. This effectively improves the segmentation and identification of keywords and, by combining recall with probability, selects a correct keyword for replacement, reducing the detection difficulty and improving the detection accuracy and efficiency.
As shown in fig. 8, an embodiment of the present invention further provides another error correction apparatus, where the apparatus includes:
the first segmentation module 701 is configured to obtain text data to be corrected, and perform word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence;
a second segmentation module 702, configured to calculate probabilities that keywords in the first word sequence and the second word sequence are segmented by the language model, and perform secondary word segmentation processing on the text data according to the probabilities to obtain a third word sequence;
a word frequency statistics module 703, configured to calculate a word frequency of each keyword in the third word sequence, where the word frequency is obtained by querying a preset dictionary through the keyword;
a determining module 704, configured to determine whether the word frequency reaches a preset word frequency threshold;
a conversion module 705, configured to convert the corresponding first keyword into a pinyin feature through the language model when the word frequency does not reach a preset word frequency threshold;
and the error correction module 706 is configured to recall the candidate words corresponding to the first keyword according to the pinyin features by using a preset recall model, and select a correct word from the candidate words to correct the first keyword.
In another embodiment of the present invention, the first segmentation module 701 includes a forward segmentation unit 7011 and a reverse segmentation unit 7012;
the forward segmentation unit 7011 is configured to match, according to a forward maximum matching algorithm, a field with the largest number of consecutive characters in the text data with a preset word segmentation table based on a left-to-right direction, and if the field is matched to exist in the word segmentation table, segment a keyword until the text data is segmented completely to obtain the first word sequence;
the reverse segmentation unit 7012 is configured to match, according to a reverse maximum matching algorithm, a field with the largest number of consecutive characters in the text data with a preset word segmentation table based on a direction from right to left, and segment a keyword if the field exists in the word segmentation table after matching, until the text data is segmented completely, to obtain the second word sequence.
Optionally, the second segmentation module 702 includes a determination unit 7021, a calculation unit 7022, an extraction unit 7023, and a combination unit 7024;
the determining unit 7021 is configured to determine a second keyword and a third keyword that are the same and different in the first word sequence and the second word sequence;
the calculating unit 7022 is configured to calculate probabilities that third keywords located on the left and right sides of the second keyword in the first word sequence and the second word sequence are segmented in the segmentation process of the language model on the historical text data, and calculate a total probability of all the keywords on the left and right sides;
the extracting unit 7023 is configured to select the word sequence with the high total probability from the left side of the second keyword in the first word sequence and the second word sequence, and select the word sequence with the high total probability from the right side of the second keyword in the first word sequence and the second word sequence;
the combining unit 7024 is configured to obtain the third word sequence based on word sequence segments selected from the left and right sides of the second keyword and a new word sequence formed by the second keyword.
In this embodiment, the word frequency statistics module 703 includes a query unit 7031 and a word frequency calculation unit 7032;
the querying unit 7031 is configured to obtain an index corresponding to each keyword in the third word sequence according to a correspondence between the keyword and the index; inquiring corresponding keywords in the preset dictionary and homophonic keywords corresponding to the keywords according to the index;
the word frequency calculating unit 7032 is configured to calculate a ratio of the keyword to a sum of the keyword and the homophonic keyword, and obtain a word frequency of each keyword in the third word sequence.
Optionally, in another possible implementation, the conversion module 705 includes a pronunciation training unit 7051 and a coding unit 7052;
the pronunciation training unit 7051 is configured to perform pronunciation training on the first keyword according to an acoustic model in the language model to obtain a pronunciation syllable of the first keyword;
the coding unit 7052 is configured to convert the pronunciation syllables into corresponding pinyin characters according to a coding model in the language model, sort the pinyin characters according to a pronunciation sequence, and output corresponding pinyin features.
Optionally, in another possible implementation manner of this embodiment, the error correction module 706 includes a homophonic query unit 7061, an index query unit 7062, and a recall unit 7063;
the homophonic query unit 7061 is configured to query, according to the pinyin features, keywords having the same pinyin features from a homophonic dictionary to form a homophonic recall set;
the index querying unit 7062 is configured to query the corresponding inverted index according to each single character in the first keyword; according to the inverted index, corresponding keywords are inquired from a preset dictionary to form an inverted recall set;
the recall unit 7063 is configured to calculate an intersection of the homophonic recall set and the inverted recall set, obtain an error correction keyword of the first keyword, and replace the first keyword.
In this embodiment, the error correction module 706 further includes a distance calculation unit 7064, configured to perform an edit distance calculation on the first keyword and each keyword in the intersection, where an edit distance refers to a minimum number of edit operations required to convert the first keyword into a keyword in the intersection, and a keyword in the intersection with a shortest edit distance is a keyword with a highest similarity to the first keyword; and selecting the keyword with the minimum editing distance as an error correction keyword.
The execution function and the execution flow corresponding to the function based on the apparatus are the same as the contents described in the above error correction method embodiment of the present invention, and therefore the contents of the embodiment of the error correction apparatus are not described in detail in this embodiment.
In the embodiment of the present invention, the error correction apparatus may be specifically implemented in the form of a server, that is, the apparatus implementing the error correction method is provided as a function of the server in the input method system.
The present invention also provides an error correction apparatus based on a language model, the error correction apparatus comprising: a memory, a processor, and a computer readable program stored on the memory and executable on the processor. For the method implemented when the computer readable program is executed by the processor, reference may be made to the embodiments of the language model-based error correction method of the present invention, and redundant description is therefore not repeated.
In practical applications, the error correction apparatus may adopt an existing terminal structure, usually a mobile terminal, in which the error correction function is started by a scanning function of the mobile terminal; the function of the error correction method is realized by providing a computer readable program on the terminal. Fig. 1 is a schematic structural diagram of the operating environment of the terminal according to an embodiment of the present invention.
As shown in fig. 1, the terminal includes: a processor 101, e.g. a CPU, a communication bus 102, a user interface 103, a network interface 104, a memory 105. Wherein the communication bus 102 is used for enabling connection communication between these components. The user interface 103 may comprise a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the network interface 104 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface). The memory 105 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 105 may optionally also be a storage device separate from the aforementioned processor 101.
It will be understood by those skilled in the art that the hardware configuration of the terminal shown in fig. 1 does not constitute a limitation of the error correction apparatus and device of the present invention, and may include more or less components than those shown, or some components may be combined, or a different arrangement of components may be provided.
As shown in fig. 1, the memory 105, as a computer-readable storage medium, may include an operating system, a network communication program module, a user interface program module, and computer-readable programs/instructions for implementing the error correction method. The operating system is used to schedule communication between the modules in the terminal and to execute the computer-readable programs/instructions stored in the memory, that is, the error correction method in the above embodiments.
In the hardware configuration of the terminal shown in fig. 1, the network interface 104 is mainly used for accessing a network; the user interface 103 is mainly used for monitoring and acquiring the data to be processed, for example the text data to be corrected; and the processor 101 may be used to call the computer-readable program stored in the memory 105 and execute the operations of the embodiments of the error correction method described above.
The invention also provides a computer readable storage medium.
In this embodiment, the computer readable storage medium stores a computer readable program, and the method implemented when the computer readable program is executed by one or more processors may refer to each embodiment of the error correction method based on the language model of the present invention, so that redundant description is omitted.
According to the method and the device provided by the embodiments of the invention, the word frequency of each word in the corpus is calculated and whether a word is erroneous is judged from its word frequency; the underlying principle is that correct expressions occur far more often than incorrect ones, so erroneous words in the corpus to be corrected can be identified by their low frequency. Those words are then recalled and corrected, which reduces the detection difficulty and improves the detection accuracy and efficiency.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.
Claims (10)
1. An error correction method based on a language model is characterized by comprising the following steps:
acquiring text data to be corrected, and performing word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence;
calculating the probability of each keyword in the first word sequence and the second word sequence being segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence;
calculating the word frequency of each keyword in the third word sequence, wherein the word frequency is obtained by querying preset dictionary statistics with the keyword;
judging whether the word frequency reaches a preset word frequency threshold value;
if not, converting the corresponding first keyword into pinyin characteristics through the language model;
and recalling candidate words corresponding to the first keyword by using a preset recall model according to the pinyin characteristics, and selecting correct words from the candidate words to correct the first keyword.
2. The language model-based error correction method according to claim 1, wherein performing word segmentation processing on the text data according to the segmentation model in the language model to obtain the first word sequence and the second word sequence comprises:
matching the longest field of consecutive characters in the text data against a preset word segmentation table from left to right according to a forward maximum matching algorithm, and segmenting out a keyword whenever the field is found in the word segmentation table, until the text data is fully segmented, to obtain the first word sequence;
and matching the longest field of consecutive characters in the text data against the preset word segmentation table from right to left according to a reverse maximum matching algorithm, and segmenting out a keyword whenever the field is found in the word segmentation table, until the text data is fully segmented, to obtain the second word sequence.
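A minimal sketch of the forward and reverse maximum matching of claim 2, assuming an in-memory word segmentation table and a fixed maximum keyword length; the table contents, `MAX_LEN`, and the function names are illustrative:

```python
WORD_TABLE = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}  # illustrative segmentation table
MAX_LEN = 4  # assumed maximum keyword length

def forward_mm(text: str) -> list[str]:
    """Forward maximum matching: scan left to right, prefer the longest table entry."""
    result, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # fall back to a single character when no table entry matches
            if size == 1 or piece in WORD_TABLE:
                result.append(piece)
                i += size
                break
    return result

def reverse_mm(text: str) -> list[str]:
    """Reverse maximum matching: scan right to left, prefer the longest table entry."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in WORD_TABLE:
                result.insert(0, piece)
                j -= size
                break
    return result

sentence = "南京市长江大桥"
first_word_sequence = forward_mm(sentence)    # e.g. ['南京市', '长江大桥']
second_word_sequence = reverse_mm(sentence)   # e.g. ['南京市', '长江大桥']
```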
3. The language model-based error correction method according to claim 2, wherein calculating the probability of each keyword in the first word sequence and the second word sequence being segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain the third word sequence, comprises:
determining the second keywords that are the same in the first word sequence and the second word sequence and the third keywords that differ between them;
respectively calculating, for the first word sequence and the second word sequence, the probability that the third keywords located on the left and right sides of the second keyword were segmented by the language model when segmenting historical text data, and calculating the total probability of all the keywords on each side;
selecting, between the first word sequence and the second word sequence, the word sequence segment with the higher total probability on the left side of the second keyword and the word sequence segment with the higher total probability on the right side of the second keyword;
and forming a new word sequence from the word sequence segments selected on the left and right sides of the second keyword together with the second keyword, to obtain the third word sequence.
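A sketch of the segment selection in claim 3, under two assumptions that the claim does not spell out: the per-keyword segmentation probabilities are hypothetical values, and the total probability of a segment is taken as the product of its keyword probabilities (computed in log space); all names are illustrative:

```python
import math

# Hypothetical probabilities assigned by the language model to each keyword
# when segmenting historical text data.
SEG_PROB = {"人工": 0.9, "智能": 0.85, "人工智": 0.05, "能": 0.3, "发展": 0.8}

def segment_probability(words: list[str]) -> float:
    """Total log-probability of a word sequence segment under the model."""
    return sum(math.log(SEG_PROB.get(w, 1e-6)) for w in words)

def merge_sequences(left_a, left_b, anchor, right_a, right_b):
    """Keep the higher-probability segment on each side of the shared second keyword."""
    left = left_a if segment_probability(left_a) >= segment_probability(left_b) else left_b
    right = right_a if segment_probability(right_a) >= segment_probability(right_b) else right_b
    return left + [anchor] + right

# First word sequence: ['人工', '智能', '发展']; second: ['人工智', '能', '发展'];
# '发展' is the second keyword shared by both sequences.
third_word_sequence = merge_sequences(["人工", "智能"], ["人工智", "能"], "发展", [], [])
print(third_word_sequence)  # ['人工', '智能', '发展']
```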
4. The language model-based error correction method according to any one of claims 1-3, wherein the preset dictionary contains at least keywords and homophonic keywords of the keywords, and the calculating the word frequency of each keyword in the third word sequence comprises:
acquiring an index corresponding to each keyword in the third word sequence according to the corresponding relation between the keyword and the index;
inquiring corresponding keywords in the preset dictionary and homophonic keywords corresponding to the keywords according to the index;
and calculating the ratio of the count of the keyword to the sum of the counts of the keyword and its homophonic keywords, to obtain the word frequency of each keyword in the third word sequence.
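A minimal sketch of the word frequency computation in claim 4, assuming a preset dictionary that stores occurrence counts for each keyword and its homophonic keywords; the dictionary contents and the keyword-to-index mapping are illustrative:

```python
KEYWORD_INDEX = {"语言": 0, "寓言": 1}  # assumed keyword-to-index mapping

# Illustrative preset dictionary keyed by index: the keyword's occurrence count
# and the counts of its homophonic keywords.
PRESET_DICT = {
    0: {"keyword": "语言", "count": 980, "homophones": {"寓言": 15, "预言": 40}},
    1: {"keyword": "寓言", "count": 15, "homophones": {"语言": 980, "预言": 40}},
}

def word_frequency(keyword: str) -> float:
    """Look up the keyword via its index, then compute count / (count + homophone counts)."""
    entry = PRESET_DICT[KEYWORD_INDEX[keyword]]
    total = entry["count"] + sum(entry["homophones"].values())
    return entry["count"] / total if total else 0.0

for kw in KEYWORD_INDEX:
    print(kw, round(word_frequency(kw), 3))  # 语言 0.947, 寓言 0.014
```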
5. The language model-based error correction method according to claim 4, wherein the converting the corresponding first keyword into pinyin characteristics through the language model comprises:
performing pronunciation training on the first keyword according to an acoustic model in the language model to obtain a pronunciation syllable of the first keyword;
and converting the pronunciation syllables into corresponding pinyin characters according to a coding model in the language model, sequencing the pinyin characters according to the pronunciation sequence, and outputting corresponding pinyin characteristics.
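Claim 5 obtains pinyin features through an acoustic model and a coding model inside the language model. As a rough stand-in for that step, the sketch below uses the third-party pypinyin library, which is an assumption for illustration only and not the model described in the claim:

```python
# pip install pypinyin  (third-party library used here only as a stand-in
# for the acoustic and coding models of claim 5)
from pypinyin import lazy_pinyin

def pinyin_features(keyword: str) -> str:
    """Convert a keyword into an ordered pinyin feature string."""
    return " ".join(lazy_pinyin(keyword))

print(pinyin_features("语言"))  # "yu yan"
```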
6. The language model-based error correction method according to claim 5, wherein the recalling the candidate words corresponding to the first keyword by using a preset recall model according to the pinyin features, and selecting a correct word from the candidate words to correct the error of the first keyword comprises:
inquiring keywords with the same pinyin features from a homophone dictionary according to the pinyin features, to form a homophone recall set;
inquiring a corresponding inverted index according to each single character in the first keyword;
inquiring corresponding keywords from the preset dictionary according to the inverted index, to form an inverted recall set;
and calculating the intersection of the homophone recall set and the inverted recall set to obtain an error correction keyword of the first keyword, and replacing the first keyword with the error correction keyword.
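A sketch of the recall step in claim 6, assuming a small homophone dictionary keyed by pinyin features and an inverted index from single characters to keywords of the preset dictionary; all data here is illustrative:

```python
# Illustrative homophone dictionary keyed by pinyin features.
HOMOPHONE_DICT = {"yu yan": {"语言", "寓言", "预言"}}

# Illustrative inverted index: single character -> keywords of the preset dictionary containing it.
INVERTED_INDEX = {"寓": {"寓言", "公寓"}, "言": {"语言", "寓言", "预言", "言论"}}

def recall_candidates(first_keyword: str, pinyin: str) -> set[str]:
    """Intersect the homophone recall set with the inverted-index recall set."""
    homophone_set = HOMOPHONE_DICT.get(pinyin, set())
    inverted_set = set()
    for ch in first_keyword:
        inverted_set |= INVERTED_INDEX.get(ch, set())
    return homophone_set & inverted_set

print(recall_candidates("寓言", "yu yan"))  # {'语言', '寓言', '预言'} (order may vary)
```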
7. The language model-based error correction method of claim 6, wherein after calculating the intersection of the homophone recall set and the inverted recall set to obtain the error correction keyword of the first keyword, and before replacing the first keyword with the error correction keyword, the method further comprises:
calculating the editing distance between the first keyword and each keyword in the intersection, wherein the editing distance refers to the minimum number of editing operations required to convert the first keyword into the keyword in the intersection, and the keyword in the intersection with the smallest editing distance has the highest similarity to the first keyword;
and selecting the keyword with the minimum editing distance as an error correction keyword.
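A sketch of the edit-distance selection in claim 7, using a standard Levenshtein distance; the suspect keyword and the candidate intersection below are assumed examples:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert, delete, or substitute operations turning a into b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

first_keyword = "基器学习"            # assumed suspect keyword
candidates = {"机器学习", "机械学习"}  # assumed intersection of the two recall sets
correction = min(candidates, key=lambda w: edit_distance(first_keyword, w))
print(correction)  # 机器学习 (edit distance 1)
```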
8. A language model-based error correction apparatus, comprising:
the first segmentation module is used for acquiring text data to be corrected and performing word segmentation processing on the text data according to a segmentation model in the language model to obtain a first word sequence and a second word sequence;
the second segmentation module is used for calculating the probability that each keyword in the first word sequence and the second word sequence is segmented by the language model, and performing secondary word segmentation processing on the text data according to the probability to obtain a third word sequence;
the word frequency counting module is used for calculating the word frequency of each keyword in the third word sequence, wherein the word frequency is obtained by querying preset dictionary statistics with the keyword;
the judging module is used for judging whether the word frequency reaches a preset word frequency threshold value;
the conversion module is used for converting the corresponding first keyword into pinyin characteristics through the language model when the word frequency does not reach a preset word frequency threshold;
and the error correction module is used for recalling the candidate words corresponding to the first keyword by using a preset recall model according to the pinyin characteristics and selecting correct words from the candidate words to correct the first keyword.
9. A language model-based error correction apparatus, characterized in that the language model-based error correction apparatus comprises: a memory, a processor, and a computer readable program stored on the memory and executable on the processor, the computer readable program when executed by the processor implementing the language model based error correction method of any one of claims 1-7.
10. A computer readable storage medium having a computer readable program stored thereon, wherein the computer readable program, when executed by one or more processors, implements the language model-based error correction method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010164817.8A CN111428474A (en) | 2020-03-11 | 2020-03-11 | Language model-based error correction method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111428474A true CN111428474A (en) | 2020-07-17 |
Family
ID=71547472
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010164817.8A Pending CN111428474A (en) | 2020-03-11 | 2020-03-11 | Language model-based error correction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111428474A (en) |
2020-03-11: CN application CN202010164817.8A, publication CN111428474A (en), status: Pending
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112256232A (en) * | 2020-10-22 | 2021-01-22 | 海信视像科技股份有限公司 | Display device and natural language generation post-processing method |
CN112256232B (en) * | 2020-10-22 | 2023-08-15 | 海信视像科技股份有限公司 | Display device and natural language generation post-processing method |
CN112417851A (en) * | 2020-11-26 | 2021-02-26 | 新智认知数据服务有限公司 | Text error correction word segmentation method and system and electronic equipment |
CN112417851B (en) * | 2020-11-26 | 2024-05-24 | 新智认知数据服务有限公司 | Text error correction word segmentation method and system and electronic equipment |
CN113723095A (en) * | 2020-12-16 | 2021-11-30 | 北京沃东天骏信息技术有限公司 | Text auditing method and device, electronic equipment and computer readable medium |
WO2022127610A1 (en) * | 2020-12-16 | 2022-06-23 | 第四范式(北京)技术有限公司 | Text recognition result processing system, method and device |
CN112560846A (en) * | 2020-12-23 | 2021-03-26 | 北京百度网讯科技有限公司 | Error correction corpus generation method and device and electronic equipment |
CN112836039A (en) * | 2021-01-27 | 2021-05-25 | 成都网安科技发展有限公司 | Voice data processing method and device based on deep learning |
CN113011149A (en) * | 2021-03-04 | 2021-06-22 | 中国科学院自动化研究所 | Text error correction method and system |
CN113011149B (en) * | 2021-03-04 | 2024-05-14 | 中国科学院自动化研究所 | Text error correction method and system |
CN112989806A (en) * | 2021-04-07 | 2021-06-18 | 广州伟宏智能科技有限公司 | Intelligent text error correction model training method |
CN113033185A (en) * | 2021-05-28 | 2021-06-25 | 中国电子技术标准化研究院 | Standard text error correction method and device, electronic equipment and storage medium |
CN113221558A (en) * | 2021-05-28 | 2021-08-06 | 中邮信息科技(北京)有限公司 | Express delivery address error correction method and device, storage medium and electronic equipment |
CN113221558B (en) * | 2021-05-28 | 2023-09-19 | 中邮信息科技(北京)有限公司 | Express address error correction method and device, storage medium and electronic equipment |
CN114021560A (en) * | 2021-11-10 | 2022-02-08 | 竹间智能科技(上海)有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN115618867A (en) * | 2022-10-27 | 2023-01-17 | 中科星图数字地球合肥有限公司 | Address error correction method, device, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111428474A (en) | Language model-based error correction method, device, equipment and storage medium | |
US11113234B2 (en) | Semantic extraction method and apparatus for natural language, and computer storage medium | |
CN111444705A (en) | Error correction method, device, equipment and readable storage medium | |
CN111523306A (en) | Text error correction method, device and system | |
CN106708799B (en) | Text error correction method and device and terminal | |
CN111859921A (en) | Text error correction method and device, computer equipment and storage medium | |
CN113094559B (en) | Information matching method, device, electronic equipment and storage medium | |
CN111696545B (en) | Speech recognition error correction method, device and storage medium | |
CN110853625B (en) | Speech recognition model word segmentation training method and system, mobile terminal and storage medium | |
CN112396049A (en) | Text error correction method and device, computer equipment and storage medium | |
CN111079410B (en) | Text recognition method, device, electronic equipment and storage medium | |
CN113673228B (en) | Text error correction method, apparatus, computer storage medium and computer program product | |
CN111209740A (en) | Text model training method, text error correction method, electronic device and storage medium | |
CN113158687B (en) | Semantic disambiguation method and device, storage medium and electronic device | |
CN111274785A (en) | Text error correction method, device, equipment and medium | |
CN109062891B (en) | Media processing method, device, terminal and medium | |
CN113591456A (en) | Text error correction method and device, electronic equipment and storage medium | |
CN115169329A (en) | Chinese text error correction method and device based on Bert and storage medium | |
CN110929514B (en) | Text collation method, text collation apparatus, computer-readable storage medium, and electronic device | |
CN111353025A (en) | Parallel corpus processing method and device, storage medium and computer equipment | |
CN110442843B (en) | Character replacement method, system, computer device and computer readable storage medium | |
CN106407332B (en) | Search method and device based on artificial intelligence | |
CN114970538A (en) | Text error correction method and device | |
CN117422064A (en) | Search text error correction method, apparatus, computer device and storage medium | |
CN112949290A (en) | Text error correction method and device and communication equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
CB03 | Change of inventor or designer information | | Inventor after: Chen Leqing; Liu Dongyu; Zeng Zengfeng. Inventor before: Liu Dongyu; Zeng Zengfeng |
SE01 | Entry into force of request for substantive examination | ||