CN109977415A - Text error correction method and device - Google Patents

Text error correction method and device

Info

Publication number
CN109977415A
CN109977415A (application CN201910261329.6A)
Authority
CN
China
Prior art keywords
error correction
candidate
long text
character
long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910261329.6A
Other languages
Chinese (zh)
Inventor
黄腾玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910261329.6A priority Critical patent/CN109977415A/en
Publication of CN109977415A publication Critical patent/CN109977415A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a text error correction method and device. In the method, a long text to be corrected is divided into multiple segments, and at least one candidate correction segment is determined for each segment. Based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected is determined. For each of the at least one candidate long text and the long text to be corrected, the method sequentially predicts, for each character position in that long text, at least one candidate character and the candidate probability of each candidate character, and calculates an assessment score of that long text based on the predicted candidate probabilities. The correction result of the long text to be corrected is then determined based on the assessment scores of the at least one candidate long text and the assessment score of the long text to be corrected. The text error correction method provided by the embodiments of the present invention can improve the accuracy of the correction result for long text.

Description

Text error correction method and device
Technical field
The present invention relates to the technical field of text error correction, and in particular to a text error correction method and device.
Background art
In existing text error correction methods for long text, the long text is split into multiple shorter segments, each segment is corrected independently to obtain a correction result for that segment, and the per-segment correction results are then combined into the correction result of the long text. During correction, each segment corresponds to multiple candidate correction segments, and one preferred correction segment is finally chosen for each segment as its correction result.
However, in implementing the present invention, the inventor found that the prior art has at least the following problems:
Existing text error correction methods are, in essence, still correction of individual segments. Whether the corrected segments, once joined together, read fluently and whether the context is coherent is not considered by existing methods, which inevitably lowers the accuracy of the correction result for the long text.
Summary of the invention
Embodiments of the present invention aim to provide a text error correction method and device so as to improve the accuracy of the correction result for long text. The specific technical solutions are as follows:
A text error correction method, comprising:
dividing a long text to be corrected into multiple segments, and determining at least one candidate correction segment corresponding to each segment;
determining, based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected;
for each of the at least one candidate long text and the long text to be corrected, sequentially predicting, for each character position in that long text, at least one candidate character and the candidate probability of each candidate character, and calculating an assessment score of that long text based on the predicted candidate probabilities; wherein the at least one candidate character of each character position is predicted based on the characters at the other character positions of that long text;
determining the correction result of the long text to be corrected based on the assessment score of the at least one candidate long text and the assessment score of the long text to be corrected.
Optionally, the sequentially predicting, for each of the at least one candidate long text and the long text to be corrected, at least one candidate character of each character position in that long text and the candidate probability of each candidate character comprises:
for each of the at least one candidate long text and the long text to be corrected, sequentially predicting, with a preset prediction model, at least one candidate character of each character position in that long text and the candidate probability of each candidate character;
wherein the prediction model is trained based on the characters in sample long sentences and preset probabilities of those characters; the sample long sentences include preset sample long sentences and derived sample long sentences, a derived sample long sentence being a sample long sentence obtained by replacing the character at a target character position of a preset sample long sentence, and the probability of the character at the target character position of the preset sample long sentence differing from the probability of the character at the target character position of the derived sample long sentence.
Optionally, the input layer of the network structure of the prediction model uses a bidirectional recurrent neural network.
Optionally, the calculating the assessment score of that long text based on the predicted candidate probabilities comprises:
for each character position of that long text, taking, among the at least one candidate character predicted for that character position, the candidate character with the highest candidate probability as the correction character of that character position;
calculating the assessment score of that long text based on the candidate probabilities of the correction characters at the character positions.
Optionally, the calculating the assessment score of that long text based on the candidate probabilities of the correction characters at the character positions comprises:
for each character position, determining the candidate probability of the original character at that character position, and computing the logarithmic loss between the candidate probability of the correction character at that character position and the candidate probability of the original character, the original character being the character located at that character position in that long text;
summing the computed logarithmic losses and dividing the sum by the number of character positions of that long text to obtain the assessment score of that long text.
Optionally, the determining the correction result of the long text to be corrected based on the assessment score of the at least one candidate long text and the assessment score of the long text to be corrected comprises:
determining, among the at least one candidate long text, the candidate long text with the highest assessment score;
judging whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the long text to be corrected is greater than a preset threshold, and if so, taking the candidate long text with the highest assessment score as the correction result of the long text to be corrected.
Optionally, the determining at least one candidate correction segment corresponding to each segment comprises:
for each segment, obtaining, with a preset language model, multiple initial correction segments corresponding to that segment and an assessment score of each initial correction segment; wherein the language model is configured to correct an input segment and to assign an assessment score to each segment obtained by the correction;
taking at least one initial correction segment whose assessment score satisfies a first preset screening condition as the at least one candidate correction segment corresponding to that segment.
Optionally, the determining, based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected comprises:
generating, based on the determined candidate correction segments, multiple initial long texts corresponding to the long text to be corrected;
for each initial long text, determining the assessment score of that initial long text according to the assessment scores of the candidate correction segments in that initial long text;
taking the initial long texts whose assessment scores satisfy a second preset screening condition as the at least one candidate long text corresponding to the long text to be corrected.
Optionally, the determining, for each initial long text, the assessment score of that initial long text according to the assessment scores of the candidate correction segments in that initial long text comprises:
taking the product of the assessment scores of the candidate correction segments in that initial long text as the assessment score of that initial long text, or
taking the average of the assessment scores of the candidate correction segments in that initial long text as the assessment score of that initial long text.
A text error correction device, comprising:
a first determining module, configured to divide a long text to be corrected into multiple segments and to determine at least one candidate correction segment corresponding to each segment;
a second determining module, configured to determine, based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected;
a prediction module, configured to sequentially predict, for each of the at least one candidate long text and the long text to be corrected, at least one candidate character of each character position in that long text and the candidate probability of each candidate character; wherein the at least one candidate character of each character position is predicted based on the characters at the other character positions of that long text;
a computing module, configured to calculate, for each of the at least one candidate long text and the long text to be corrected, the assessment score of that long text based on the predicted candidate probabilities;
a third determining module, configured to determine the correction result of the long text to be corrected based on the assessment score of the at least one candidate long text and the assessment score of the long text to be corrected.
Optionally, the prediction module is specifically configured to:
sequentially predict, for each of the at least one candidate long text and the long text to be corrected, with a preset prediction model, at least one candidate character of each character position in that long text and the candidate probability of each candidate character;
wherein the prediction model is trained based on the characters in sample long sentences and preset probabilities of those characters; the sample long sentences include preset sample long sentences and derived sample long sentences, a derived sample long sentence being a sample long sentence obtained by replacing the character at a target character position of a preset sample long sentence, and the probability of the character at the target character position of the preset sample long sentence differing from the probability of the character at the target character position of the derived sample long sentence.
Optionally, the input layer of the network structure of the prediction model uses a bidirectional recurrent neural network.
Optionally, the computing module includes a correction submodule and a computation submodule;
the correction submodule is configured to take, for each character position of that long text, among the at least one candidate character predicted for that character position, the candidate character with the highest candidate probability as the correction character of that character position;
the computation submodule is configured to calculate the assessment score of that long text based on the probabilities of the correction characters at the character positions.
Optionally, the computation submodule is specifically configured to:
for each character position, determine the candidate probability of the original character at that character position, and compute the logarithmic loss between the probability of the correction character at that character position and the probability of the original character, the original character being the character located at that character position in that long text;
sum the computed logarithmic losses and divide the sum by the number of character positions of that long text to obtain the assessment score of that long text.
Optionally, the third determining module includes a first determining submodule and a judging submodule;
the first determining submodule is configured to determine, among the at least one candidate long text, the candidate long text with the highest assessment score;
the judging submodule is configured to judge whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the long text to be corrected is greater than a preset threshold, and if so, to take the candidate long text with the highest assessment score as the correction result of the long text to be corrected.
Optionally, the first determining module includes a segmentation submodule, a model application submodule, and a first screening submodule;
the segmentation submodule is configured to divide the long text to be corrected into multiple segments;
the model application submodule is configured to obtain, for each segment and with a preset language model, multiple initial correction segments corresponding to that segment and an assessment score of each initial correction segment; wherein the language model is configured to correct an input segment and to assign an assessment score to each segment obtained by the correction;
the first screening submodule is configured to take at least one initial correction segment whose assessment score satisfies a first preset screening condition as the at least one candidate correction segment corresponding to that segment.
Optionally, the second determining module includes a generating submodule, a second determining submodule, and a second screening submodule;
the generating submodule is configured to generate, based on the determined candidate correction segments, multiple initial long texts corresponding to the long text to be corrected;
the second determining submodule is configured to determine, for each initial long text, the assessment score of that initial long text according to the assessment scores of the candidate correction segments in that initial long text;
the second screening submodule is configured to take the initial long texts whose assessment scores satisfy a second preset screening condition as the at least one candidate long text corresponding to the long text to be corrected.
Optionally, the second determining submodule is specifically configured to:
for each initial long text, take the product of the assessment scores of the candidate correction segments in that initial long text as the assessment score of that initial long text, or take the average of the assessment scores of the candidate correction segments in that initial long text as the assessment score of that initial long text.
An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement any of the above text error correction methods when executing the program stored in the memory.
In another aspect of the present invention, a computer-readable storage medium is further provided, the computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute any of the above text error correction methods.
In another aspect of the present invention, an embodiment of the present invention further provides a computer program product containing instructions that, when run on a computer, cause the computer to execute any of the above text error correction methods.
In the text error correction method provided by the embodiments of the present invention, at least one candidate long text is determined for the long text to be corrected; for each of the at least one candidate long text and the long text to be corrected, at least one candidate character of each character position in that long text and the candidate probability of each candidate character are sequentially predicted, and the assessment score of that long text is calculated based on the predicted candidate probabilities; the correction result of the long text to be corrected is then determined based on the assessment scores of the candidate long texts and the assessment score of the long text to be corrected. Since each candidate character in a long text under assessment is predicted from the characters at the character positions other than the corresponding character position, the calculated assessment score of the long text takes into account both the semantic fluency of the long text and the coherence of its context. Therefore, the text error correction method provided by the embodiments of the present invention achieves a higher accuracy of the correction result for long text. Of course, any product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a flowchart of a text error correction method provided by an embodiment of the present invention;
Fig. 2(a) is a schematic diagram of predicting, with the prediction model, the candidate probabilities of at least one candidate character at the first character position in a long text;
Fig. 2(b) is a schematic diagram of predicting, with the prediction model, the candidate probabilities of at least one candidate character at the second character position in a long text;
Fig. 2(c) is a schematic diagram of predicting, with the prediction model, the candidate probabilities of at least one candidate character at the third character position in a long text;
Fig. 2(d) is a schematic diagram of predicting, with the prediction model, the candidate probabilities of at least one candidate character at the fourth character position in a long text;
Fig. 3 is a schematic structural diagram of a text error correction device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In order to improve the accuracy of the correction result for long text, embodiments of the present invention provide a text error correction method and device.
It should be noted that the execution subject of the text error correction method provided by the embodiments of the present invention may be a text error correction device, and the device may be applied in an electronic device having a text input function. In a specific application, the electronic device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
First, the text error correction method provided by an embodiment of the present invention is described in detail. As shown in Fig. 1, the text error correction method provided by an embodiment of the present invention may comprise the following steps:
S101: dividing the long text to be corrected into multiple segments, and determining at least one candidate correction segment corresponding to each segment.
There are various specific implementations for dividing the long text to be corrected into multiple segments. Illustratively, in one implementation, the segmentation of the long text to be corrected may be realized with a language model, and the language model may include an N-gram language model. The basic idea of text segmentation with an N-gram language model is to slide a window of size N over the text character by character, forming multiple segments of character length N, as sketched below. In another implementation, the segmentation of the long text to be corrected may be realized according to the punctuation marks in the long text to be corrected. It should be noted that any method capable of dividing the long text to be corrected into multiple segments may be applied in the text error correction method provided by the embodiments of the present invention.
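The following minimal Python sketch illustrates the sliding-window split described above; the function name and the default window size are illustrative assumptions rather than details prescribed by the patent.

```python
def split_into_segments(text: str, n: int = 3) -> list[str]:
    """Slide a window of size n over the text one character at a time,
    producing segments of character length n (the N-gram style split)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Example: split_into_segments("abcdef", 3) -> ["abc", "bcd", "cde", "def"]
```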
In addition, there are various specific implementations for determining the at least one candidate correction segment corresponding to each segment. Illustratively, in one implementation, determining the at least one candidate correction segment corresponding to each segment may include:
for each segment, obtaining, with a preset language model, multiple initial correction segments corresponding to that segment and an assessment score of each initial correction segment; wherein the language model is configured to correct an input segment and to assign an assessment score to each segment obtained by the correction;
taking at least one initial correction segment whose assessment score satisfies a first preset screening condition as the at least one candidate correction segment corresponding to that segment.
Here, the preset language model may include an N-gram language model. With the N-gram language model, the probability of each initial correction segment corresponding to a segment can be obtained; this probability represents the probability that the initial correction segment is the correct correction of the corresponding segment. It can be understood that the probabilities of all the initial correction segments corresponding to one segment sum to 1. In the embodiments of the present invention, the probability of an initial correction segment obtained with the N-gram language model may be used as the assessment score of that initial correction segment. In addition, the larger the value of N in the N-gram language model, the more initial correction segments are obtained. In practical applications, a suitable value of N may be preset based on experience or actual demand; for example, N may be 2, 3, or 4.
It can be understood that screening the initial correction segments with the first preset screening condition reduces the number of candidate correction segments, screening out initial correction segments that are unnecessary for subsequent steps. Screening the initial correction segments with the first preset screening condition may be implemented as: taking the initial correction segments whose assessment scores are greater than a preset first threshold as candidate correction segments; or sorting the initial correction segments corresponding to each segment in descending order of assessment score and selecting the top X initial correction segments from the resulting order as candidate correction segments. In the implementation that determines candidate correction segments according to the first threshold, the first threshold may be a value greater than 0 and less than 1; for example, initial correction segments whose assessment score is greater than 0.5 may be taken as candidate correction segments. In the implementation that determines candidate correction segments by sorting, X may be a positive integer greater than or equal to 2; for example, the top 3 initial correction segments may be selected from the resulting order as candidate correction segments. Both screening strategies are sketched below.
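The two screening strategies for the first preset screening condition could be sketched as follows; the dictionary layout, function names, and default values are illustrative assumptions.

```python
def screen_by_threshold(scored_segments: dict[str, float],
                        first_threshold: float = 0.5) -> list[str]:
    """Keep the initial correction segments whose assessment score is
    greater than the preset first threshold (a value between 0 and 1)."""
    return [seg for seg, score in scored_segments.items()
            if score > first_threshold]


def screen_top_x(scored_segments: dict[str, float], x: int = 3) -> list[str]:
    """Sort the initial correction segments in descending order of
    assessment score and keep the top X (X >= 2) as candidates."""
    ranked = sorted(scored_segments.items(), key=lambda kv: kv[1], reverse=True)
    return [seg for seg, _ in ranked[:x]]
```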
Illustratively, in another implementation, determining the at least one candidate correction segment corresponding to each segment may include:
for each segment, obtaining, with a preset language model, multiple initial correction segments corresponding to that segment;
taking the obtained initial correction segments as the candidate correction segments corresponding to that segment, or randomly selecting a predetermined number of segments from the obtained initial correction segments as the candidate correction segments corresponding to that segment.
It should be emphasized that the above specific implementations of determining the at least one candidate correction segment corresponding to each segment are merely exemplary and should not be construed as limiting the embodiments of the present invention.
S102: determining, based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected.
A candidate long text corresponding to the long text to be corrected is composed of at least two of the determined candidate correction segments. Specifically, for each candidate long text, the number of candidate correction segments included in that candidate long text is equal to the number of segments of the long text to be corrected, and the position, within that candidate long text, of each candidate correction segment it includes is the same as the position, within the text to be corrected, of the segment to which that candidate correction segment corresponds.
There are various specific implementations of determining, based on the determined candidate correction segments, the at least one candidate long text corresponding to the long text to be corrected. Illustratively, in one implementation, determining the at least one candidate long text corresponding to the long text to be corrected based on the determined candidate correction segments may include:
generating, based on the determined candidate correction segments, multiple initial long texts corresponding to the long text to be corrected;
for each initial long text, determining the assessment score of that initial long text according to the assessment scores of the candidate correction segments in that initial long text;
taking the initial long texts whose assessment scores satisfy a second preset screening condition as the at least one candidate long text corresponding to the long text to be corrected. It can be understood that screening the initial long texts with the second preset screening condition reduces the number of candidate long texts, screening out initial long texts that are unnecessary for subsequent steps.
An initial long text corresponding to the long text to be corrected is composed of at least two of the determined candidate correction segments. Specifically, for each initial long text: the number of candidate correction segments included in that initial long text is equal to the number of segments of the long text to be corrected; and the position, within that initial long text, of each candidate correction segment it includes is the same as the position, within the text to be corrected, of the segment to which that candidate correction segment corresponds. In practical applications, under the premise of these two conditions, as many initial long texts as possible may be generated by permutation and combination.
In addition, there are also various specific implementations of determining the assessment score of an initial long text according to the assessment scores of the candidate correction segments in that initial long text. Illustratively, in one implementation, the product of the assessment scores of the candidate correction segments in the initial long text may be taken as the assessment score of the initial long text. In another implementation, the average of the assessment scores of the candidate correction segments in the initial long text may be taken as the assessment score of the initial long text. A sketch of the generation and scoring of initial long texts follows.
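A hedged sketch of generating initial long texts by permutation and combination, together with the two scoring rules just described (product or average of segment scores). Joining segments by simple concatenation assumes a non-overlapping split of the long text; the data layout is illustrative.

```python
from itertools import product
from math import prod


def build_initial_long_texts(candidates_per_segment: list[dict[str, float]]):
    """candidates_per_segment[i] maps each candidate correction segment of the
    i-th segment to its assessment score.  Every combination that keeps one
    candidate per segment, in the original segment order, yields one initial
    long text."""
    initial_long_texts = []
    for combo in product(*(c.items() for c in candidates_per_segment)):
        segments = [seg for seg, _ in combo]
        scores = [score for _, score in combo]
        initial_long_texts.append({
            "text": "".join(segments),
            "score_product": prod(scores),               # first scoring rule
            "score_average": sum(scores) / len(scores),  # second scoring rule
        })
    return initial_long_texts
```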
In addition, screening the initial long texts with the second preset screening condition reduces the number of initial long texts, screening out initial long texts that are unnecessary for subsequent steps. Screening the initial long texts with the second preset screening condition may be implemented as: taking the initial long texts whose assessment scores are greater than a preset second threshold as candidate long texts; or sorting the initial long texts in descending order of assessment score and selecting the top Y initial long texts from the resulting order as candidate long texts. The second threshold here may be the same as or different from the first threshold used when determining the candidate correction segments in S101. Likewise, the value of Y here may be the same as or different from the value of X used when determining the candidate correction segments in S101.
S103: for each of the at least one candidate long text and the long text to be corrected, sequentially predicting, for each character position in that long text, at least one candidate character and the candidate probability of each candidate character, and calculating the assessment score of that long text based on the predicted candidate probabilities; wherein the at least one candidate character of each character position is predicted based on the characters at the other character positions of that long text.
There are various specific implementations of sequentially predicting, for each character position in a long text, the at least one candidate character and the candidate probability of each candidate character. Illustratively, in one implementation, a preset prediction model may be used to sequentially predict, for each character position in the long text, the at least one candidate character and the candidate probability of each candidate character.
The prediction model is trained based on the characters in sample long sentences and preset probabilities of those characters. The sample long sentences include preset sample long sentences and derived sample long sentences, a derived sample long sentence being a sample long sentence obtained by replacing the character at a target character position of a preset sample long sentence, and the probability of the character at the target character position of the preset sample long sentence differing from the probability of the character at the target character position of the derived sample long sentence.
In the prior art, a language model based on an RNN (Recurrent Neural Network) can predict the next character from the characters entered so far. Based on this property, an RNN-based language model can be used to correct shorter texts. For example, when the short text to be corrected is "I very number", the word "number" can be corrected to "good" according to the two preceding words "I very". However, when correcting a long text, besides correcting each character of the long text, the correctness of the entire long text must also be considered, and in a long text each character is related not only to the characters before it but also to the characters after it. Therefore, predicting the character at a character position only from the characters before that position provides an obviously insufficient basis for correcting a long text. For example, when the long text to be corrected is "I love background Tiananmen", if the characters following "I love" are predicted only from those two words, there are very many candidate characters, so the word "background" cannot be corrected. If the word "Tiananmen" is also taken into account, it is easy to predict that the characters between "I love" and "Tiananmen" should be "Beijing" rather than "background", so that "I love background Tiananmen" can be effectively corrected to "I love Beijing Tiananmen".
In the embodiments of the present invention, in order to realize the correction of long text, the prediction model may be built based on a bidirectional RNN (Bi-directional Recurrent Neural Network). A bidirectional RNN is composed of two RNNs joined together. When predicting the character at any character position of a long text, it can predict that character from both the known characters before the position and the known characters after the position, thereby realizing effective correction of the long text.
In practical applications, the network structure of the prediction model may be divided into an input layer, an intermediate layer, and an output layer. The input layer uses a bidirectional RNN, and its output is jointly determined by the two joined RNNs; the output of the input layer is connected to the input of the intermediate layer. The network structure of the intermediate layer may be determined according to the application scenario; for example, it may be a fully connected layer, and the present invention places no limitation on the network structure of the intermediate layer. The output of the intermediate layer is connected to the input of the output layer. The output layer may be a Softmax layer; with a Softmax layer, the probabilities of the candidate characters predicted for each character position of the long text sum to 1.
It can be understood that the above input layer, intermediate layer, and output layer are functional divisions and do not mean that the network structure of the prediction model has only three layers; each of the input layer, the intermediate layer, and the output layer may itself be multi-layered. For example, when the output layer is a Softmax layer, if the vector dimension of the output of the predetermined intermediate layer does not match the vector dimension of the input of the Softmax layer, a fully connected layer may be arranged between the predetermined intermediate layer and the Softmax layer, and this fully connected layer together with the predetermined intermediate layer constitutes a new intermediate layer.
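Under the functional split just described (bidirectional recurrent input layer, an intermediate fully connected layer, and a Softmax output layer), one possible sketch of the prediction model is given below in PyTorch. The choice of GRU cells, the layer sizes, and the class name are assumptions for illustration, not details fixed by the patent.

```python
import torch
import torch.nn as nn


class CharPredictionModel(nn.Module):
    """Predicts, for a masked character position, a probability distribution
    over the character vocabulary from the characters at the other positions."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Input layer: a bidirectional recurrent network (two RNNs joined together).
        self.birnn = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Intermediate layer: here a single fully connected layer (the patent
        # leaves the intermediate structure open).
        self.intermediate = nn.Linear(2 * hidden_dim, hidden_dim)
        # Output layer: projection to the vocabulary followed by Softmax, so the
        # candidate probabilities at each character position sum to 1.
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len), with the target position replaced by a mask id.
        embedded = self.embedding(char_ids)
        rnn_out, _ = self.birnn(embedded)                 # (batch, seq_len, 2*hidden_dim)
        hidden = torch.relu(self.intermediate(rnn_out))
        return torch.softmax(self.output(hidden), dim=-1)  # candidate probabilities
```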
In addition, in the training stage of the prediction model, the preset sample long sentences used to train the prediction model may be natural sentences from the corresponding vertical domain or from the general domain. For example, if a preset sample long sentence is "I love Beijing Tiananmen", its corresponding derived sample long sentences may include "I suffer Beijing Tiananmen", "I love Beijing Tianans", "I love background Tiananmen", "nest loves Beijing Tiananmen", and so on. Each character in a preset sample long sentence has a preset probability value. In a derived sample long sentence, the probability of a character identical to the preset sample long sentence may be equal to the probability of the corresponding character in the preset sample long sentence, while the probability of a character different from the preset sample long sentence may differ from the probability of the corresponding character in the preset sample long sentence. For example, the probability of every character in the preset sample long sentence may be 1, and in a derived sample long sentence a character identical to the preset sample long sentence may have probability 1 while the replaced characters (those rendered above as "suffer", "background", "nest", and the altered final character) may each have probability 0. Correspondingly, in the serving stage of the prediction model, the candidate probability of each predicted candidate character is a value between 0 and 1.
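A hedged sketch of how such training samples might be assembled, labelling every character of the preset sentence 1 and the replaced character of each derived sentence 0 as in the example above. Drawing the replacement at random from a vocabulary is a simplification; the patent's examples use visually or phonetically confusable characters.

```python
import random


def build_training_samples(preset_sentence: str, vocabulary: list[str],
                           num_derived: int = 4):
    """Return (sentence, per-character probability labels) pairs: the preset
    sample long sentence with every character labelled 1.0, plus derived
    sample long sentences in which the character at one target position is
    replaced and labelled 0.0."""
    samples = [(preset_sentence, [1.0] * len(preset_sentence))]
    for _ in range(num_derived):
        pos = random.randrange(len(preset_sentence))
        replacement = random.choice(
            [c for c in vocabulary if c != preset_sentence[pos]])
        derived = preset_sentence[:pos] + replacement + preset_sentence[pos + 1:]
        labels = [1.0] * len(preset_sentence)
        labels[pos] = 0.0  # the replaced character gets a different (lower) probability
        samples.append((derived, labels))
    return samples
```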
Fig. 2(a) is a schematic diagram of predicting, with the prediction model, the candidate probabilities of at least one candidate character at the first character position in a long text; Fig. 2(b), Fig. 2(c), and Fig. 2(d) are the corresponding schematic diagrams for the second, third, and fourth character positions. It can be seen that the prediction model works like a cloze (fill-in-the-blank) model: each time it masks the character at one character position of the long text and fills in that position, until the characters that can fill every character position of the long text have been predicted. It can also be seen from Fig. 2 that the candidate probabilities of the candidate characters corresponding to each character position sum to 1.
In addition, for each of the at least one candidate long text and the long text to be corrected, there are various specific implementations of calculating the assessment score of that long text based on the predicted candidate probabilities. Illustratively, in one implementation, calculating the assessment score of that long text based on the predicted candidate probabilities may include:
for each character position of that long text, taking, among the at least one candidate character predicted for that character position, the candidate character with the highest candidate probability as the correction character of that character position;
calculating the assessment score of that long text based on the candidate probabilities of the correction characters at the character positions.
It can be understood that, when the candidate characters at a certain character position of a long text are predicted, the predicted candidate characters may include the original character at that character position. As shown in Fig. 2(d), when the prediction model predicts the candidate characters at the fourth character position in the long text "abcd", the predicted candidate characters include "c", "d", and "b". Here "d" is the original character; since the candidate probability of "d" is 70%, it is the candidate character with the highest candidate probability, so "d" is also the correction character.
In addition, there are various specific implementations of calculating the assessment score of a long text based on the probabilities of the correction characters at the character positions. Illustratively, in one implementation, calculating the assessment score of the long text based on the probabilities of the correction characters at the character positions may include:
for each character position, determining the candidate probability of the original character at that character position, and computing the logarithmic loss between the candidate probability of the correction character at that character position and the candidate probability of the original character, the original character being the character located at that character position in that long text;
summing the computed logarithmic losses and dividing the sum by the number of character positions of that long text to obtain the assessment score of that long text.
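One plausible reading of this scoring rule is sketched below. The patent does not spell out the exact form of the "logarithmic loss", so the sign convention here (log probability of the original character minus log probability of the correction character, giving 0 when the text's own character is already the top prediction and a negative value otherwise) is an assumption chosen so that a higher assessment score means a more plausible long text, consistent with selecting the highest-scoring candidate in S104.

```python
import math


def assessment_score(original_probs: list[float],
                     correction_probs: list[float]) -> float:
    """original_probs[i]: candidate probability of the character actually at
    position i of this long text; correction_probs[i]: candidate probability
    of the correction character (the highest-probability candidate) at
    position i.  The per-position logarithmic losses are summed and divided
    by the number of character positions."""
    eps = 1e-9  # floor so an original character with probability 0 stays finite
    losses = [math.log(max(o, eps)) - math.log(max(c, eps))
              for o, c in zip(original_probs, correction_probs)]
    return sum(losses) / len(original_probs)
```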
For each character position, there are various specific implementations of determining the candidate probability of the original character at that character position. Illustratively, in one implementation, the candidate characters predicted for each character position may include the original character, so the candidate probability of the original character can be determined directly.
It should be noted that, in the embodiments of the present invention, since the candidate long texts satisfy the first screening condition and the second screening condition, it rarely happens that the predicted candidate characters for a position in a candidate long text do not include the original character at that position. For the long text to be corrected, however, such a case may occur if the character at some character position is relatively uncommon. For example, suppose the long text to be corrected is "I ? Beijing Tiananmen", where the second character is an uncommon one; the candidate characters predicted for the second character position (for example "love" and "suffer") may then not include that original character. In such a case, the candidate probability of the original character may be set to a low value, such as 0.
S104: determining the correction result of the long text to be corrected based on the assessment scores of the at least one candidate long text and the assessment score of the long text to be corrected.
There are various specific implementations of determining the correction result of the long text to be corrected based on the assessment scores of the at least one candidate long text and the assessment score of the long text to be corrected. Illustratively, in one implementation, determining the correction result of the long text to be corrected based on the assessment scores of the at least one candidate long text and the assessment score of the long text to be corrected may include:
determining, among the at least one candidate long text, the candidate long text with the highest assessment score;
judging whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the long text to be corrected is greater than a preset threshold, and if so, taking the candidate long text with the highest assessment score as the correction result of the long text to be corrected.
The preset threshold here may be a value greater than or equal to 0.
It can be understood that, when the judgment result is that the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the long text to be corrected is not greater than the preset threshold, it indicates that the long text to be corrected does not need to be corrected.
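A minimal sketch of this S104 decision rule under the scoring convention above; the function name and the default threshold are illustrative assumptions.

```python
def choose_correction(candidate_scores: dict[str, float],
                      text_to_correct: str, original_score: float,
                      threshold: float = 0.1) -> str:
    """Pick the candidate long text with the highest assessment score and
    return it only if it beats the long text to be corrected by more than
    the preset threshold; otherwise keep the text unchanged."""
    best_text, best_score = max(candidate_scores.items(), key=lambda kv: kv[1])
    if best_score - original_score > threshold:
        return best_text
    return text_to_correct
```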
In the text error correction method provided by the embodiments of the present invention, at least one candidate long text is determined for the long text to be corrected; for each of the at least one candidate long text and the long text to be corrected, at least one candidate character of each character position in that long text and the candidate probability of each candidate character are sequentially predicted, and the assessment score of that long text is calculated based on the predicted candidate probabilities; the correction result of the long text to be corrected is then determined based on the assessment scores of the candidate long texts and the assessment score of the long text to be corrected. Since each candidate character in a long text under assessment is predicted from the characters at the character positions other than the corresponding character position, the calculated assessment score of the long text takes into account both the semantic fluency of the long text and the coherence of its context. Therefore, the text error correction method provided by the embodiments of the present invention achieves a higher accuracy of the correction result for long text.
Corresponding to the above text error correction method, the present invention further provides a text error correction device. As shown in Fig. 3, the text error correction device may include:
a first determining module 301, configured to divide a long text to be corrected into multiple segments and to determine at least one candidate correction segment corresponding to each segment;
a second determining module 302, configured to determine, based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected;
a prediction module 303, configured to sequentially predict, for each of the at least one candidate long text and the long text to be corrected, at least one candidate character of each character position in that long text and the candidate probability of each candidate character; wherein the at least one candidate character of each character position is predicted based on the characters at the other character positions of that long text;
a computing module 304, configured to calculate, for each of the at least one candidate long text and the long text to be corrected, the assessment score of that long text based on the predicted candidate probabilities;
a third determining module 305, configured to determine the correction result of the long text to be corrected based on the assessment score of the at least one candidate long text and the assessment score of the long text to be corrected.
Optionally, in one implementation, the prediction module 303 may be specifically configured to:
sequentially predict, for each of the at least one candidate long text and the long text to be corrected, with a preset prediction model, at least one candidate character of each character position in that long text and the candidate probability of each candidate character;
wherein the prediction model is trained based on the characters in sample long sentences and preset probabilities of those characters; the sample long sentences include preset sample long sentences and derived sample long sentences, a derived sample long sentence being a sample long sentence obtained by replacing the character at a target character position of a preset sample long sentence, and the probability of the character at the target character position of the preset sample long sentence differing from the probability of the character at the target character position of the derived sample long sentence.
Optionally, in one implementation, the input layer of the network structure of the prediction model may use a bidirectional recurrent neural network.
Optionally, in one implementation, the computing module 304 may include a correction submodule and a computation submodule;
wherein the correction submodule is configured to take, for each character position of that long text, among the at least one candidate character predicted for that character position, the candidate character with the highest candidate probability as the correction character of that character position;
the computation submodule is configured to calculate the assessment score of that long text based on the probabilities of the correction character and the original character at each character position.
Optionally, in one implementation, the computation submodule may be specifically configured to:
for each character position, determine the candidate probability of the original character at that character position, and compute the logarithmic loss between the probability of the correction character at that character position and the probability of the original character, the original character being the character located at that character position in that long text;
sum the computed logarithmic losses and divide the sum by the number of character positions of that long text to obtain the assessment score of that long text.
Optionally, in one implementation, the third determining module 305 may include a first determining submodule and a judging submodule;
the first determining submodule is configured to determine, among the at least one candidate long text, the candidate long text with the highest assessment score;
the judging submodule is configured to judge whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the long text to be corrected is greater than a preset threshold, and if so, to take the candidate long text with the highest assessment score as the correction result of the long text to be corrected.
Optionally, in one implementation, the first determining module 301 may include a segmentation submodule, a model application submodule, and a first screening submodule;
the segmentation submodule is configured to divide the long text to be corrected into multiple segments;
the model application submodule is configured to obtain, for each segment and with a preset language model, multiple initial correction segments corresponding to that segment and an assessment score of each initial correction segment; wherein the language model is configured to correct an input segment and to assign an assessment score to each segment obtained by the correction;
the first screening submodule is configured to take at least one initial correction segment whose assessment score satisfies a first preset screening condition as the at least one candidate correction segment corresponding to that segment.
Optionally, in one implementation, the second determining module 302 may include a generating submodule, a second determining submodule, and a second screening submodule;
the generating submodule is configured to generate, based on the determined candidate correction segments, multiple initial long texts corresponding to the long text to be corrected;
the second determining submodule is configured to determine, for each initial long text, the assessment score of that initial long text according to the assessment scores of the candidate correction segments in that initial long text;
the second screening submodule is configured to take the initial long texts whose assessment scores satisfy a second preset screening condition as the at least one candidate long text corresponding to the long text to be corrected.
Optionally, in one implementation, the second determining submodule may be specifically configured to:
for each initial long text, take the product of the assessment scores of the candidate correction segments in that initial long text as the assessment score of that initial long text, or take the average of the assessment scores of the candidate correction segments in that initial long text as the assessment score of that initial long text.
In the text error correction device provided by the embodiments of the present invention, at least one candidate long text is determined for the long text to be corrected; for each of the at least one candidate long text and the long text to be corrected, at least one candidate character of each character position in that long text and the candidate probability of each candidate character are sequentially predicted, and the assessment score of that long text is calculated based on the predicted candidate probabilities; the correction result of the long text to be corrected is then determined based on the assessment scores of the candidate long texts and the assessment score of the long text to be corrected. Since each candidate character in a long text under assessment is predicted from the characters at the character positions other than the corresponding character position, the calculated assessment score of the long text takes into account both the semantic fluency of the long text and the coherence of its context. Therefore, the text error correction device provided by the embodiments of the present invention achieves a higher accuracy of the correction result for long text.
An embodiment of the present invention further provides an electronic device. As shown in Fig. 4, the electronic device includes a processor 401, a communication interface 402, a memory 403, and a communication bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with one another via the communication bus 404.
The memory 403 is configured to store a computer program.
The processor 401 is configured to implement the following steps when executing the program stored in the memory 403:
dividing a long text to be corrected into multiple segments, and determining at least one candidate correction segment corresponding to each segment;
determining, based on the determined candidate correction segments, at least one candidate long text corresponding to the long text to be corrected;
for each of the at least one candidate long text and the long text to be corrected, sequentially predicting, for each character position in that long text, at least one candidate character and the candidate probability of each candidate character, and calculating an assessment score of that long text based on the predicted candidate probabilities; wherein the at least one candidate character of each character position is predicted based on the characters at the other character positions of that long text;
determining the correction result of the long text to be corrected based on the assessment score of the at least one candidate long text and the assessment score of the long text to be corrected.
Optionally, the successively predicting, for each long text among the at least one candidate long text and the to-be-corrected long text, at least one candidate character of each character position in the long text and the candidate probability of the at least one candidate character of each character position comprises:
for each long text among the at least one candidate long text and the to-be-corrected long text, successively predicting, by using a preset prediction model, at least one candidate character of each character position in the long text and the candidate probability of the at least one candidate character of each character position;
wherein the prediction model is trained based on each character in sample long sentences and a preset probability of each character in the sample long sentences; the sample long sentences include a preset sample long sentence and a derivative sample long sentence, wherein the derivative sample long sentence is a sample long sentence obtained by replacing the character at a target character position of the preset sample long sentence, and the preset probability of the character at the target character position of the preset sample long sentence is different from the preset probability of the character at the target character position of the derivative sample long sentence.
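One possible way to construct such preset and derivative sample long sentences is sketched below in Python, for illustration only; the concrete probability values (1.0 for an original character, 0.1 for a replaced character) and the random replacement strategy are assumptions, not values disclosed by this application.

```python
import random

def build_training_samples(preset_sentence, vocab,
                           original_prob=1.0, replaced_prob=0.1):
    """Build one preset sample long sentence and one derivative sample long
    sentence. The derivative sample replaces the character at a target
    position, and that character is given a preset probability different
    from the one in the preset sample (values here are assumed)."""
    target_pos = random.randrange(len(preset_sentence))
    original_char = preset_sentence[target_pos]
    replacement = random.choice([c for c in vocab if c != original_char])

    derivative_sentence = (preset_sentence[:target_pos]
                           + replacement
                           + preset_sentence[target_pos + 1:])

    preset_labels = [original_prob] * len(preset_sentence)
    derivative_labels = [original_prob] * len(derivative_sentence)
    derivative_labels[target_pos] = replaced_prob  # differs from the preset sample

    return (preset_sentence, preset_labels), (derivative_sentence, derivative_labels)
```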
Optionally, the input layer of the network structure of the prediction model uses a bidirectional recurrent neural network.
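For illustration only, a minimal PyTorch sketch of a prediction model whose input layer is a bidirectional recurrent network is given below; the layer sizes and vocabulary size are arbitrary assumptions, and a practical implementation would additionally ensure (for example by masking) that each character position is predicted only from the characters at the other positions.

```python
import torch
import torch.nn as nn

class CharPredictor(nn.Module):
    """Minimal sketch: character embeddings feed a bidirectional LSTM, and a
    linear layer produces candidate probabilities for every character
    position. Sizes are illustrative assumptions only."""

    def __init__(self, vocab_size=6000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.LSTM(embed_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, char_ids):             # char_ids: (batch, seq_len)
        embedded = self.embedding(char_ids)
        hidden, _ = self.birnn(embedded)     # (batch, seq_len, 2 * hidden_dim)
        logits = self.out(hidden)
        # Candidate probabilities over the vocabulary for each position.
        return torch.softmax(logits, dim=-1)
```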
Optionally, the calculating the assessment score of the long text based on the candidate probabilities obtained by the prediction comprises:
for each character position of the long text, taking, among the at least one candidate character of the character position obtained by the prediction, the candidate character with the highest candidate probability as an error correction character of the character position;
calculating the assessment score of the long text based on the candidate probability of the error correction character at each character position.
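Selecting the error correction character of each character position as the candidate character with the highest candidate probability can be sketched as follows, for illustration only; the per-position dictionaries of candidate probabilities are an assumed data layout.

```python
def pick_error_correction_chars(candidate_probs_per_pos):
    """candidate_probs_per_pos: for each character position, a dict mapping
    candidate characters to their predicted candidate probabilities.
    Returns, per position, the candidate with the highest probability,
    which serves as that position's error correction character."""
    corrections = []
    for candidates in candidate_probs_per_pos:
        best_char = max(candidates, key=candidates.get)
        corrections.append((best_char, candidates[best_char]))
    return corrections
```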
Optionally, the calculating the assessment score of the long text based on the candidate probability of the error correction character at each character position comprises:
for each character position, determining a candidate probability of the original character at the character position, and calculating a logarithm loss between the candidate probability of the error correction character at the character position and the candidate probability of the original character, wherein the original character is the character located at the character position in the long text;
summing the calculated logarithm losses, and dividing the summed result by the number of character positions of the long text to obtain the assessment score of the long text.
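For illustration only, the following sketch computes an assessment score in this spirit; the specification does not fix the exact form of the logarithm loss between the two candidate probabilities, so the log-ratio used here is only one plausible reading.

```python
import math

def assessment_score(positions, eps=1e-12):
    """positions: list of (prob_of_correction_char, prob_of_original_char)
    pairs, one per character position of the long text. The log-ratio below
    is an assumed form of the logarithm loss between the two candidate
    probabilities; eps guards against log(0)."""
    total = 0.0
    for p_correction, p_original in positions:
        total += math.log(p_correction + eps) - math.log(p_original + eps)
    # Average over the number of character positions.
    return total / len(positions)
```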
Optionally, the determining the error correction result of the to-be-corrected long text based on the assessment score of the at least one candidate long text and the assessment score of the to-be-corrected long text comprises:
determining, among the at least one candidate long text, the candidate long text with the highest assessment score;
judging whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the to-be-corrected long text is greater than a preset threshold, and if so, taking the candidate long text with the highest assessment score as the error correction result of the to-be-corrected long text.
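The threshold comparison can be sketched as follows, for illustration only; the threshold value is an assumption, and when the difference does not exceed it the sketch simply keeps the original text.

```python
def decide_error_correction(candidate_long_texts, original_text,
                            original_text_score, threshold=0.5):
    """candidate_long_texts: list of (text, assessment_score). If the best
    candidate's score exceeds the to-be-corrected text's score by more than
    a preset threshold (value assumed here), return it as the error
    correction result; otherwise keep the original long text."""
    best_text, best_score = max(candidate_long_texts, key=lambda pair: pair[1])
    if best_score - original_text_score > threshold:
        return best_text
    return original_text
```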
Optionally, the determining at least one candidate error correction segment corresponding to each segment comprises:
for each segment, obtaining, by using a preset language model, multiple initial error correction segments corresponding to the segment and an assessment score of each initial error correction segment, wherein the language model is used to correct an input segment and assign an assessment score to each segment obtained through the correction;
taking at least one initial error correction segment whose assessment score meets a first preset screening condition as the at least one candidate error correction segment corresponding to the segment.
Optionally, the determining, based on the determined candidate error correction segments, at least one candidate long text corresponding to the to-be-corrected long text comprises:
generating, based on the determined candidate error correction segments, multiple initial long texts corresponding to the to-be-corrected long text;
for each initial long text, determining an assessment score of the initial long text according to the assessment scores of the candidate error correction segments in the initial long text;
taking the initial long texts whose assessment scores meet a second preset screening condition as the at least one candidate long text corresponding to the to-be-corrected long text.
Optionally, the determining, for each initial long text, the assessment score of the initial long text according to the assessment scores of the candidate error correction segments in the initial long text comprises:
taking the product of the assessment scores of the candidate error correction segments in the initial long text as the assessment score of the initial long text; or
taking the average of the assessment scores of the candidate error correction segments in the initial long text as the assessment score of the initial long text.
With the electronic device provided in the embodiments of the present invention, at least one candidate long text is determined for the to-be-corrected long text; for each long text among the at least one candidate long text and the to-be-corrected long text, at least one candidate character of each character position in the long text and the candidate probability of the at least one candidate character of each character position are successively predicted, and the assessment score of the long text is calculated based on the candidate probabilities obtained by the prediction; the error correction result of the to-be-corrected long text is then determined based on the assessment scores of the candidate long texts and the assessment score of the to-be-corrected long text. Since each candidate character in a long text to be assessed is predicted from the characters at the other character positions, the calculated assessment score of the long text takes into account both the semantic fluency of the long text and the cohesion of its context. Therefore, the electronic device provided in the embodiments of the present invention achieves a higher accuracy rate of the error correction result for long texts.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, the bus is represented by only one thick line in the figure, which does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a random access memory (RAM), and may also include a non-volatile memory, for example, at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment provided by the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the text error correction method described in any of the above embodiments.
In another embodiment provided by the present invention, a computer program product containing instructions is further provided, which, when run on a computer, causes the computer to execute the text error correction method described in any of the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state disk (SSD)), etc.
It should be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive and related manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device and electronic device embodiments are described relatively briefly because they are substantially similar to the method embodiment, and reference may be made to the description of the method embodiment for relevant parts.
The above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (19)

1. A text error correction method, characterized by comprising:
dividing a to-be-corrected long text into multiple segments, and determining at least one candidate error correction segment corresponding to each segment;
determining, based on the determined candidate error correction segments, at least one candidate long text corresponding to the to-be-corrected long text;
for each long text among the at least one candidate long text and the to-be-corrected long text, successively predicting at least one candidate character of each character position in the long text and a candidate probability of the at least one candidate character of each character position, and calculating an assessment score of the long text based on the candidate probabilities obtained by the prediction; wherein the at least one candidate character of each character position is a character predicted based on the characters at the other character positions of the long text;
determining an error correction result of the to-be-corrected long text based on the assessment score of the at least one candidate long text and the assessment score of the to-be-corrected long text.
2. The method according to claim 1, characterized in that the successively predicting, for each long text among the at least one candidate long text and the to-be-corrected long text, at least one candidate character of each character position in the long text and a candidate probability of the at least one candidate character of each character position comprises:
for each long text among the at least one candidate long text and the to-be-corrected long text, successively predicting, by using a preset prediction model, at least one candidate character of each character position in the long text and a candidate probability of the at least one candidate character of each character position;
wherein the prediction model is trained based on each character in sample long sentences and a preset probability of each character in the sample long sentences; the sample long sentences include a preset sample long sentence and a derivative sample long sentence, wherein the derivative sample long sentence is a sample long sentence obtained by replacing the character at a target character position of the preset sample long sentence, and the preset probability of the character at the target character position of the preset sample long sentence is different from the preset probability of the character at the target character position of the derivative sample long sentence.
3. The method according to claim 2, characterized in that the input layer of the network structure of the prediction model uses a bidirectional recurrent neural network.
4. The method according to claim 1, characterized in that the calculating the assessment score of the long text based on the candidate probabilities obtained by the prediction comprises:
for each character position of the long text, taking, among the at least one candidate character of the character position obtained by the prediction, the candidate character with the highest candidate probability as an error correction character of the character position;
calculating the assessment score of the long text based on the candidate probability of the error correction character at each character position.
5. The method according to claim 4, characterized in that the calculating the assessment score of the long text based on the candidate probability of the error correction character at each character position comprises:
for each character position, determining a candidate probability of the original character at the character position, and calculating a logarithm loss between the candidate probability of the error correction character at the character position and the candidate probability of the original character, wherein the original character is the character located at the character position in the long text;
summing the calculated logarithm losses, and dividing the summed result by the number of character positions of the long text to obtain the assessment score of the long text.
6. The method according to any one of claims 1-5, characterized in that the determining the error correction result of the to-be-corrected long text based on the assessment score of the at least one candidate long text and the assessment score of the to-be-corrected long text comprises:
determining, among the at least one candidate long text, the candidate long text with the highest assessment score;
judging whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the to-be-corrected long text is greater than a preset threshold, and if so, taking the candidate long text with the highest assessment score as the error correction result of the to-be-corrected long text.
7. The method according to claim 1, characterized in that the determining at least one candidate error correction segment corresponding to each segment comprises:
for each segment, obtaining, by using a preset language model, multiple initial error correction segments corresponding to the segment and an assessment score of each initial error correction segment, wherein the language model is used to correct an input segment and assign an assessment score to each segment obtained through the correction;
taking at least one initial error correction segment whose assessment score meets a first preset screening condition as the at least one candidate error correction segment corresponding to the segment.
8. The method according to claim 7, characterized in that the determining, based on the determined candidate error correction segments, at least one candidate long text corresponding to the to-be-corrected long text comprises:
generating, based on the determined candidate error correction segments, multiple initial long texts corresponding to the to-be-corrected long text;
for each initial long text, determining an assessment score of the initial long text according to the assessment scores of the candidate error correction segments in the initial long text;
taking the initial long texts whose assessment scores meet a second preset screening condition as the at least one candidate long text corresponding to the to-be-corrected long text.
9. The method according to claim 8, characterized in that the determining, for each initial long text, the assessment score of the initial long text according to the assessment scores of the candidate error correction segments in the initial long text comprises:
taking the product of the assessment scores of the candidate error correction segments in the initial long text as the assessment score of the initial long text; or
taking the average of the assessment scores of the candidate error correction segments in the initial long text as the assessment score of the initial long text.
10. A text error correction device, characterized by comprising:
a first determining module, configured to divide a to-be-corrected long text into multiple segments and determine at least one candidate error correction segment corresponding to each segment;
a second determining module, configured to determine, based on the determined candidate error correction segments, at least one candidate long text corresponding to the to-be-corrected long text;
a prediction module, configured to, for each long text among the at least one candidate long text and the to-be-corrected long text, successively predict at least one candidate character of each character position in the long text and a candidate probability of the at least one candidate character of each character position, wherein the at least one candidate character of each character position is a character predicted based on the characters at the other character positions of the long text;
a calculating module, configured to, for each long text among the at least one candidate long text and the to-be-corrected long text, calculate an assessment score of the long text based on the candidate probabilities obtained by the prediction;
a third determining module, configured to determine an error correction result of the to-be-corrected long text based on the assessment score of the at least one candidate long text and the assessment score of the to-be-corrected long text.
11. The device according to claim 10, characterized in that the prediction module is specifically configured to:
for each long text among the at least one candidate long text and the to-be-corrected long text, successively predict, by using a preset prediction model, at least one candidate character of each character position in the long text and a candidate probability of the at least one candidate character of each character position;
wherein the prediction model is trained based on each character in sample long sentences and a preset probability of each character in the sample long sentences; the sample long sentences include a preset sample long sentence and a derivative sample long sentence, wherein the derivative sample long sentence is a sample long sentence obtained by replacing the character at a target character position of the preset sample long sentence, and the preset probability of the character at the target character position of the preset sample long sentence is different from the preset probability of the character at the target character position of the derivative sample long sentence.
12. The device according to claim 11, characterized in that the input layer of the network structure of the prediction model uses a bidirectional recurrent neural network.
13. The device according to claim 10, characterized in that the calculating module includes an error correction submodule and a calculating submodule;
the error correction submodule is configured to, for each character position of the long text, take, among the at least one candidate character of the character position obtained by the prediction, the candidate character with the highest candidate probability as an error correction character of the character position;
the calculating submodule is configured to calculate the assessment score of the long text based on the candidate probability of the error correction character at each character position.
14. The device according to claim 13, characterized in that the calculating submodule is specifically configured to:
for each character position, determine a candidate probability of the original character at the character position, and calculate a logarithm loss between the candidate probability of the error correction character at the character position and the candidate probability of the original character, wherein the original character is the character located at the character position in the long text;
sum the calculated logarithm losses, and divide the summed result by the number of character positions of the long text to obtain the assessment score of the long text.
15. The device according to any one of claims 10-14, characterized in that the third determining module includes a first determining submodule and a judging submodule;
the first determining submodule is configured to determine, among the at least one candidate long text, the candidate long text with the highest assessment score;
the judging submodule is configured to judge whether the difference between the assessment score of the candidate long text with the highest assessment score and the assessment score of the to-be-corrected long text is greater than a preset threshold, and if so, take the candidate long text with the highest assessment score as the error correction result of the to-be-corrected long text.
16. The device according to claim 10, characterized in that the first determining module includes a segmentation submodule, a model application submodule and a first screening submodule;
the segmentation submodule is configured to divide the to-be-corrected long text into multiple segments;
the model application submodule is configured to, for each segment, obtain, by using a preset language model, multiple initial error correction segments corresponding to the segment and an assessment score of each initial error correction segment, wherein the language model is used to correct an input segment and assign an assessment score to each segment obtained through the correction;
the first screening submodule is configured to take at least one initial error correction segment whose assessment score meets a first preset screening condition as the at least one candidate error correction segment corresponding to the segment.
17. The device according to claim 16, characterized in that the second determining module includes a generating submodule, a second determining submodule and a second screening submodule;
the generating submodule is configured to generate, based on the determined candidate error correction segments, multiple initial long texts corresponding to the to-be-corrected long text;
the second determining submodule is configured to, for each initial long text, determine an assessment score of the initial long text according to the assessment scores of the candidate error correction segments in the initial long text;
the second screening submodule is configured to take the initial long texts whose assessment scores meet a second preset screening condition as the at least one candidate long text corresponding to the to-be-corrected long text.
18. The device according to claim 17, characterized in that the second determining submodule is specifically configured to:
for each initial long text, take the product of the assessment scores of the candidate error correction segments in the initial long text as the assessment score of the initial long text, or take the average of the assessment scores of the candidate error correction segments in the initial long text as the assessment score of the initial long text.
19. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1-9 when executing the program stored in the memory.
CN201910261329.6A 2019-04-02 2019-04-02 A kind of text error correction method and device Pending CN109977415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910261329.6A CN109977415A (en) 2019-04-02 2019-04-02 A kind of text error correction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910261329.6A CN109977415A (en) 2019-04-02 2019-04-02 A kind of text error correction method and device

Publications (1)

Publication Number Publication Date
CN109977415A (en) 2019-07-05

Family

ID=67082453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910261329.6A Pending CN109977415A (en) 2019-04-02 2019-04-02 A kind of text error correction method and device

Country Status (1)

Country Link
CN (1) CN109977415A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473003A (en) * 2013-09-12 2013-12-25 天津三星通信技术研究有限公司 Character input error correction method and device
CN104951779A (en) * 2014-03-24 2015-09-30 中国银联股份有限公司 Method and system for identifying sales slip characters
CN106959977A (en) * 2016-01-12 2017-07-18 广州市动景计算机科技有限公司 Candidate collection computational methods and device, word error correction method and device in word input
US20180322370A1 (en) * 2017-05-05 2018-11-08 Baidu Online Network Technology (Beijing) Co., Ltd . Method and apparatus of discovering bad case based on artificial intelligence, device and storage medium
CN107704102A (en) * 2017-10-09 2018-02-16 北京新美互通科技有限公司 A kind of text entry method and device
CN109543022A (en) * 2018-12-17 2019-03-29 北京百度网讯科技有限公司 Text error correction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡熠 (Hu Yi) et al.: "An online Chinese query error correction method for search engines", 中文信息学报 (Journal of Chinese Information Processing) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949289A (en) * 2019-12-11 2021-06-11 北大方正集团有限公司 Method, device and system for detecting word stacking errors
CN111368918A (en) * 2020-03-04 2020-07-03 拉扎斯网络科技(上海)有限公司 Text error correction method and device, electronic equipment and storage medium
CN111368918B (en) * 2020-03-04 2024-01-05 拉扎斯网络科技(上海)有限公司 Text error correction method and device, electronic equipment and storage medium
CN111859907A (en) * 2020-06-11 2020-10-30 北京百度网讯科技有限公司 Character error correction method and device, electronic equipment and storage medium
CN111859907B (en) * 2020-06-11 2023-06-23 北京百度网讯科技有限公司 Text error correction method and device, electronic equipment and storage medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium

Similar Documents

Publication Publication Date Title
US11308405B2 (en) Human-computer dialogue method and apparatus
CN109977415A (en) A kind of text error correction method and device
TWI777010B (en) Prediction of information conversion rate, information recommendation method and device
CN109299344B (en) Generation method of ranking model, and ranking method, device and equipment of search results
WO2020253466A1 (en) Method and device for generating test case of user interface
CN107947951A (en) Groups of users recommends method, apparatus and storage medium and server
US20240020514A1 (en) Improper neural network input detection and handling
US20220261591A1 (en) Data processing method and apparatus
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
CN110061930B (en) Method and device for determining data flow limitation and flow limiting values
CN107220384A (en) A kind of search word treatment method, device and computing device based on correlation
CN113139052B (en) Rumor detection method and device based on graph neural network feature aggregation
CN115455171B (en) Text video mutual inspection rope and model training method, device, equipment and medium
WO2014176056A2 (en) Data classification
CN107357776B (en) Related word mining method and device
CN103164436B (en) A kind of image search method and device
CN113205189B (en) Method for training prediction model, prediction method and device
CN111565065B (en) Unmanned aerial vehicle base station deployment method and device and electronic equipment
US20230229896A1 (en) Method and computing device for determining optimal parameter
CN107748801A (en) News recommends method, apparatus, terminal device and computer-readable recording medium
CN108491451A (en) A kind of English reads article and recommends method, apparatus, electronic equipment and storage medium
CN111291792B (en) Flow data type integrated classification method and device based on double evolution
CN109582930B (en) Sliding input decoding method and device and electronic equipment
CN105468603A (en) Data selection method and apparatus
CN111897910A (en) Information pushing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190705