CN111222331B - Auxiliary decoding method and device, electronic equipment and readable storage medium

Auxiliary decoding method and device, electronic equipment and readable storage medium

Info

Publication number
CN111222331B
CN111222331B
Authority
CN
China
Prior art keywords
reverse
score
text
candidate
texts
Prior art date
Legal status
Active
Application number
CN201911418820.1A
Other languages
Chinese (zh)
Other versions
CN111222331A (en)
Inventor
吴帅
李健
武卫东
Current Assignee
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN201911418820.1A priority Critical patent/CN111222331B/en
Publication of CN111222331A publication Critical patent/CN111222331A/en
Application granted granted Critical
Publication of CN111222331B publication Critical patent/CN111222331B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists

Abstract

The invention provides an auxiliary decoding method, an auxiliary decoding device, electronic equipment and a readable storage medium. Obtaining a corpus to be decoded, decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text, respectively inputting the plurality of candidate texts into an original language model to obtain an original score of each candidate text, performing reverse order processing on the plurality of candidate texts to obtain a plurality of reverse order texts, respectively inputting the plurality of reverse order texts into a reverse NGram model to obtain a reverse score of each candidate text, obtaining an update score of each candidate text according to the original score and the reverse score of each candidate text, and determining the candidate text with the highest update score as the decoded text of the corpus to be decoded. By using the reverse NGram model and matching with the original language model, the initial scores of the decoded candidate texts are updated, the decoded texts can be obtained relatively quickly, and the accuracy of the decoded texts can be improved.

Description

Auxiliary decoding method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of data decoding technologies, and in particular, to an auxiliary decoding method and apparatus, an electronic device, and a readable storage medium.
Background
At present, with the rapid development of the information industry, a large amount of data is generated every moment, and a considerable part of it needs to be converted into text for convenient human use or storage. Decoding (data-to-text) techniques therefore have a wide market. These techniques include, but are not limited to, machine translation, speech recognition, optical character recognition, input methods, automated question answering, and the like. For these techniques, the accuracy of the natural language produced by decoding is an important performance indicator for market application, and there are also stringent requirements on speed, usually because of real-time constraints. In the prior art, data decoding is neither accurate nor fast enough.
Disclosure of Invention
The embodiment of the invention provides an auxiliary decoding method based on a reverse NGram model, aiming to improve both the accuracy and the speed of data decoding.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides an auxiliary decoding method based on an inverse NGram, where the method includes:
obtaining a corpus to be decoded, and decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text;
respectively inputting the candidate texts into an original language model to obtain an original score of each candidate text;
respectively carrying out reverse order processing on the candidate texts to obtain a plurality of reverse order texts corresponding to the candidate texts;
respectively inputting the multiple reverse-order texts into a reverse NGram model to obtain a reverse score of each candidate text;
updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain an updated score of each candidate text;
and determining the candidate text with the highest update score as the decoded text of the corpus to be decoded according to the update score of each candidate text.
Optionally, the method further comprises:
acquiring a plurality of reverse-order text samples, performing multi-round training on a reverse NGram model to be trained until a reverse score output by the reverse NGram model to be trained meets a preset condition, and finishing the training to obtain the reverse NGram model.
Optionally, obtaining a plurality of reverse order text samples comprises:
obtaining a plurality of corpus samples, cleaning the corpus samples, and segmenting the cleaned corpus samples according to a word list in a decoder to obtain a plurality of segmented text samples;
carrying out reverse order arrangement on each text sample after word segmentation by taking a word as a unit to obtain a plurality of text samples after reverse order;
and adding sentence beginning symbols and sentence ending symbols at two ends of each text sample after the reverse order to obtain a plurality of reverse order text samples.
Optionally, the candidate texts are segmented texts, each candidate text includes a beginning sentence symbol and an end sentence symbol, and the multiple candidate texts are respectively input into the original language model to obtain an original score of each candidate text, including:
respectively inputting the candidate texts into an original language model, sequentially calculating the probability of each word and the sentence end symbol in each candidate text, and calculating the original score of each candidate text according to the probability of each word and the sentence end symbol in each candidate text;
respectively inputting the multiple reverse-order texts into a reverse NGram model to obtain a reverse score of each candidate text, wherein the reverse score comprises the following steps:
and respectively inputting the candidate texts into a reverse NGram model, sequentially calculating the probability of each word and the initial sentence symbol in each reverse-order text, and calculating the reverse score of each candidate text according to the probability of each word and the initial sentence symbol in each candidate text.
Optionally, respectively performing reverse order processing on the multiple candidate texts to obtain multiple reverse order texts corresponding to the multiple candidate texts, including:
and carrying out reverse order arrangement on each candidate text after word segmentation by taking a word or a sentence initial symbol or a sentence final symbol as a unit to obtain a plurality of candidate texts after reverse order.
In a second aspect, an embodiment of the present invention provides an auxiliary decoding apparatus based on an inverse NGram, where the apparatus includes:
the decoding module is used for acquiring the corpus to be decoded, and decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text;
the original scoring module is used for respectively inputting the candidate texts into an original language model to obtain an original score of each candidate text;
the reverse order module is used for respectively performing reverse order processing on the candidate texts to obtain a plurality of reverse order texts corresponding to the candidate texts;
the reverse scoring module is used for respectively inputting the multiple reverse texts into a reverse NGram model to obtain a reverse score of each candidate text;
the score updating module is used for updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain an updated score of each candidate text;
and the determining module is used for determining the candidate text with the highest update score as the decoding text of the corpus to be decoded according to the update score of each candidate text.
Optionally, the apparatus further comprises:
and the training module is used for acquiring a plurality of reverse-order text samples, performing multi-round training on the reverse NGram model to be trained until the reverse score output by the reverse NGram model to be trained meets a preset condition, and finishing the training to obtain the reverse NGram model.
Optionally, the training module comprises:
the word segmentation sub-module is used for acquiring a plurality of corpus samples, cleaning the corpus samples, and segmenting a plurality of cleaned corpus samples according to a word list in a decoder to obtain a plurality of segmented text samples;
the first reverse order sub-module is used for performing reverse order arrangement on each word-segmented text sample by taking a word as a unit to obtain a plurality of reverse order text samples;
and the adding submodule is used for adding sentence beginning symbols and sentence ending symbols at two ends of each text sample after the reverse order to obtain a plurality of reverse order text samples.
Optionally, the candidate texts are segmented texts, each candidate text includes a beginning sentence symbol and an end sentence symbol, and the original scoring module includes:
the original scoring submodule is used for respectively inputting the candidate texts into an original language model, sequentially calculating the probability of each word and the sentence end symbol in each candidate text, and calculating the original score of each candidate text according to the probability of each word and the sentence end symbol in each candidate text;
the reverse scoring module comprises:
and the reverse scoring submodule is used for respectively inputting the candidate texts into a reverse NGram model, sequentially calculating the probability of each word and the initial sentence symbol in each reverse order text, and calculating the reverse score of each candidate text according to the probability of each word and the initial sentence symbol in each candidate text.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the reverse NGram-based auxiliary decoding method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the reverse NGram-based auxiliary decoding method according to the first aspect are implemented.
The method comprises the steps of obtaining a corpus to be decoded, decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text, respectively inputting the plurality of candidate texts into an original language model to obtain an original score of each candidate text, performing reverse order processing on the plurality of candidate texts to obtain a plurality of reverse order texts, respectively inputting the plurality of reverse order texts into a reverse NGram model to obtain a reverse score of each candidate text, updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain an updated score of each candidate text, and determining the candidate text with the highest updated score as the decoded text of the corpus to be decoded according to the updated score of each candidate text. By using the reverse NGram model and matching with the original language model, the initial scores of the decoded candidate texts are updated, the decoded texts can be obtained relatively quickly, and the accuracy of the decoded texts can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without inventive labor.
FIG. 1 is a flow chart illustrating the steps of an auxiliary decoding method based on reverse NGram according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a method for obtaining a reverse-order text sample according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an auxiliary decoding apparatus based on reverse NGram in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of an auxiliary decoding method based on reverse NGram according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step S101: and acquiring a corpus to be decoded, and decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text.
In this embodiment, the corpus to be decoded may be audio stream data or keyboard stream data, and the decoder decodes the corpus to be decoded into text.
In a feasible implementation, decoding the corpus to be decoded with the decoder yields a plurality of candidate texts. Each candidate text has a corresponding initial score, is a segmented (word-separated) text, and contains a sentence-start symbol and a sentence-end symbol, where the sentence-start symbol is <BOS> and the sentence-end symbol is <EOS>.
In this embodiment, after the corpus to be decoded is decoded by the decoder, a plurality of candidate texts are obtained. For example, for a corpus to be decoded consisting of a segment of audio stream data, the 3 candidate texts obtained after decoding and their corresponding initial scores are as follows:
(S1) <BOS> dimensional language recognition is <EOS>
(S2) <BOS> dimensional language recognition fact <EOS>
(S3) <BOS> tail feather recognition is <EOS>
where S1, S2 and S3 are the initial scores, and the three candidates are acoustically similar readings of the same audio.
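For the purpose of the illustrative sketches in the following steps, the decoder output of this example can be held in a simple list of (token sequence, initial score) pairs; this layout and the numeric scores are assumptions for illustration only, not a format prescribed by the decoder.

```python
# Hypothetical representation of the decoder output: each candidate is a
# (token list, initial score) pair. The numeric scores stand in for S1-S3.
BOS, EOS = "<BOS>", "<EOS>"

candidates = [
    ([BOS, "dimensional language", "recognition", "is", EOS],   -12.4),  # S1
    ([BOS, "dimensional language", "recognition", "fact", EOS], -13.1),  # S2
    ([BOS, "tail feather", "recognition", "is", EOS],           -12.9),  # S3
]
```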
Step S102: and respectively inputting the candidate texts into an original language model to obtain an original score of each candidate text.
In a possible implementation manner, the step S102 specifically includes:
and respectively inputting the candidate texts into an original language model, sequentially calculating the probability of each word and the sentence end symbol in each candidate text, and calculating the original score of each candidate text according to the probability of each word and the sentence end symbol in each candidate text.
In this embodiment, the obtained multiple candidate texts are respectively input into the original language model. Using only the preceding context, the original language model sequentially calculates the probability of each word and of the sentence-end symbol in each candidate text, and the original score of each candidate text is then calculated from these probabilities. The formula for calculating the original score is:
G_old = ω( ∏_{i=1}^{n} p(w_i) · p(<EOS>) )
where G_old is the original score, ω is a function that converts a probability into a weight, ∏ is the running-multiplication (product) symbol, p(<EOS>) is the probability of the sentence-end symbol <EOS>, p(w_i) is the probability of the i-th word in the candidate text, and i and n are positive integers with n the number of words. For example, if the candidate texts are:
<BOS> dimensional language recognition is <EOS>
<BOS> dimensional language recognition fact <EOS>
<BOS> tail feather recognition is <EOS>
and ω is the ln function, the original score of each candidate text is:
G1_old = ln[ p(dimensional language) · p(recognition) · p(is) · p(<EOS>) ]
G2_old = ln[ p(dimensional language) · p(recognition) · p(fact) · p(<EOS>) ]
G3_old = ln[ p(tail feather) · p(recognition) · p(is) · p(<EOS>) ]
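A minimal sketch of this original-score calculation, assuming the original language model is available as a function forward_prob(word, history) returning p(word | preceding words); that interface is an assumption for illustration, since the embodiment does not fix how the model is queried.

```python
import math

def original_score(tokens, forward_prob):
    """Compute G_old = ln( p(w_1) * ... * p(w_n) * p(<EOS>) ) for one candidate.

    `tokens` is a segmented candidate text including <BOS> and <EOS>, and
    `forward_prob(word, history)` is assumed to return the original language
    model's probability of `word` given the preceding tokens.
    """
    score = 0.0
    history = []
    for token in tokens:
        if token == "<BOS>":              # the start symbol only seeds the history
            history.append(token)
            continue
        score += math.log(forward_prob(token, tuple(history)))  # omega = ln
        history.append(token)
    return score  # the p(<EOS>) factor is included because <EOS> is the last token
```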
Step S103: and respectively carrying out reverse order processing on the candidate texts to obtain a plurality of reverse order texts corresponding to the candidate texts.
In a possible implementation manner, step S103 specifically includes:
and carrying out reverse order arrangement on each candidate text after word segmentation by taking a word or a sentence initial symbol or a sentence final symbol as a unit to obtain a plurality of candidate texts after reverse order.
For example: the candidate texts before reverse ordering are: < BOS > dimensional language identification is < EOS >
The corresponding reverse text after reverse ordering is: < EOS > is recognition of the wiki < BOS >
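Reverse-order processing itself is a plain token-level reversal; a one-function sketch:

```python
def to_reverse_order(tokens):
    """Reverse a segmented candidate text token by token.

    <BOS> and <EOS> are treated as ordinary units, so
    [<BOS>, w1, ..., wn, <EOS>] becomes [<EOS>, wn, ..., w1, <BOS>].
    """
    return list(reversed(tokens))

# Example:
# to_reverse_order(["<BOS>", "dimensional language", "recognition", "is", "<EOS>"])
# -> ["<EOS>", "is", "recognition", "dimensional language", "<BOS>"]
```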
Step S104: and respectively inputting the multiple reverse-order texts into a reverse NGram model to obtain the reverse score of each candidate text.
In a possible implementation manner, step S104 may specifically include:
and respectively inputting the candidate texts into a reverse NGram model, sequentially calculating the probability of each word and the initial sentence symbol in each reverse-order text, and calculating the reverse score of each candidate text according to the probability of each word and the initial sentence symbol in each candidate text.
In this embodiment, N in the reverse NGram model is the total number of words and sentence symbols (sentence-start or sentence-end symbols) involved in calculating the probability of one word. For example, with N = 3 the probability of a word is calculated as:
q(w_i) = q(w_i | w_{i+1}, w_{i+2})
where q(w_i) is the reverse probability of the word w_i, and i is the index of the word or sentence-start symbol in the original word order.
For example, if the reverse-order text is: <EOS> person China is I <BOS>, and the value of N in the reverse NGram model is 3, the probability of each word is calculated as follows:
q(person) = q(person | <EOS>)
q(China) = q(China | <EOS>, person)
q(is) = q(is | person, China)
q(I) = q(I | China, is)
q(<BOS>) = q(<BOS> | is, I)
If the reverse-order text is: <EOS> scheme technology of invention this <BOS>, and the value of N in the reverse NGram model is 5, the probability of each word is calculated as follows:
q(scheme) = q(scheme | <EOS>)
q(technology) = q(technology | <EOS>, scheme)
q(of) = q(of | <EOS>, scheme, technology)
q(invention) = q(invention | <EOS>, scheme, technology, of)
q(this) = q(this | scheme, technology, of, invention)
q(<BOS>) = q(<BOS> | technology, of, invention, this)
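The q(· | ·) lookups listed above can be generated mechanically from a reverse-order token list; the following sketch does exactly that (the function name and the default N are illustrative assumptions):

```python
def reverse_ngram_queries(reverse_tokens, n=3):
    """List the q(word | context) lookups for one reverse-order text.

    Each token after the leading <EOS> is conditioned on the (at most) N-1
    tokens that precede it in the reverse-order text, i.e. the tokens that
    follow it in the original word order.
    """
    queries = []
    for i in range(1, len(reverse_tokens)):            # skip the leading <EOS>
        context = tuple(reverse_tokens[max(0, i - (n - 1)):i])
        queries.append((reverse_tokens[i], context))
    return queries

# reverse_ngram_queries(["<EOS>", "person", "China", "is", "I", "<BOS>"], n=3)
# -> [("person", ("<EOS>",)), ("China", ("<EOS>", "person")),
#     ("is", ("person", "China")), ("I", ("China", "is")), ("<BOS>", ("is", "I"))]
```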
In the present embodiment, the calculation formula of the reverse score is:
G_new = ω( ∏_{i=1}^{n} q(w_i) · q(<BOS>) )
where G_new is the reverse score, ω is a function that converts a probability into a weight, ∏ is the running-multiplication (product) symbol, q(<BOS>) is the reverse probability of the sentence-start symbol <BOS>, q(w_i) is the reverse probability of a word in the candidate text, i is the index of each word, and i and n are positive integers. For example, if the reverse-order texts are:
<EOS> is recognition dimensional language <BOS>
<EOS> fact recognition dimensional language <BOS>
<EOS> is recognition tail feather <BOS>
and ω is the ln function, the corresponding reverse scores are calculated as:
G1_new = ln[ q(is) · q(recognition) · q(dimensional language) · q(<BOS>) ]
G2_new = ln[ q(fact) · q(recognition) · q(dimensional language) · q(<BOS>) ]
G3_new = ln[ q(is) · q(recognition) · q(tail feather) · q(<BOS>) ]
If N is 3 in the reverse NGram model, the factors of G1_new are:
q(is) = q(is | <EOS>)
q(recognition) = q(recognition | <EOS>, is)
q(dimensional language) = q(dimensional language | is, recognition)
q(<BOS>) = q(<BOS> | recognition, dimensional language)
The reverse probabilities of the words and sentence symbols in G2_new and G3_new are calculated in the same way and are not listed here.
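Putting the reverse lookups together gives the reverse score; in this sketch the trained reverse NGram model is assumed to be exposed as a callable reverse_prob(word, context) returning q(word | context), and the query helper sketched earlier is reused.

```python
import math

def reverse_score(reverse_tokens, reverse_prob, n=3):
    """Compute G_new = ln( q(w_1) * ... * q(w_n) * q(<BOS>) ) for one reverse-order text.

    `reverse_prob(word, context)` stands in for the trained reverse NGram
    model; the contexts are built exactly as in reverse_ngram_queries above.
    """
    score = 0.0
    for word, context in reverse_ngram_queries(reverse_tokens, n=n):
        score += math.log(reverse_prob(word, context))   # omega = ln
    return score  # the q(<BOS>) factor is included because <BOS> is the last token
```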
Step S105: and updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain the updated score of each candidate text.
In this embodiment, the initial score of each text is updated by the original score and the reverse score to obtain the update score of each candidate text, and the update formula is as follows:
S′ = S + λ(G_new − G_old)
where S′ is the updated score, S is the initial score, λ is the reverse update weight taking a value between 0 and 1, G_new is the reverse score, and G_old is the original score.
For example, taking λ = 0.5:
S′1 = S1 + 0.5 (G1_new − G1_old)
S′2 = S2 + 0.5 (G2_new − G2_old)
S′3 = S3 + 0.5 (G3_new − G3_old)
Step S106: and determining the candidate text with the highest update score as the decoded text of the corpus to be decoded according to the update score of each candidate text.
In this embodiment, the candidate text with the highest update score is determined as the decoded text of the corpus to be decoded, and output.
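Steps S105 and S106 combined: a sketch that rescores every candidate with S′ = S + λ(G_new − G_old) and returns the highest-scoring one, reusing the helpers sketched in the earlier steps; λ = 0.5 matches the worked example and is otherwise a tunable weight.

```python
def rescore_and_pick(candidates, forward_prob, reverse_prob, lam=0.5, n=3):
    """Update every candidate's initial score and return the best candidate.

    `candidates` is the (token list, initial score) list produced by the
    decoder; `forward_prob` and `reverse_prob` are the model callables assumed
    in the earlier sketches.
    """
    best_tokens, best_score = None, float("-inf")
    for tokens, initial in candidates:
        g_old = original_score(tokens, forward_prob)
        g_new = reverse_score(to_reverse_order(tokens), reverse_prob, n=n)
        updated = initial + lam * (g_new - g_old)        # S' = S + lambda*(G_new - G_old)
        if updated > best_score:
            best_tokens, best_score = tokens, updated
    return best_tokens, best_score
```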
The method comprises the steps of obtaining a corpus to be decoded, decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text, respectively inputting the plurality of candidate texts into an original language model to obtain an original score of each candidate text, performing reverse order processing on the plurality of candidate texts to obtain a plurality of reverse order texts, respectively inputting the plurality of reverse order texts into a reverse NGram model to obtain a reverse score of each candidate text, updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain an updated score of each candidate text, and determining the candidate text with the highest updated score as the decoded text of the corpus to be decoded according to the updated score of each candidate text. By using the reverse NGram model and matching with the original language model, the initial scores of the decoded candidate texts are updated, the decoded texts can be obtained relatively quickly, and the accuracy of the decoded texts can be improved.
In one possible embodiment, the method further comprises:
acquiring a plurality of reverse-order text samples, performing multi-round training on a reverse NGram model to be trained until a reverse score output by the reverse NGram model to be trained meets a preset condition, and finishing the training to obtain the reverse NGram model.
In this embodiment, multiple rounds of training are performed on the reverse NGram model to be trained using a plurality of reverse-order text samples, so that the resulting reverse NGram model can conveniently process reverse-order texts and output the corresponding reverse scores.
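The embodiment does not fix a training procedure beyond multi-round training until a preset condition is met; a conventional way to fit an NGram model is by counting n-grams over the reverse-order samples, so the following count-based sketch (with add-one smoothing, both assumptions for illustration) shows one possible realization.

```python
from collections import Counter

def fit_reverse_ngram(reverse_samples, n=3):
    """Count N-grams over reverse-order text samples and return an estimator
    for q(word | context). Add-one smoothing is an illustrative choice only.
    """
    ngram_counts, context_counts, vocab = Counter(), Counter(), set()
    for tokens in reverse_samples:
        vocab.update(tokens)
        for i in range(1, len(tokens)):
            context = tuple(tokens[max(0, i - (n - 1)):i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1

    def reverse_prob(word, context):
        context = tuple(context)[-(n - 1):]
        return (ngram_counts[context + (word,)] + 1) / (context_counts[context] + len(vocab))

    return reverse_prob
```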
Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a method for obtaining reverse-order text samples according to an embodiment of the present invention, as shown in fig. 2, in a possible implementation manner, obtaining a plurality of reverse-order text samples may include the following steps:
step S201: obtaining a plurality of corpus samples, cleaning the corpus samples, and segmenting the cleaned corpus samples according to a word list in a decoder to obtain a plurality of segmented text samples.
In this implementation, a plurality of corpus samples (in text form) are obtained and cleaned to remove symbols that the reverse NGram model to be trained cannot recognize; the cleaned corpus samples are then segmented using the word list in the decoder as the segmentation basis, yielding a plurality of segmented text samples.
For example, suppose the cleaned corpus sample is a Chinese sentence meaning "I am a Chinese person".
The segmented text sample, glossed word by word, is: I / is / China / person
Step S202: and carrying out reverse order arrangement on each text sample after word segmentation by taking a word as a unit to obtain a plurality of text samples after reverse order.
In this embodiment, each segmented text sample is arranged in reverse order by taking a word as a unit to obtain a plurality of text samples in reverse order, for example:
the text sample after word segmentation is: i am a Chinese
The text samples after the reverse order are: chinese is me
Step S203: and adding sentence beginning symbols and sentence ending symbols at two ends of each text sample after the reverse order to obtain a plurality of reverse order text samples.
In this embodiment, a sentence-start symbol and a sentence-end symbol are added to the two ends of each reversed text sample. The sentence start and sentence end are defined with respect to the original (pre-reversal) word order, so the sentence-end symbol <EOS> comes first and the sentence-start symbol <BOS> comes last in the reverse-order sample. For example:
The text sample after reverse ordering is: person / China / is / I
The reverse-order text sample with the sentence symbols added is: <EOS> person China is I <BOS>.
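A compact sketch of steps S201 to S203; the cleaning regex and the greedy longest-match segmentation are simplified stand-ins (a real system would reuse the decoder's own cleaning rules and word-list handling).

```python
import re

def make_reverse_samples(corpus_samples, vocab_words):
    """Prepare reverse-order text samples: clean, segment, reverse, add symbols."""
    samples = []
    longest = max(len(w) for w in vocab_words)
    for raw in corpus_samples:
        text = re.sub(r"[^\w]", "", raw)                  # S201: crude cleaning
        tokens, i = [], 0
        while i < len(text):                              # S201: greedy longest-match segmentation
            for size in range(min(longest, len(text) - i), 0, -1):
                if text[i:i + size] in vocab_words or size == 1:
                    tokens.append(text[i:i + size])
                    i += size
                    break
        tokens.reverse()                                  # S202: word-level reverse order
        samples.append(["<EOS>"] + tokens + ["<BOS>"])    # S203: add the sentence symbols
    return samples
```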
Referring to fig. 3, fig. 3 is a schematic diagram of an auxiliary decoding apparatus based on reverse NGram according to an embodiment of the present invention, as shown in fig. 3, the apparatus includes:
the decoding module 301 is configured to obtain a corpus to be decoded, and decode the corpus to be decoded through a decoder to obtain multiple candidate texts and an initial score of each candidate text;
an original scoring module 302, configured to input the multiple candidate texts into an original language model respectively, so as to obtain an original score of each candidate text;
a reverse order module 303, configured to perform reverse order processing on the multiple candidate texts respectively to obtain multiple reverse order texts corresponding to the multiple candidate texts;
the reverse scoring module 304 is configured to input the multiple reverse texts into a reverse NGram model respectively to obtain a reverse score of each candidate text;
a score updating module 305, configured to update the initial score of each text according to the original score and the reverse score of each candidate text, so as to obtain an updated score of each candidate text;
the determining module 306 is configured to determine, according to the update score of each candidate text, the candidate text with the highest update score as the decoded text of the corpus to be decoded.
Optionally, the apparatus further comprises:
and the training module is used for acquiring a plurality of reverse-order text samples, performing multi-round training on the reverse NGram model to be trained until the reverse score output by the reverse NGram model to be trained meets a preset condition, and finishing the training to obtain the reverse NGram model.
Optionally, the training module comprises:
the word segmentation sub-module is used for acquiring a plurality of corpus samples, cleaning the corpus samples, and segmenting a plurality of cleaned corpus samples according to a word list in a decoder to obtain a plurality of segmented text samples;
the first reverse order sub-module is used for performing reverse order arrangement on each word-segmented text sample by taking a word as a unit to obtain a plurality of reverse order text samples;
and the adding submodule is used for adding sentence beginning symbols and sentence ending symbols at two ends of each text sample after the reverse order to obtain a plurality of reverse order text samples.
Optionally, the candidate texts are segmented texts, each candidate text includes a beginning sentence symbol and an end sentence symbol, and the original scoring module includes:
the original scoring submodule is used for respectively inputting the candidate texts into an original language model, sequentially calculating the probability of each word and the sentence end symbol in each candidate text, and calculating the original score of each candidate text according to the probability of each word and the sentence end symbol in each candidate text;
the reverse scoring module comprises:
and the reverse scoring submodule is used for respectively inputting the candidate texts into a reverse NGram model, sequentially calculating the probability of each word and the initial sentence symbol in each reverse order text, and calculating the reverse score of each candidate text according to the probability of each word and the initial sentence symbol in each candidate text.
Based on the same inventive concept, another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the steps of the method according to any of the above embodiments of the present application.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method, the apparatus, the electronic device, and the readable storage medium for auxiliary decoding based on reverse NGram provided by the present application are introduced in detail above, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understanding the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. An auxiliary decoding method based on reverse NGram, the method comprising:
obtaining a corpus to be decoded, and decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text;
respectively inputting the candidate texts into an original language model to obtain an original score of each candidate text;
respectively carrying out reverse order processing on the candidate texts to obtain a plurality of reverse order texts corresponding to the candidate texts;
respectively inputting the multiple reverse-order texts into a reverse NGram model to obtain a reverse score of each candidate text;
updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain an updated score of each candidate text;
determining the candidate text with the highest update score as the decoding text of the corpus to be decoded according to the update score of each candidate text;
wherein, in the step of updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain the updated score of each candidate text, the following update formula is adopted:
S′ = S + λ(G_new − G_old)
where S′ is the updated score, S is the initial score, λ is the reverse update weight taking a value between 0 and 1, G_new is the reverse score, and G_old is the original score.
2. The method of claim 1, further comprising:
acquiring a plurality of reverse-order text samples, performing multi-round training on a reverse NGram model to be trained until a reverse score output by the reverse NGram model to be trained meets a preset condition, and finishing the training to obtain the reverse NGram model.
3. The method of claim 2, wherein obtaining a plurality of reverse order text samples comprises:
obtaining a plurality of corpus samples, cleaning the corpus samples, and segmenting the cleaned corpus samples according to a word list in a decoder to obtain a plurality of segmented text samples;
carrying out reverse order arrangement on each text sample after word segmentation by taking a word as a unit to obtain a plurality of text samples after reverse order;
and adding sentence beginning symbols and sentence ending symbols at two ends of each text sample after the reverse order to obtain a plurality of reverse order text samples.
4. The method of claim 1, wherein the candidate texts are segmented texts, each candidate text comprises a sentence beginning symbol and a sentence end symbol, and the inputting of the candidate texts into an original language model to obtain an original score of each candidate text comprises:
respectively inputting the candidate texts into an original language model, sequentially calculating the probability of each word and the sentence end symbol in each candidate text, and calculating the original score of each candidate text according to the probability of each word and the sentence end symbol in each candidate text;
respectively inputting the multiple reverse-order texts into a reverse NGram model to obtain a reverse score of each candidate text, wherein the reverse score comprises the following steps:
and respectively inputting the candidate texts into a reverse NGram model, sequentially calculating the probability of each word and the initial sentence symbol in each reverse-order text, and calculating the reverse score of each candidate text according to the probability of each word and the initial sentence symbol in each candidate text.
5. The method of claim 4, wherein performing reverse order processing on the candidate texts to obtain multiple reverse order texts corresponding to the candidate texts comprises:
and carrying out reverse order arrangement on each candidate text after word segmentation by taking a word or a sentence initial symbol or a sentence final symbol as a unit to obtain a plurality of candidate texts after reverse order.
6. An apparatus for reverse NGram-based auxiliary decoding, the apparatus comprising:
the decoding module is used for acquiring the corpus to be decoded, and decoding the corpus to be decoded through a decoder to obtain a plurality of candidate texts and an initial score of each candidate text;
the original scoring module is used for respectively inputting the candidate texts into an original language model to obtain an original score of each candidate text;
the reverse order module is used for respectively performing reverse order processing on the candidate texts to obtain a plurality of reverse order texts corresponding to the candidate texts;
the reverse scoring module is used for respectively inputting the multiple reverse texts into a reverse NGram model to obtain a reverse score of each candidate text;
the score updating module is used for updating the initial score of each text according to the original score and the reverse score of each candidate text to obtain an updated score of each candidate text;
the determining module is used for determining the candidate text with the highest updating score as the decoding text of the corpus to be decoded according to the updating score of each candidate text;
wherein, in the score updating module, the adopted updating formula is as follows:
S′ = S + λ(G_new − G_old)
where S′ is the updated score, S is the initial score, λ is the reverse update weight taking a value between 0 and 1, G_new is the reverse score, and G_old is the original score.
7. The apparatus of claim 6, further comprising:
and the training module is used for acquiring a plurality of reverse-order text samples, performing multi-round training on the reverse NGram model to be trained until the reverse score output by the reverse NGram model to be trained meets a preset condition, and finishing the training to obtain the reverse NGram model.
8. The apparatus of claim 7, wherein the training module comprises:
the word segmentation sub-module is used for acquiring a plurality of corpus samples, cleaning the corpus samples, and segmenting a plurality of cleaned corpus samples according to a word list in a decoder to obtain a plurality of segmented text samples;
the first reverse order sub-module is used for performing reverse order arrangement on each word-segmented text sample by taking a word as a unit to obtain a plurality of reverse order text samples;
and the adding submodule is used for adding sentence beginning symbols and sentence ending symbols at two ends of each text sample after the reverse order to obtain a plurality of reverse order text samples.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the reverse NGram-based auxiliary decoding method of any one of claims 1 to 5.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, carries out the steps of the reverse NGram-based auxiliary decoding method according to any one of claims 1 to 5.
CN201911418820.1A 2019-12-31 2019-12-31 Auxiliary decoding method and device, electronic equipment and readable storage medium Active CN111222331B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911418820.1A CN111222331B (en) 2019-12-31 2019-12-31 Auxiliary decoding method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911418820.1A CN111222331B (en) 2019-12-31 2019-12-31 Auxiliary decoding method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111222331A CN111222331A (en) 2020-06-02
CN111222331B true CN111222331B (en) 2021-03-26

Family

ID=70825927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911418820.1A Active CN111222331B (en) 2019-12-31 2019-12-31 Auxiliary decoding method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111222331B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560476A (en) * 2020-12-09 2021-03-26 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1570923A (en) * 2003-07-22 2005-01-26 中国科学院自动化研究所 Sentence boundary identification method in spoken language dialogue
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
CN105632495A (en) * 2015-12-30 2016-06-01 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
US10121467B1 (en) * 2016-06-30 2018-11-06 Amazon Technologies, Inc. Automatic speech recognition incorporating word usage information
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 A kind of method, apparatus and readable storage medium storing program for executing of dynamic inversely decoding
CN110413768A (en) * 2019-08-06 2019-11-05 成都信息工程大学 A kind of title of article automatic generation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514170B (en) * 2012-06-20 2017-03-29 中国移动通信集团安徽有限公司 A kind of file classification method and device of speech recognition
US9645999B1 (en) * 2016-08-02 2017-05-09 Quid, Inc. Adjustment of document relationship graphs
CN110111780B (en) * 2018-01-31 2023-04-25 阿里巴巴集团控股有限公司 Data processing method and server

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1570923A (en) * 2003-07-22 2005-01-26 中国科学院自动化研究所 Sentence boundary identification method in spoken language dialogue
CN102968989A (en) * 2012-12-10 2013-03-13 中国科学院自动化研究所 Improvement method of Ngram model for voice recognition
CN105632495A (en) * 2015-12-30 2016-06-01 百度在线网络技术(北京)有限公司 Voice recognition method and apparatus
US10121467B1 (en) * 2016-06-30 2018-11-06 Amazon Technologies, Inc. Automatic speech recognition incorporating word usage information
CN110110294A (en) * 2019-03-26 2019-08-09 北京捷通华声科技股份有限公司 A kind of method, apparatus and readable storage medium storing program for executing of dynamic inversely decoding
CN110413768A (en) * 2019-08-06 2019-11-05 成都信息工程大学 A kind of title of article automatic generation method

Also Published As

Publication number Publication date
CN111222331A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
CN110210029B (en) Method, system, device and medium for correcting error of voice text based on vertical field
CN107301860B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN107195295B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN106897559B (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN110188350B (en) Text consistency calculation method and device
CN110163181B (en) Sign language identification method and device
CN109243468B (en) Voice recognition method and device, electronic equipment and storage medium
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
CN110210028B (en) Method, device, equipment and medium for extracting domain feature words aiming at voice translation text
CN106445915B (en) New word discovery method and device
Ljubešić et al. Standardizing tweets with character-level machine translation
CN109461438B (en) Voice recognition method, device, equipment and storage medium
CN109977203B (en) Sentence similarity determining method and device, electronic equipment and readable storage medium
CN111274785A (en) Text error correction method, device, equipment and medium
CN112231451B (en) Reference word recovery method and device, conversation robot and storage medium
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN111222331B (en) Auxiliary decoding method and device, electronic equipment and readable storage medium
CN112861521A (en) Speech recognition result error correction method, electronic device, and storage medium
CN104021202A (en) Device and method for processing entries of knowledge sharing platform
CN110708619B (en) Word vector training method and device for intelligent equipment
CN110570838B (en) Voice stream processing method and device
CN111428487B (en) Model training method, lyric generation method, device, electronic equipment and medium
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN115345177A (en) Intention recognition model training method and dialogue method and device
CN113192534A (en) Address search method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant