CN112259084A - Speech recognition method, apparatus and storage medium

Speech recognition method, apparatus and storage medium

Info

Publication number
CN112259084A
CN112259084A (Application No. CN202010597703.2A)
Authority
CN
China
Prior art keywords
sentence
text
current
lattice
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010597703.2A
Other languages
Chinese (zh)
Inventor
吴川隆
邓丽萍
张超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huijun Technology Co.,Ltd.
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority: CN202010597703.2A
Publication: CN112259084A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 - Procedures using non-speech characteristics
    • G10L2015/228 - Procedures using non-speech characteristics of application context

Abstract

The disclosure provides a speech recognition method, a speech recognition apparatus, and a storage medium, and relates to the technical field of speech recognition. The disclosed speech recognition method includes: obtaining a candidate lattice from the speech signal of the current sentence; resetting a neural network model according to the preceding text corresponding to the current sentence, where the preceding text is the recognized text of one or more sentences before the current sentence; re-scoring the candidate lattice with the reset neural network model to obtain a re-scored lattice; and determining the recognized text of the current sentence from the re-scored lattice. In this way, speech recognition of the current sentence can take into account information from one or more preceding sentences, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.

Description

Speech recognition method, apparatus and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, and storage medium.
Background
Speech recognition is a key technology in systems such as voice quality inspection and human-machine dialogue, and is widely applied in fields such as logistics, finance, and industry. For example, if the speech recognition accuracy of a dialogue robot is poor, the speaker's real intention cannot be accurately understood and erroneous instructions may be issued.
Disclosure of Invention
It is an object of the present disclosure to improve the accuracy of speech recognition.
According to an aspect of some embodiments of the present disclosure, there is provided a speech recognition method including: obtaining a candidate lattice from a speech signal of a current sentence; resetting a neural network model according to the preceding text corresponding to the current sentence, where the preceding text is the recognized text of one or more sentences before the current sentence and the neural network model is trained on corpus samples that include preceding text; re-scoring the candidate lattice with the reset neural network model to obtain a re-scored lattice; and determining the recognized text of the current sentence from the re-scored lattice.
In some embodiments, the speech recognition method further comprises: storing the recognized text of the current sentence in a buffer so that it can serve as the preceding text of a subsequent sentence.
In some embodiments, the speech recognition method further comprises: obtaining the preceding text corresponding to the current sentence from the buffer.
In some embodiments, obtaining the candidate lattice from the speech signal of the current sentence comprises: performing one decoding pass on the speech signal based on an acoustic model and a language model to obtain the candidate lattice.
In some embodiments, determining the recognized text of the current sentence from the re-scored lattice comprises: performing acoustic-weight and language-weight analysis on the re-scored lattice, and taking the decoding result of the highest-scoring path as the recognized text of the current sentence.
In some embodiments, the neural network model comprises an LSTM (Long Short-Term Memory) model or a GRU (Gated Recurrent Unit) model.
In some embodiments, where the speech signal comes from a conversation, the preceding text corresponding to the current sentence includes the recognized text of the previous speaker's utterance closest to the current sentence.
In some embodiments, the speech recognition method further comprises training the neural network model with samples that include preceding text until the output of the loss function converges, including: obtaining a sample candidate lattice from the speech signal of a current sample sentence; resetting the neural network model to be trained according to the preceding sample text corresponding to the current sample sentence, where the preceding sample text is the sample text of one or more sentences before the current sample sentence; re-scoring the sample candidate lattice with the reset neural network model to be trained to obtain a re-scored sample lattice, and determining the recognized text of the current sample sentence; and determining the output of the loss function from the recognized text of the current sample sentence and the sample text of the current sample sentence.
In this way, speech recognition of the current sentence can take into account information from one or more preceding sentences, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
According to an aspect of further embodiments of the present disclosure, there is provided a speech recognition apparatus including: a decoding unit configured to obtain a candidate lattice from a speech signal of a current sentence; a reset unit configured to reset a neural network model according to the preceding text corresponding to the current sentence, where the preceding text is the recognized text of one or more sentences before the current sentence and the neural network model is trained on corpus samples that include preceding text; a re-scoring unit configured to re-score the candidate lattice with the reset neural network model to obtain a re-scored lattice; and a recognition unit configured to determine the recognized text of the current sentence from the re-scored lattice.
In some embodiments, the speech recognition apparatus further comprises a buffer unit configured to store the recognized text of the current sentence in a buffer so that it can serve as the preceding text of a subsequent sentence.
In some embodiments, the reset unit is further configured to retrieve the recognized preceding text corresponding to the current sentence from the buffer.
In some embodiments, the decoding unit is configured to decode the speech signal in one pass based on the acoustic model and the language model to obtain the candidate lattice.
In some embodiments, the recognition unit is configured to perform acoustic-weight and language-weight analysis on the re-scored lattice, and to take the decoding result of the highest-scoring path as the recognized text of the current sentence.
In some embodiments, the neural network model comprises an LSTM model or a GRU model.
In some embodiments, where the speech signal comes from a conversation, the preceding text corresponding to the current sentence includes the recognized text of the previous speaker's utterance closest to the current sentence.
In some embodiments, the speech recognition apparatus further comprises a training unit configured to train the neural network model with samples that include preceding text until the output of the loss function converges.
According to an aspect of some embodiments of the present disclosure, there is provided a speech recognition apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform any of the speech recognition methods mentioned above based on instructions stored in the memory.
This apparatus can take information from one or more sentences preceding the current sentence into account during speech recognition, thereby using prior information more fully, making the re-scoring more accurate, and improving the accuracy of speech recognition.
According to an aspect of some embodiments of the present disclosure, a computer-readable storage medium is proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement the steps of any of the speech recognition methods mentioned above.
By executing the instructions on the computer-readable storage medium, information from one or more preceding sentences can be considered in the speech recognition of the current sentence, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
fig. 1 is a flow diagram of some embodiments of a speech recognition method of the present disclosure.
FIG. 2 is a flow diagram of further embodiments of speech recognition methods of the present disclosure.
Fig. 3 is a schematic diagram of some embodiments of speech recognition devices of the present disclosure.
FIG. 4 is a schematic diagram of further embodiments of speech recognition apparatus of the present disclosure.
Fig. 5 is a schematic diagram of a speech recognition device according to still other embodiments of the present disclosure.
Detailed Description
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
A speech recognition system first performs fast decoding with a simple language model to generate a lattice, and then re-scores the lattice with a more complex language model to obtain higher recognition accuracy. The recognition rate from a single decoding pass is often low; accuracy can be further improved by re-scoring with a complex language model trained on a large corpus. High-order n-gram language models were first adopted for re-scoring; later, neural networks, with their superior modeling capability, replaced n-gram language models in lattice re-scoring schemes.
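The two-pass scheme above can be sketched in a few lines. This is a hedged, minimal illustration, not the patent's implementation: `Path`, `rescore_lattice`, and `toy_lm_logprob` are hypothetical names, and a real system would operate on lattice arcs rather than enumerated paths.

```python
# Minimal sketch of lattice re-scoring: each candidate path carries an acoustic
# score and a first-pass LM score; a stronger language model re-scores the word
# sequence (replacing the first-pass LM score), and the combined score ranks paths.
from dataclasses import dataclass
from typing import List

@dataclass
class Path:
    words: List[str]
    acoustic_score: float       # log-domain acoustic model score
    firstpass_lm_score: float   # score from the simple first-pass LM (replaced below)

def rescore_lattice(paths, lm_logprob, lm_weight=0.7):
    """Score each path with the stronger LM and return paths sorted by
    the combined (acoustic + weighted LM) score, best first."""
    rescored = []
    for p in paths:
        total = p.acoustic_score + lm_weight * lm_logprob(p.words)
        rescored.append((total, p))
    rescored.sort(key=lambda t: t[0], reverse=True)
    return rescored

# Toy stand-in for a neural LM: rewards a known phrase (illustrative only).
def toy_lm_logprob(words):
    return -1.0 * len(words) + (2.0 if words[:2] == ["hello", "world"] else 0.0)

paths = [
    Path(["hello", "world"], acoustic_score=-5.0, firstpass_lm_score=-3.0),
    Path(["hollow", "word"], acoustic_score=-4.5, firstpass_lm_score=-3.5),
]
best_score, best_path = rescore_lattice(paths, toy_lm_logprob)[0]
print(best_path.words)  # the re-scored best path
```

Here the second path wins the first pass acoustically, but the stronger LM promotes the more plausible word sequence, which is exactly the effect re-scoring is meant to have.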
The inventors found that although neural networks perform well, the related art typically re-scores only according to the relations between neighboring words, without considering the logical relations between successive sentences.
A flow diagram of some embodiments of the speech recognition method of the present disclosure is shown in fig. 1.
In step 101, a candidate lattice is obtained from the speech signal of the current sentence.
In some embodiments, the speech signal may be decoded in one pass based on the acoustic model and the language model to obtain the candidate lattice. In some embodiments, one decoding pass can be performed in any manner in the related art to obtain the original lattice network, i.e., as the candidate lattice.
In step 102, the neural network model is reset according to the recognized preceding text corresponding to the current sentence. The preceding text may be the recognized text of one or more sentences before the current sentence, for example a predetermined number of immediately preceding sentences, or the preceding paragraph. In some embodiments, paragraphs may be divided by speech-interval time or distinguished by keywords.
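The "reset" in step 102 can be pictured as re-initializing a recurrent model's state and feeding in the preceding text before scoring. The sketch below uses a toy bigram table in place of a real LSTM/GRU; the `ContextualLM` class and its probabilities are illustrative assumptions, not the patent's implementation.

```python
# Hedged sketch of resetting a language model with preceding text: state is
# re-initialized, the recognized previous sentence(s) are consumed, and scoring
# of the current sentence is then conditioned on that cross-sentence context.
class ContextualLM:
    def __init__(self, bigram_logprobs, default=-5.0):
        self.bigrams = bigram_logprobs  # {(prev_word, word): logprob}
        self.default = default
        self.history = ["<s>"]

    def reset(self, preceding_text):
        """Re-initialize state, then consume the preceding text so the
        internal history carries cross-sentence context."""
        self.history = ["<s>"] + list(preceding_text)

    def score(self, sentence):
        total, prev = 0.0, self.history[-1]
        for w in sentence:
            total += self.bigrams.get((prev, w), self.default)
            prev = w
        return total

lm = ContextualLM({("weather", "sunny"): -0.5, ("<s>", "sunny"): -4.0})
lm.reset(["how", "is", "the", "weather"])
with_context = lm.score(["sunny"])
lm.reset([])
without_context = lm.score(["sunny"])
assert with_context > without_context  # context makes "sunny" more likely
```

The same candidate sentence scores higher when the model has been reset with a relevant preceding sentence, which is the mechanism by which prior information improves re-scoring.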
In some embodiments, steps 101 and 102 may be executed in either order.
In step 103, the candidate lattice is re-scored by the reset neural network model to obtain a re-scored lattice. In some embodiments, the re-scored lattice may be analyzed for acoustic weight and language weight, and the decoding result of the highest-scoring path taken as the recognized text of the current sentence.
In step 104, the recognized text of the current sentence is determined from the re-scored lattice.
In this way, speech recognition of the current sentence can take into account information from one or more preceding sentences, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
In some embodiments, where the speech signal comes from a conversation, the preceding text corresponding to the current sentence includes the recognized text of the previous speaker's utterance closest to the current sentence. In some embodiments, a change of speaker may be detected based on voice tone.
In this way, the question-answer logic of the conversation can be fully utilized, further improving the accuracy of speech recognition.
A flow diagram of further embodiments of the speech recognition method of the present disclosure is shown in fig. 2.
In step 201, one decoding pass is performed on the speech signal based on the acoustic model and a low-order language model to obtain the candidate lattice.
In step 202, the recognized preceding text corresponding to the current sentence is retrieved from the buffer. In some embodiments, the corresponding preceding text may be retrieved from the buffer according to a predetermined policy, which may include taking the recognized text of the previous speaker's nearest utterance, or the recognized text of the previous sentence or previous paragraph.
In step 203, the neural network model is reset based on the preceding text obtained from the buffer.
In step 204, the candidate lattice is re-scored by the reset neural network model, and a re-scored lattice is obtained. In some embodiments, the neural network model comprises an LSTM model or a GRU model.
In step 205, the re-scored lattice is analyzed by acoustic weight and language weight, and the decoding result of the highest-scoring path is taken as the recognized text of the current sentence.
In step 206, the recognized text of the current sentence is stored in the buffer as the preceding text of subsequent sentences.
In this way, the recognized text can be buffered and managed in time to serve as a basis for recognizing subsequent sentences; resetting the neural network model in time allows this information to be used in analyzing and estimating the current sentence, improving the prediction accuracy of the language model.
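Steps 202 and 206 together describe a small context buffer. The sketch below is a hypothetical illustration of one such policy (preferring the other speaker's most recent utterance, matching the question-answer logic mentioned earlier); the class name and the fallback rule are assumptions, not the patent's specification.

```python
# Illustrative context buffer: each recognized sentence is stored with its
# speaker, and the preceding text for the next sentence is retrieved by a
# simple policy: the most recent utterance by a different speaker, falling
# back to the last stored sentence.
from collections import deque

class ContextBuffer:
    def __init__(self, max_sentences=10):
        self.entries = deque(maxlen=max_sentences)  # (speaker, text) pairs

    def store(self, speaker, text):
        self.entries.append((speaker, text))

    def preceding_text(self, current_speaker):
        # Prefer the other party's most recent utterance (question-answer logic).
        for speaker, text in reversed(self.entries):
            if speaker != current_speaker:
                return text
        return self.entries[-1][1] if self.entries else ""

buf = ContextBuffer()
buf.store("agent", "what is your order number")
buf.store("customer", "let me check")
print(buf.preceding_text("customer"))  # the agent's question is the context
```

Bounding the buffer (here with `deque(maxlen=...)`) keeps the context window small, which matters when the model is reset once per sentence.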
In some embodiments, the neural network model needs to be trained before speech recognition is performed by any of the methods above, and the corpus samples must include preceding text. In some embodiments, training text with preceding context may be obtained for the corresponding application scenario, and the training of the neural network ends when the result of the loss function converges and becomes stable (e.g., the output changes by less than a predetermined value). During this process, a sample candidate lattice can be obtained from the speech signal of the current sample sentence, and the neural network model is reset with the preceding sample text corresponding to the current sample sentence. In some embodiments, the preceding sample text is the sample text of one or more sentences before the current sample sentence. The sample candidate lattice is then re-scored by the reset neural network model to be trained, and the optimal recognized text is determined.
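The convergence criterion described above ("the output changes by less than a predetermined value") can be sketched as a loop. `train_one_epoch` is a hypothetical stand-in for a real training step, and the tolerance and loss values are illustrative.

```python
# Hedged sketch of training until the loss converges: stop when the change
# in the loss between consecutive epochs falls below a predetermined threshold.
def train_until_converged(train_one_epoch, tol=1e-3, max_epochs=100):
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        loss = train_one_epoch()
        if abs(prev_loss - loss) < tol:  # output change below predetermined value
            return epoch, loss
        prev_loss = loss
    return max_epochs, prev_loss

# Toy loss sequence decaying toward a floor, standing in for real training.
losses = iter([2.0, 1.0, 0.6, 0.51, 0.505, 0.5049])
epoch, final = train_until_converged(lambda: next(losses))
print(epoch, final)
```

In practice the stopping rule is usually paired with a validation-set check, but the epoch-to-epoch delta shown here matches the criterion stated in the text.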
In this way, the neural network model can be trained on corpus samples that include preceding text, so that the resulting model can exploit the logical relations between successive sentences during re-scoring, further improving the accuracy of speech recognition.
In tests on a speech test data set, the method of the disclosed embodiments reduced the PPL (perplexity) of a single-layer LSTM neural language model from 43.2 to 40.05; at the same time, lattice re-scoring improved the speech recognition accuracy by an absolute 0.7%, a significant improvement.
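For reference, the PPL metric cited above is the standard language-model perplexity: the exponentiated average negative log-probability per token. The snippet below computes it from per-token probabilities; the input numbers are illustrative, not from the patent's test set.

```python
# Perplexity: exp of the average negative log-probability per token.
# Lower is better; a uniform distribution over k outcomes has perplexity k.
import math

def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))  # uniform over 4 tokens -> 4.0
```

A drop from 43.2 to 40.05, as reported, means the contextual model is on average less "surprised" by each token of the test text.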
A schematic diagram of some embodiments of the speech recognition apparatus of the present disclosure is shown in fig. 3.
The decoding unit 301 can obtain a candidate lattice from the speech signal of the current sentence. In some embodiments, the speech signal may be decoded in one pass based on the acoustic model and the language model to obtain the candidate lattice.
The reset unit 302 can reset the neural network model according to the recognized preceding text corresponding to the current sentence. The preceding text may be the recognized text of one or more sentences before the current sentence, for example a predetermined number of immediately preceding sentences, or the preceding paragraph. In some embodiments, paragraphs may be divided by speech-interval time or distinguished by keywords.
The re-scoring unit 303 can re-score the candidate lattice through the reset neural network model to obtain a re-scored lattice. In some embodiments, the re-scored lattice may be analyzed for acoustic weight and language weight, and the decoding result of the highest-scoring path taken as the recognized text of the current sentence.
The recognition unit 304 can determine the recognized text of the current sentence from the re-scored lattice.
This apparatus can take information from one or more sentences preceding the current sentence into account during speech recognition, thereby using prior information more fully, making the re-scoring more accurate, and improving the accuracy of speech recognition.
In some embodiments, as shown in fig. 3, the speech recognition apparatus may further include a buffer unit 305 capable of storing the recognized text of the current sentence in a buffer so that it can serve as the preceding text of subsequent sentences. The reset unit 302 can obtain the recognized preceding text corresponding to the current sentence from the buffer and reset the neural network model accordingly. In some embodiments, the corresponding preceding text may be retrieved from the buffer according to a predetermined policy, which may include taking the recognized text of the previous speaker's nearest utterance, or the recognized text of the previous sentence or previous paragraph.
This apparatus can buffer and manage the recognized text in time to serve as a basis for recognizing subsequent sentences; resetting the neural network model in time allows this information to be used in analyzing and estimating the current sentence, improving the prediction accuracy of the language model.
In some embodiments, as shown in fig. 3, the speech recognition apparatus may further include a training unit 306 capable of training the neural network model until the output of the loss function converges, producing the model used by the re-scoring unit 303. Training requires corpus samples that include preceding text. In some embodiments, the training unit 306 may train based on the initial speech recognition apparatus shown in fig. 3: the corpus sample is input to the decoding unit 301, which obtains a sample candidate lattice from the speech signal of the current sample sentence; the reset unit resets the neural network model to be trained with the preceding sample text corresponding to the current sample sentence; the re-scoring unit re-scores the sample candidate lattice through the reset model to obtain a re-scored sample lattice; and the recognition unit determines the recognized text of the current sample sentence. The training unit 306 then determines the output of the loss function from the recognized text of the current sample sentence and the sample text of the current sample sentence; if the training unit 306 determines that the change in the output is smaller than a predetermined value, the output is deemed to have converged and the training of the neural network model is complete.
This apparatus can train the neural network model on corpus samples that include preceding text, so that the resulting model can exploit the logical relations between successive sentences during re-scoring, further improving the accuracy of speech recognition.
A schematic structural diagram of an embodiment of the speech recognition apparatus of the present disclosure is shown in fig. 4. The speech recognition apparatus comprises a memory 401 and a processor 402. The memory 401 may be a magnetic disk, flash memory, or any other non-volatile storage medium, and stores instructions for the embodiments of the speech recognition method described above. The processor 402 is coupled to the memory 401 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 402 is configured to execute the instructions stored in the memory, making fuller use of prior information so that the re-scoring is more accurate and the accuracy of speech recognition is improved.
In one embodiment, as shown in FIG. 5, the speech recognition apparatus 500 includes a memory 501 and a processor 502 coupled by a bus 503. The speech recognition apparatus 500 may also be connected to an external storage device 505 via a storage interface 504 to invoke external data, and to a network or another computer system (not shown) via a network interface 506; this is not described in detail here.
In this embodiment, data and instructions are stored in the memory and processed by the processor, so that prior information can be used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
In another embodiment, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiment of the speech recognition method. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Finally, it should be noted that: the above examples are intended only to illustrate the technical solutions of the present disclosure and not to limit them; although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that: modifications to the specific embodiments of the disclosure or equivalent substitutions for parts of the technical features may still be made; all such modifications are intended to be included within the scope of the claims of this disclosure without departing from the spirit thereof.

Claims (13)

1. A speech recognition method comprising:
obtaining a candidate lattice from a speech signal of a current sentence;
resetting a neural network model according to the preceding text corresponding to the current sentence, wherein the preceding text is the recognized text of one or more sentences before the current sentence, and the neural network model is trained on corpus samples that include preceding text;
re-scoring the candidate lattice through the reset neural network model to obtain a re-scored lattice; and
determining the recognized text of the current sentence according to the re-scored lattice.
2. The method of claim 1, further comprising:
storing the recognized text of the current sentence in a buffer so that it can serve as the preceding text of a subsequent sentence.
3. The method of claim 2, further comprising:
obtaining the preceding text corresponding to the current sentence from the buffer.
4. The method of claim 1, wherein the obtaining a candidate lattice from the speech signal of the current sentence comprises:
performing one decoding pass on the speech signal based on an acoustic model and a language model to obtain the candidate lattice.
5. The method of claim 1, wherein said determining the recognized text of the current sentence from the re-scored lattice comprises:
performing acoustic-weight and language-weight analysis on the re-scored lattice to obtain the decoding result of the highest-scoring path as the recognized text of the current sentence.
6. The method of claim 1, wherein the neural network model comprises an LSTM model or a GRU model.
7. The method of claim 1, wherein, in the case where the speech signal is a speech signal of a conversation,
the preceding text corresponding to the current sentence includes the recognized text of the utterance of the previous speaker closest to the current sentence.
8. The method of any of claims 1-7, further comprising:
training the neural network model with samples that include preceding text until the output of a loss function converges, comprising:
obtaining a sample candidate lattice from a speech signal of a current sample sentence;
resetting the neural network model to be trained according to the preceding sample text corresponding to the current sample sentence, wherein the preceding sample text is the sample text of one or more sentences before the current sample sentence;
re-scoring the sample candidate lattice through the reset neural network model to be trained to obtain a re-scored sample lattice, and determining the recognized text of the current sample sentence; and
determining the output of the loss function according to the recognized text of the current sample sentence and the sample text of the current sample sentence.
9. A speech recognition apparatus comprising:
a decoding unit configured to acquire a candidate lattice from a speech signal of a current sentence;
a reset unit configured to reset a neural network model according to an above text corresponding to the current sentence, wherein the above text is the recognition text of one or more sentences preceding the current sentence, and the neural network model is generated by training on corpus samples having above texts;
a re-scoring unit configured to re-score the candidate lattice through the reset neural network model to obtain a re-scored lattice; and
an identification unit configured to determine the recognition text of the current sentence according to the re-scored lattice.
10. The apparatus of claim 9, further comprising:
a cache unit configured to store the recognition text of the current sentence into a cache region to serve as the above text for a subsequent sentence.
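The units of claims 9 and 10 compose into a per-sentence pipeline: decode, reset with cached context, re-score, identify, then cache the result for the next sentence. The class below is a toy composition with injected callables standing in for the four units; none of the names come from the patent.

```python
class RecognizerPipeline:
    """Toy wiring of the decoding / reset / re-scoring / identification
    units (claim 9) plus the cache unit (claim 10)."""

    def __init__(self, decode, reset, rescore, identify):
        self._decode, self._reset = decode, reset
        self._rescore, self._identify = rescore, identify
        self._cache = []                          # above texts so far

    def recognize(self, signal):
        lattice = self._decode(signal)            # decoding unit
        model = self._reset(self._cache)          # reset unit (uses above text)
        scored = self._rescore(model, lattice)    # re-scoring unit
        text = self._identify(scored)             # identification unit
        self._cache.append(text)                  # cache unit
        return text
```

The cache makes each sentence's recognition text available as context for the next, which is the mechanism that lets the re-scoring language model condition on the conversation so far.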
11. The apparatus of claim 9 or 10, further comprising:
a training unit configured to train the neural network model with corpus samples having above texts until the output of the loss function converges.
12. A speech recognition apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-8 based on instructions stored in the memory.
13. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
CN202010597703.2A 2020-06-28 2020-06-28 Speech recognition method, apparatus and storage medium Pending CN112259084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010597703.2A CN112259084A (en) 2020-06-28 2020-06-28 Speech recognition method, apparatus and storage medium


Publications (1)

Publication Number Publication Date
CN112259084A true CN112259084A (en) 2021-01-22

Family

ID=74224197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010597703.2A Pending CN112259084A (en) 2020-06-28 2020-06-28 Speech recognition method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN112259084A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060069558A1 (en) * 2004-09-10 2006-03-30 Beattie Valerie L Sentence level analysis
JP2008181537A (en) * 2008-02-18 2008-08-07 Sony Corp Information processor, processing method, program and storage medium
CN108711422A (en) * 2018-05-14 2018-10-26 腾讯科技(深圳)有限公司 Audio recognition method, device, computer readable storage medium and computer equipment
CN110517693A (en) * 2019-08-01 2019-11-29 出门问问(苏州)信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN111145733A (en) * 2020-01-03 2020-05-12 深圳追一科技有限公司 Speech recognition method, speech recognition device, computer equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG JIAN: "Research on Recurrent Neural Network Language Model Techniques for Continuous Speech Recognition", China Master's Theses Full-text Database, Information Science and Technology, no. 7, pages 136-95 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885338A (en) * 2021-01-29 2021-06-01 深圳前海微众银行股份有限公司 Speech recognition method, apparatus, computer-readable storage medium, and program product
CN113838456A (en) * 2021-09-28 2021-12-24 科大讯飞股份有限公司 Phoneme extraction method, voice recognition method, device, equipment and storage medium
WO2023050541A1 (en) * 2021-09-28 2023-04-06 科大讯飞股份有限公司 Phoneme extraction method, speech recognition method and apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
US10741170B2 (en) Speech recognition method and apparatus
CN107301860B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN107195295B (en) Voice recognition method and device based on Chinese-English mixed dictionary
CN109887497B (en) Modeling method, device and equipment for speech recognition
JP5901001B1 (en) Method and device for acoustic language model training
US8818813B2 (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
US10902846B2 (en) Spoken language understanding apparatus and spoken language understanding method using the same
WO2018192186A1 (en) Speech recognition method and apparatus
KR101587866B1 (en) Apparatus and method for extension of articulation dictionary by speech recognition
KR20140028174A (en) Method for recognizing speech and electronic device thereof
CN110473527B (en) Method and system for voice recognition
CN114038447A (en) Training method of speech synthesis model, speech synthesis method, apparatus and medium
JP6552999B2 (en) Text correction device, text correction method, and program
JP2020042257A (en) Voice recognition method and device
CN112259084A (en) Speech recognition method, apparatus and storage medium
JP2017058507A (en) Speech recognition device, speech recognition method, and program
JP5180800B2 (en) Recording medium for storing statistical pronunciation variation model, automatic speech recognition system, and computer program
EP3953928A1 (en) Automated speech recognition confidence classifier
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN110223674B (en) Speech corpus training method, device, computer equipment and storage medium
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
US20220270637A1 (en) Utterance section detection device, utterance section detection method, and program
JP6716513B2 (en) VOICE SEGMENT DETECTING DEVICE, METHOD THEREOF, AND PROGRAM
Damavandi et al. NN-grams: Unifying neural network and n-gram language models for speech recognition
JP6082657B2 (en) Pose assignment model selection device, pose assignment device, method and program thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210526

Address after: 100176 room 1004, 10th floor, building 1, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing

Applicant after: Beijing Huijun Technology Co.,Ltd.

Address before: Room A402, 4th floor, building 2, No.18, Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: BEIJING WODONG TIANJUN INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd.
