CN112908301A - Voice recognition method, device, storage medium and equipment

Voice recognition method, device, storage medium and equipment

Info

Publication number
CN112908301A
Authority
CN
China
Prior art keywords
voice
recognition
recognition result
target
speech
Prior art date
Legal status
Pending
Application number
CN202110112058.5A
Other languages
Chinese (zh)
Inventor
申凯
高建清
Current Assignee
iFlytek Shanghai Technology Co., Ltd.
Original Assignee
iFlytek Shanghai Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by iFlytek Shanghai Technology Co., Ltd.
Priority to CN202110112058.5A
Publication of CN112908301A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Abstract

The application discloses a voice recognition method, apparatus, storage medium and device, wherein the method comprises: first acquiring a target voice to be recognized and extracting acoustic features of the target voice; then performing first recognition on the target voice according to the acoustic features of the target voice and a preset word boundary length to obtain a first recognition result, and performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is greater than the preset word boundary length; and then determining a final recognition result of the target voice based on the first recognition result and the second recognition result. In this way, the target voice is recognized both at the preset word boundary length and at the larger preset window length, and the final recognition result is determined by combining the two recognition results, so that recognition remains real-time while the recognition basis for the target voice is enriched, which reduces the delay of the recognition result and improves its accuracy.

Description

Voice recognition method, device, storage medium and equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, storage medium, and device.
Background
With continuous breakthroughs in artificial intelligence technology and the growing popularity of intelligent terminal devices, human-computer interaction has become increasingly frequent in people's daily work and life. As one of the most convenient and fast interaction modes, speech recognition is an important link in human-computer interaction.
Existing speech recognition methods usually control the future field of view, and hence the delay of the recognition result, by forced window truncation: the recognition model is started as soon as a short voice segment of fixed length has been collected or cached, so that the recognition result is presented in a streaming fashion, i.e., recognized words appear in real time while the user speaks. However, to achieve this streaming effect, the window length of the forced truncation is generally less than 600 ms, which directly causes a mismatch between model training and model use. The recognition model often has a deep hierarchical structure and is fed complete sentences of speech during training (which can be regarded as full-field-of-view training), so recognition of each speech frame can in theory see the whole sentence before and after that frame. In the test stage, however, because forced window truncation is used, recognition is not full-field: only the speech before the current frame can be seen, not the speech after it, so the future field of view is limited. This mismatch between the training and test stages of the model makes the recognition result inaccurate.
Therefore, how to improve the accuracy of the recognition result while reducing the delay of the speech recognition result is a technical problem to be solved urgently at present.
Disclosure of Invention
A primary objective of embodiments of the present application is to provide a speech recognition method, apparatus, storage medium, and device, which can reduce the delay of a recognition result and improve the accuracy of the recognition result when performing speech recognition.
The embodiment of the application provides a voice recognition method, which comprises the following steps:
acquiring target voice to be recognized;
extracting acoustic features of the target voice;
according to the acoustic characteristics of the target voice and the preset word boundary length, performing first recognition on the target voice to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is larger than the preset word boundary length;
and determining a final recognition result of the target voice according to the first recognition result and the second recognition result.
In a possible implementation manner, the performing first recognition on the target voice according to the acoustic features of the target voice and the preset word boundary length to obtain a first recognition result, and performing second recognition on the target voice according to the preset window length to obtain a second recognition result, includes:
calculating a first target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and the preset word boundary length; according to the first target phoneme, the target voice is recognized to obtain a first recognition result;
calculating a second target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and a preset window length; and recognizing the target voice according to the second target phoneme to obtain a second recognition result.
In a possible implementation manner, the determining a final recognition result of the target speech according to the first recognition result and the second recognition result includes:
calculating the distance between the second recognition result and the first recognition result;
when the distance is larger than a preset distance threshold value, taking the second recognition result as a final recognition result of the target voice; and when the distance is not larger than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
In a possible implementation manner, the performing first recognition on the target voice according to the acoustic features of the target voice and the preset word boundary length to obtain a first recognition result, and performing second recognition on the target voice according to the preset window length to obtain a second recognition result, includes:
inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result;
wherein the speech recognition model comprises an input layer, a depth convolution layer, a word boundary prediction network and a decoding layer.
In one possible implementation, the decoding layer comprises a real-time decoding layer and a full-view decoding layer; the inputting the acoustic features of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result, includes:
sequentially inputting the acoustic features of the target voice to the deep convolutional layer through the input layer; coding the acoustic characteristics of the target voice by using the depth convolution layer to obtain a voice coding result;
predicting the voice coding result by utilizing the word boundary prediction network, or predicting the acoustic characteristics of the target voice by utilizing the word boundary prediction network to obtain the preset word boundary length;
decoding the voice coding result by utilizing the real-time decoding layer according to the preset word boundary length to obtain a first recognition result of the target voice; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
In a possible implementation manner, the speech recognition model is constructed as follows:
acquiring sample voice;
extracting acoustic features of the sample voice;
and training an initial voice recognition model according to the acoustic features of the sample voice and the text recognition label corresponding to the sample voice to generate the voice recognition model.
In a possible implementation, the method further includes:
acquiring verification voice;
extracting acoustic features of the verification voice;
inputting the acoustic features of the verification voice into the voice recognition model to obtain a text recognition result of the verification voice;
and when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice, the verification voice is used as the sample voice again, and the voice recognition model is updated.
An embodiment of the present application further provides a speech recognition apparatus, including:
a first acquisition unit, configured to acquire a target voice to be recognized;
a first extraction unit, configured to extract an acoustic feature of the target speech;
the recognition unit is used for carrying out first recognition on the target voice according to the acoustic characteristics of the target voice and the preset word boundary length to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is larger than the preset word boundary length;
and the determining unit is used for determining the final recognition result of the target voice according to the first recognition result and the second recognition result.
In a possible implementation manner, the identification unit is specifically configured to:
calculating a first target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and the preset word boundary length; according to the first target phoneme, the target voice is recognized to obtain a first recognition result;
calculating a second target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and a preset window length; and recognizing the target voice according to the second target phoneme to obtain a second recognition result.
In a possible implementation manner, the determining unit includes:
the calculating subunit is used for calculating the distance between the second recognition result and the first recognition result;
the determining subunit is configured to, when the distance is greater than a preset distance threshold, take the second recognition result as a final recognition result of the target speech; and when the distance is not larger than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
In a possible implementation manner, the identification unit is specifically configured to:
inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result;
wherein the speech recognition model comprises an input layer, a depth convolution layer, a word boundary prediction network and a decoding layer.
In one possible implementation, the decoding layer comprises a real-time decoding layer and a full-view decoding layer; the identification unit includes:
the coding subunit is used for sequentially inputting the acoustic features of the target voice to the depth convolution layer through the input layer; coding the acoustic characteristics of the target voice by using the depth convolution layer to obtain a voice coding result;
the prediction subunit is configured to predict the speech coding result by using the word boundary prediction network, or predict an acoustic feature of the target speech by using the word boundary prediction network to obtain the preset word boundary length;
the decoding subunit is configured to decode, by using the real-time decoding layer and according to the preset word boundary length, the speech coding result to obtain a first recognition result of the target speech; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
In a possible implementation manner, the apparatus further includes:
a second obtaining unit configured to obtain a sample voice;
a second extraction unit, configured to extract an acoustic feature of the sample speech;
and the training unit is used for training an initial voice recognition model according to the acoustic features of the sample voice and the text recognition label corresponding to the sample voice to generate the voice recognition model.
In a possible implementation manner, the apparatus further includes:
a third acquisition unit configured to acquire a verification voice;
a third extraction unit configured to extract an acoustic feature of the verification speech;
an obtaining unit, configured to input an acoustic feature of the verification speech into the speech recognition model, and obtain a text recognition result of the verification speech;
and the updating unit is used for taking the verification voice as the sample voice again and updating the voice recognition model when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice.
An embodiment of the present application further provides a speech recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any one implementation of the above-described speech recognition method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute any implementation manner of the voice recognition method.
The embodiment of the present application further provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the above speech recognition method.
According to the voice recognition method, apparatus, storage medium and device provided by the embodiments of the present application, a target voice to be recognized is first acquired and its acoustic features are extracted; then, first recognition is performed on the target voice according to the acoustic features of the target voice and a preset word boundary length to obtain a first recognition result, and second recognition is performed on the target voice according to a preset window length to obtain a second recognition result, where the preset window length is greater than the preset word boundary length; finally, a final recognition result of the target voice can be determined based on the first recognition result and the second recognition result. Thus, in the embodiments of the present application, the target voice is recognized both at the preset word boundary length and at the larger preset window length, and the final recognition result is determined by combining the two recognition results, so that recognition remains real-time while the recognition basis for the target voice is enriched, reducing the delay of the recognition result and improving its accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a small-delay streaming branch in a real-time decoding layer according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a process of constructing a speech recognition model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart illustrating verification of a speech recognition model according to an embodiment of the present application;
fig. 6 is a schematic composition diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
Existing speech recognition methods usually perform recognition by forced window truncation, that is, the recognition model is started as soon as a short voice segment of fixed length has been collected or cached, so that the recognition result is presented in a streaming fashion. However, to achieve this streaming effect, the window length of the forced truncation in practical use is generally less than 600 ms, which directly causes a mismatch between model training and model use. The recognition model often has a deep hierarchical structure and is fed complete sentences of speech during training (which can be regarded as full-field-of-view training), so recognition of each speech frame can in theory see the whole sentence before and after that frame; in the test stage, because forced window truncation is used, recognition is not full-field, i.e., only the speech before the current frame can be seen, not the speech after it, so the future field of view is limited, and this mismatch between the training and test stages makes the recognition result inaccurate.
At present, to address the inaccuracy caused by the mismatch between the training and test stages, matched training can be adopted. However, as the network deepens, the fields of view of the model's convolution layers and attention layers are superimposed linearly, so a network with many layers is structurally unsuited to matched training; it can only be done at the data level by splitting a sentence into data blocks of several window lengths and running the network on each block separately, which multiplies the training computation and training time. Moreover, this still does not effectively solve the inaccuracy caused by the mismatch between model training and use.
In order to overcome the above drawbacks, the present application provides a speech recognition method: first, a target voice to be recognized is acquired and its acoustic features are extracted; then, first recognition is performed on the target voice according to the acoustic features and a preset word boundary length to obtain a first recognition result, and second recognition is performed on the target voice according to a preset window length to obtain a second recognition result, where the preset window length is greater than the preset word boundary length; finally, a final recognition result of the target voice can be determined based on the first recognition result and the second recognition result. In this way, the embodiments of the present application recognize the target voice both at the preset word boundary length and at the larger preset window length and combine the two recognition results to determine the final result, so that recognition remains real-time while the recognition basis is enriched, reducing the delay of the recognition result and improving its accuracy.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a speech recognition method provided in this embodiment is shown, where the method includes the following steps:
s101: and acquiring target voice to be recognized.
In this embodiment, any voice to be recognized by the present solution is defined as the target voice. The embodiment does not limit the language of the target voice; for example, the target voice may be Chinese speech, English speech, or the like. Likewise, the embodiment does not limit the length of the target voice; for example, the target voice may be one sentence or multiple sentences.
It can be understood that the target voice can be obtained by recording and the like according to actual needs, for example, phone call voice or conference recording and the like in daily life of people can be used as the target voice, and after the target voice is obtained, the scheme provided by the embodiment can be used for recognizing the target voice.
S102: and extracting acoustic features of the target voice.
In this embodiment, after the target speech to be recognized is acquired in step S101, in order to accurately recognize text information corresponding to the target speech, it is necessary to extract an acoustic feature of the target speech by using a feature extraction method, and use the acoustic feature as a recognition basis for realizing effective recognition of the target speech through subsequent steps S103 to S104.
Specifically, when extracting the acoustic features of the target speech, the target speech is first framed to obtain a corresponding sequence of speech frames, and the framed sequence is pre-emphasized; the acoustic features of each speech frame are then extracted in sequence. Here, acoustic features are feature data characterizing the acoustic information of the corresponding speech frame, for example Mel-Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Prediction (PLP) features.
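For illustration, the following is a minimal sketch of this feature-extraction step (framing, pre-emphasis, MFCC). The embodiment does not prescribe a particular extraction method or library; the use of librosa, the 16 kHz sampling rate, and the 25 ms / 10 ms frame parameters are assumptions.

```python
# Minimal sketch of acoustic feature extraction; librosa and all
# parameter values are assumptions, not part of the embodiment.
import numpy as np
import librosa

def extract_acoustic_features(wav_path, n_mfcc=13, frame_ms=25, shift_ms=10):
    """Frame the target speech, pre-emphasize it, and return MFCC features."""
    y, sr = librosa.load(wav_path, sr=16000)      # mono, assumed 16 kHz
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])    # pre-emphasis, coeff 0.97
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(sr * frame_ms / 1000),          # 25 ms analysis frames
        hop_length=int(sr * shift_ms / 1000),     # 10 ms frame shift
    )
    return mfcc.T                                 # shape: (num_frames, n_mfcc)
```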
It should be noted that, the embodiment of the present application does not limit the method for extracting the acoustic features of the target speech, nor the specific extraction process, and an appropriate extraction method may be selected according to the actual situation, and corresponding feature extraction operations may be performed.
S103: according to the acoustic characteristics of the target voice and the preset word boundary length, performing first recognition on the target voice to obtain a first recognition result; and carrying out second recognition on the target voice according to the preset window length to obtain a second recognition result, wherein the preset window length is larger than the preset word boundary length.
In this embodiment, after the acoustic features of the target speech are extracted in step S102, in order to reduce the delay of the recognition result, the acoustic features may be further processed according to a preset minimum window length (i.e., a word boundary length much shorter than the forced-truncation window lengths in current use) and a recognition result of the target speech determined from the processing result; this is defined as the first recognition result. Compared with the current recognition mode that adopts a forced-truncation window length (e.g., 200 ms, far longer than a word boundary length representing the length of roughly one word), the delay of the recognition result can thus be effectively reduced.
Here, the word boundary length is a predetermined window length that keeps the delay of the recognition result small. Preferably it is determined by the word boundary prediction network introduced in a subsequent step, so as to better fit the target speech and reduce recognition delay; it may also be determined in other ways according to the actual situation, which is not limited in the embodiments of the present application. In addition, the word boundary length set in the present application is not limited to the length of a single word; speech recognition may also be performed on several words at a time.
Specifically, in an alternative implementation manner, after the acoustic features of the target speech are extracted, in order to effectively reduce the delay of the recognition result, each minimum speech unit (i.e., a phoneme, which refers to the minimum unit or the minimum speech segment constituting each syllable of the target speech) constituting the target speech may be calculated according to a preset word boundary length, which is defined as a first target phoneme, and then the first target phoneme is analyzed to realize the recognition of the target speech according to the obtained processing result, so as to obtain a first recognition result with a smaller delay.
After the first target phoneme of the target speech is calculated, the first target phoneme may be further decoded by an existing or future decoding method (such as the Viterbi algorithm) to determine the decoding result corresponding to the target speech, which serves as the first recognition result. The specific decoding process is consistent with existing methods and is not repeated here.
Meanwhile, in order to improve the accuracy of the recognition result of the target speech, after the acoustic features of the target speech are extracted in step S102, the acoustic features also need to be recognized by forced window truncation, i.e., processed according to a preset window length greater than the word boundary length (for example, 1 s), and a recognition result of the target speech determined from the processing result; this is defined as the second recognition result. In this way, the acoustic feature information of the surrounding speech within the long window is fully considered during recognition; although the delay of the recognition result is higher, its accuracy can be effectively improved.
The preset window length is a predetermined window length chosen so that the decoding result is highly accurate. Note that, to obtain a more accurate recognition result, the preset window length is usually set considerably larger than the word boundary length, for example 1 s, i.e., recognition is performed on each 1 s of target speech, so that the richer recognition basis within the large field of view yields a more accurate result.
Specifically, in an alternative implementation manner, after the acoustic features of the target speech are extracted, in order to improve the accuracy of the recognition result, each minimum speech unit (i.e., phoneme) constituting the target speech may be calculated according to a preset window length greater than the word boundary length, where the minimum speech unit is defined as a second target phoneme, and then the second target phoneme is analyzed to realize the recognition of the target speech according to the obtained processing result, so as to obtain a second recognition result with higher accuracy.
After the second target phoneme of the target speech is calculated, the second target phoneme may be further decoded by an existing or future decoding method (such as the Viterbi algorithm) to determine the decoding result corresponding to the target speech, which serves as the second recognition result. The specific decoding process is consistent with existing methods and is not repeated here.
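To illustrate the decoding step mentioned in both branches above, here is a minimal sketch of Viterbi decoding over per-frame phoneme log-probabilities. The availability of a phoneme transition matrix and the array shapes are assumptions; a production decoder would also handle lexicons and language-model scores.

```python
import numpy as np

def viterbi_decode(log_probs, log_trans):
    """Return the most likely phoneme index sequence.

    log_probs: (T, P) per-frame phoneme log-probabilities (model output).
    log_trans: (P, P) phoneme transition log-probabilities (assumed given).
    """
    T, P = log_probs.shape
    delta = np.full((T, P), -np.inf)      # best path score ending in each phoneme
    back = np.zeros((T, P), dtype=int)    # backpointers
    delta[0] = log_probs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (P, P): prev -> current
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_probs[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):         # trace backpointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```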
S104: and determining a final recognition result of the target voice according to the first recognition result and the second recognition result.
In this embodiment, after the first recognition result with a smaller delay and the second recognition result with a higher accuracy corresponding to the target voice are obtained in step S103, the distance between the first recognition result and the second recognition result may be further calculated to determine the degree of difference between the first recognition result and the second recognition result, and then according to the degree of difference between the first recognition result and the second recognition result, the second recognition result is used to make up for the deficiency of the first recognition result in accuracy.
Specifically, in an alternative implementation manner, the specific implementation process of this step S104 may include the following steps a-B:
step A: and calculating the distance between the second recognition result and the first recognition result.
In this implementation manner, after the first recognition result with a smaller delay and the second recognition result with a higher accuracy corresponding to the target speech are calculated in step S103, the distance between the first recognition result and the second recognition result can be further calculated to execute the subsequent step B. For example, the edit distance between the second recognition result and the first recognition result can be calculated according to the edit distance criterion, so as to execute the subsequent step B.
By way of example: assuming the first recognition result is "I love singing" and the second recognition result is "I love dancing", the edit distance criterion gives 2 as the number of edits required to make the second recognition result identical to the first, i.e., turning "I love dancing" into "I love singing" requires 2 edits.
It is understood that if the number of edits required to make the second recognition result identical to the first recognition result is 0, it indicates that the second recognition result is identical to the first recognition result.
Step B: when the distance is greater than a preset distance threshold, take the second recognition result as the final recognition result of the target voice; when the distance is not greater than the preset distance threshold, take the first recognition result as the final recognition result of the target voice.
It should be noted that, because the first recognition result is a recognition result obtained by processing the acoustic features of the target speech according to the preset word boundary length, the first recognition result has stronger instantaneity and smaller delay; accordingly, since the second recognition result is obtained by processing the acoustic features of the target speech according to the preset window length (i.e., the large visual field, such as 1s) with the length larger than the word boundary length, the recognition basis of the second recognition result is richer, and the accuracy is higher although the delay of the second recognition result is larger than that of the first recognition result.
Therefore, when the target voice is recognized, the accuracy of the recognition result can be improved while the delay is reduced. After the distance (e.g., the edit distance) between the second recognition result and the first recognition result is calculated in step A, it can further be judged whether this distance is greater than a preset distance threshold. If so, the recognition accuracy of the first recognition result is too low to meet the user's requirement, and the more accurate second recognition result is taken as the final recognition result of the target voice. Conversely, if the distance is not greater than the preset distance threshold, then although the first recognition result is no more accurate than the second, it can still meet the user's requirement, and in order to reduce the delay of the recognition result and achieve the streaming presentation effect, the lower-delay first recognition result can be taken as the final recognition result of the target voice.
It should be noted that specific values of the preset distance threshold may be set according to actual situations, which is not limited in the embodiment of the present application, for example, taking the distance as the edit distance, the preset distance threshold is the preset edit distance threshold, and may be set to 3.
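A minimal sketch of steps A-B above, treating the two recognition results as character sequences and using the example threshold of 3; the function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two recognition results."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # deletion
                                     dp[j - 1] + 1,       # insertion
                                     prev + (ca != cb))   # substitution
    return dp[-1]

def final_result(first, second, threshold=3):
    """Step B: keep the low-delay first result unless it drifts too far
    from the full-view second result."""
    return second if edit_distance(second, first) > threshold else first
```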
It should be further noted that, in the present application, specific implementation processes of the step S103 and the step S104 are not limited, and may be implemented by a speech recognition model described later, or other implementation manners may be selected according to actual situations. The embodiment of the present application will be described with reference to an example in which the recognition process of the above steps S103 and S104 is implemented by using a pre-constructed speech recognition model:
specifically, in step S103, the target speech is first recognized according to the acoustic feature of the target speech and the preset word boundary length, so as to obtain a first recognition result; and performing second recognition on the target voice according to the preset window length to obtain a second recognition result, wherein the implementation process of obtaining the second recognition result' may include: and inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result. Referring to fig. 2, among other things, the speech recognition model may include an input layer, a depth convolution layer, a word boundary prediction network, and a decoding layer (including a real-time decoding layer and a full-view decoding layer, as shown in fig. 2). The specific implementation process of recognizing the target speech by using the speech recognition model may include the following steps C1-C3:
step C1: inputting the acoustic features of the target voice to the depth convolution layer in sequence through the input layer; and coding the acoustic characteristics of the target voice by using the depth convolution layer to obtain a voice coding result.
In this implementation, after the acoustic feature vectors (e.g., MFCC) of the target speech are extracted, the input layer may further segment them into independent data blocks according to a fixed window length and window shift and input them to the deep convolution layer (e.g., the deep convolutional network shown in fig. 2) sequentially in time order. For example, the acoustic features of the target speech "I love Beijing Tiananmen" may be segmented with a window length and window shift of 200 ms into independent data blocks, which are input to the deep convolution layer in time order.
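A minimal sketch of this block-splitting step in the input layer, assuming a 10 ms frame shift so that 20 feature frames correspond to the 200 ms window length and window shift of the example:

```python
import numpy as np

def chunk_features(features, frames_per_block=20):
    """Split a (T, D) acoustic-feature matrix into fixed-length, non-overlapping
    data blocks (window length equals window shift, as in the example)."""
    return [features[t:t + frames_per_block]
            for t in range(0, features.shape[0], frames_per_block)]
```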
Furthermore, the acoustic features of the target speech are sequentially input into the deep convolution layer, and each frame of target speech is encoded to obtain a speech coding result (i.e., a high-order speech feature representation vector corresponding to the target speech). The deep convolution layer consists of a deep convolutional network structure; for example, it may be built from a Long Short-Term Memory (LSTM) network, and the structure may employ a 3 × 2 convolution kernel so that the recognition model's field of view covers only the historical speech frames of the current target speech and needs no information from future speech frames, ensuring that the deep convolution layer introduces no encoding delay. The specific convolutional encoding process is the same as in the prior art and is not repeated here.
Step C2: and predicting the voice coding result by using a word boundary prediction network, or predicting the acoustic characteristics of the target voice by using the word boundary prediction network to obtain the preset word boundary length.
In this implementation manner, the acoustic features of the target speech are sequentially input to the depth convolution layer through the input layer of the model, and the acoustic features of the target speech are encoded by using the depth convolution layer to obtain a speech encoding result, and then the speech encoding result may be input to the word boundary prediction network to predict the speech encoding result by using the word boundary prediction network to obtain a preset word boundary length for determining the window length and window shift adopted by the real-time decoding layer during decoding.
In particular, the word boundary prediction network may include two LSTM layers and two Deep Neural Network (DNN) layers. After the speech coding result is obtained, it may be input into the word boundary prediction network frame by frame, with several preceding and following frames spliced on so that certain context information is included, to obtain for each speech frame the probability that a word boundary occurs at that frame. This probability is denoted α and takes a value between 0 and 1: the higher the value (the closer to 1), the more likely a word-jump boundary exists at that frame position. By defining a preset probability threshold, the speech frames of the speech coding result lying between two word boundaries can then be determined and used as the preset word boundary length, after which the subsequent step C3 is executed.
For example, as shown in fig. 2, assume the preset probability threshold is 0.7 and the speech coding result is input into the word boundary prediction network frame by frame. If, when the 3rd frame is input, the word boundary prediction network outputs a boundary probability of 0.9 (α₂ in fig. 2), exceeding the preset probability threshold 0.7, a word turning point may exist there (for example, between "I" and "love" in "I love Beijing Tiananmen"), and 3 frames may then be taken as the preset word boundary length.
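For illustration, a PyTorch sketch of such a word boundary prediction network: two LSTM layers followed by two fully connected layers producing a per-frame boundary probability α. Layer sizes and the sigmoid output are assumptions.

```python
import torch
import torch.nn as nn

class WordBoundaryPredictor(nn.Module):
    """Two LSTM layers + two DNN layers, per the description; sizes assumed."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, frames):              # frames: (B, T, feat_dim)
        h, _ = self.lstm(frames)
        alpha = torch.sigmoid(self.dnn(h))  # (B, T, 1): boundary prob per frame
        return alpha.squeeze(-1)            # (B, T)

def boundary_frames(alpha, threshold=0.7):
    """Frame indices whose boundary probability exceeds the preset threshold."""
    return (alpha > threshold).nonzero(as_tuple=True)[-1].tolist()
```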
Alternatively, the acoustic features of the target speech may be directly predicted by using a word boundary prediction network to obtain a preset word boundary length, as indicated by a dashed arrow in fig. 2. And then, a real-time decoding layer can be used for decoding the voice coding result according to the preset word boundary length to obtain a first decoding result of the target voice, the specific implementation process is similar to the above process, and only the voice coding result is replaced by the acoustic feature of the target voice, which is not described herein again.
Step C3: decoding a voice coding result by utilizing a real-time decoding layer according to a preset word boundary length to obtain a first recognition result of the target voice; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
After the preset word boundary length is determined through step C2, the real-time decoding layer can further decode the speech coding result according to the preset word boundary length to obtain a decoding result, which is taken as the first recognition result of the target speech. For example, following the example in step C2, the determined speech coding result of each data block may be input into the real-time decoding layer (i.e., the small-delay streaming real-time branch shown in fig. 2) for phoneme recognition, and decoding is performed on the recognition output, so as to obtain the decoding result of the target speech "I love Beijing Tiananmen" as the first recognition result.
It should be noted that, if no word-jump boundary is determined for a long time, in order to keep the target speech recognition process and real-time decoding running normally, a maximum word boundary length may be preset, for example 500 ms. That is, when no word-jump boundary is detected within 500 ms, the speech coding result corresponding to at most 500 ms of speech is forcibly input into the real-time decoding layer for phoneme recognition, and decoding is performed on the recognition output to obtain the first decoding result of the target speech. This prevents the abnormal situation in which noise or other complex conditions in the speaker's pronunciation cause the word boundary prediction network to fail and the recognition result to be wrong.
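A minimal sketch of this segmentation logic with the forced cut; the 50-frame cap corresponds to 500 ms at an assumed 10 ms frame shift, and the 0.7 threshold is taken from the earlier example.

```python
def next_segment_end(alpha, start, max_frames=50, threshold=0.7):
    """Return the end frame of the next segment handed to the real-time decoder.

    Cuts at the first predicted word boundary after `start`; if none appears
    within `max_frames`, forces a cut so real-time decoding never stalls.
    """
    limit = min(start + max_frames, len(alpha))
    for t in range(start, limit):
        if alpha[t] > threshold:
            return t + 1        # cut just after the boundary frame
    return limit                # forced cut at the maximum word boundary length
```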
Meanwhile, when the length of the buffered or accumulated speech coding result satisfies the preset window length of the wide-view or full-view (e.g., 1s), the buffered or accumulated speech coding result with the preset window length may be input into a full-view decoding layer (i.e., the wide-view/full-view branch shown in fig. 2), so as to decode the speech coding result by using the full-view decoding layer according to the preset window length of the full-view, for example, perform recognition on the speech coding result every 1s to obtain a decoding result of the target speech, and use the decoding result as a second recognition result of the target speech.
Thus, the real-time decoding layer decodes the speech coding result according to the preset word boundary length, so the resulting first decoding result of the target speech is highly real-time and has a small delay, while the full-view decoding layer decodes the speech coding result according to the preset window length of the full field of view, so the second decoding result of the target speech is more accurate. Therefore, to preserve the streaming recognition effect, the first decoding result can be used as the preliminary recognition result of the target speech to ensure real-time decoding; meanwhile, according to the distance between the second recognition result and the first, the second recognition result is used to make up for the first recognition result's deficiency in accuracy, so that a recognition result with both small delay and high accuracy is determined as the final recognition result of the target speech.
Take, for example, the edit distance between the second recognition result and the first recognition result (i.e., the number of edits required to make the second recognition result identical to the first). If this number of edits is calculated to be 0, the second recognition result exactly matches the first, and the first recognition result may be used as the final recognition result of the target speech. If the number of edits is not 0, the two results are not fully consistent, and it is further judged whether the number of edits exceeds a preset count threshold. If it does, the first decoding result serving as the preliminary recognition result needs to be replaced by the more accurate second recognition result as the final recognition result of the target speech, which remedies the low accuracy of using the first decoding result as the preliminary recognition result. If it does not, the first recognition result can be used as the final recognition result of the target speech.
For example, as shown in fig. 2, assume the preset count threshold is 2, the first recognition result is a misrecognition along the lines of "I love Tianjin...", and the second recognition result is "I love Beijing Tiananmen". If the edit distance between the two (i.e., the number of edits required to make the second recognition result identical to the first) is 3, which exceeds the edit count threshold 2, the second recognition result "I love Beijing Tiananmen" is used instead of the first as the recognition result of the target voice.
Furthermore, in an alternative implementation, the wide-view/full-view branch of the full-view decoding layer in the speech recognition model described above may employ a conventional multi-head attention structure. To solve the problem that the mismatch between the training and test stages of the model makes recognition inaccurate, that is, to control the superposition of fields of view while increasing network depth so that the training and test stages of the model are matched under a small field of view, the small-delay streaming branch of the real-time decoding layer adopts the structure shown in fig. 3: the future window is split off into an independent branch for computation. The future-window information of the first multi-head attention layer is cached and, after transformation by a fully connected layer, serves as the future-window input of the next multi-head attention layer. The future field of view is thereby truncated, so the network stacks multiple layers without the field of view growing linearly. By keeping the network's total field-of-view length consistent with the future-window field of view of the first multi-head attention layer, the fields of view during training and use are guaranteed to match exactly.
Specifically, it should be noted that in the currently conventional multi-head attention structure, the input of the n-th multi-head attention layer is X_n, from which fully connected layers generate three vectors Q_n (Query), K_n (Key) and V_n (Value). The physical meaning this process simulates is: the Query Q_n of the current instant traverses the Keys K_n of adjacent instants to obtain the importance of each adjacent instant to the current instant, and the Value vectors V_n of the adjacent instants are weighted accordingly to obtain a new characterization vector of the current instant containing context information. Concretely, the inner product of Q_n at each instant with the adjacent-instant vectors K_n within a certain window length measures how important the information in each adjacent speech frame is to the current instant; these importance measures are normalized by a softmax layer to give a set of softmax coefficients, and the information V_n within the window length is weighted by these coefficients to obtain the new characterization vector of the current instant. Typically the window is centered on the current frame and includes history-window and future-window speech frames of certain lengths, e.g., 31 frames of history and 31 frames of future around the current frame. However, the future window introduces model delay, and stacked multi-head attention layers superimpose this hard delay linearly, while simply removing the future-window field of view causes a loss of effect.
Therefore, in order to deepen the network without increasing the field of view during model training, while retaining a future window to minimize the loss of effect, the present application, as shown in fig. 3, splits the future window into an independent branch for computation: the future-window information of the first multi-head attention layer, namely K_1 and V_1, is cached, and the network's total field-of-view length is kept consistent with the future-window field of view of the first multi-head attention layer. This guarantees that the fields of view during training and use match exactly, compensates for the effect loss caused by the training/use mismatch, and thus solves the problem that this mismatch makes the recognition result inaccurate. The conventional attention formula is accordingly adjusted from equation (1) to equation (2):
Attention(Q_n, K_n, V_n) = softmax(Q_n, K_n) × V_n    (1)
Attention(Q_n, K_n, V_n) = softmax(Q_n, K_n, K_1) × (V_n, V_1)    (2)
where Q_n, K_n and V_n denote the Query, Key and Value characterization vectors of the n-th multi-head attention layer, respectively, and K_1 and V_1 denote the Key and Value characterization vectors of the first multi-head attention layer.
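For illustration, a single-head sketch of equation (2), where the cached first-layer future-window keys and values (K_1, V_1) are concatenated with the current layer's (K_n, V_n) so that stacking layers does not widen the future field of view. Head splitting, scaling, and tensor shapes are assumptions.

```python
import math
import torch

def truncated_future_attention(Qn, Kn, Vn, K1, V1):
    """Qn, Kn, Vn: (T, d) current-layer queries/keys/values;
    K1, V1: (F, d) cached future-window keys/values from the first layer."""
    K = torch.cat([Kn, K1], dim=0)    # (T + F, d): the equation (2) concatenation
    V = torch.cat([Vn, V1], dim=0)
    scores = Qn @ K.transpose(0, 1) / math.sqrt(Qn.size(-1))
    return torch.softmax(scores, dim=-1) @ V    # (T, d) new characterizations
```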
It should be noted that, for the specific construction process of the speech recognition model, reference may be made to the following description of the second embodiment.
In summary, in the speech recognition method provided by this embodiment, a target voice to be recognized is first acquired and its acoustic features are extracted; then, first recognition is performed on the target voice according to the acoustic features and a preset word boundary length to obtain a first recognition result, and second recognition is performed on the target voice according to a preset window length to obtain a second recognition result, where the preset window length is greater than the preset word boundary length; finally, a final recognition result of the target voice is determined based on the first recognition result and the second recognition result. Thus, this embodiment recognizes the target voice both at the preset word boundary length and at the larger preset window length and combines the two recognition results to determine the final result, so that recognition remains real-time while the recognition basis is enriched, reducing the delay of the recognition result and improving its accuracy.
Second embodiment
The present embodiment will describe a process of constructing a speech recognition model mentioned in the above embodiments.
Referring to fig. 4, a schematic diagram of a process for building a speech recognition model provided in this embodiment is shown, where the process includes the following steps:
s401: sample speech is obtained.
In this embodiment, constructing a speech recognition model requires a large amount of preparation in advance. First, a large volume of speech data produced by users while speaking must be collected; for example, sound may be picked up with a microphone array, and the collection device may be a tablet computer or an intelligent hardware device such as a smart speaker, a television, or an air conditioner. At least thousands of hours of speech data are needed, covering various application scenarios (vehicle-mounted, home, etc.), and the data is denoised. Each piece of collected user speech can then be used as a sample voice, with the text corresponding to each sample voice manually labeled in advance for training the speech recognition model.
S402: and extracting acoustic features of the sample voice.
In this embodiment, after the sample speech is obtained in step S401, it cannot be used directly to train the speech recognition model; instead, a method similar to the extraction of the acoustic features of the target speech in step S102 of the first embodiment is adopted, with the sample speech taking the place of the target speech, to extract the acoustic features of each sample speech.
S403: and training the initial voice recognition model according to the acoustic characteristics of the sample voice and the text recognition label corresponding to the sample voice to generate a voice recognition model.
During the training of the current round, the target speech in the first embodiment may be replaced by the sample speech obtained in the current round, and the recognition result corresponding to the sample speech may be output through the current initial speech recognition model and according to the execution process in the first embodiment.
Specifically, according to the steps C1-C3 in the first embodiment, after the acoustic features of the sample speech are extracted, the recognition result corresponding to the sample speech is determined through the initial speech recognition model. Then, the recognition result can be compared with the manually marked text information corresponding to the sample voice, and the model parameters are updated according to the difference between the recognition result and the manually marked text information until a preset condition is met, for example, the model parameters are stopped from being updated when the preset training times are reached, the training of the voice recognition model is completed, and a trained voice recognition model is generated.
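A minimal sketch of such a training loop is given below; the patent states only that parameters are updated from the difference between the recognition result and the annotated text until a preset condition is met, so the CTC loss, the Adam optimizer, and the batch format are illustrative assumptions.

```python
import torch

def train_speech_model(model, data_loader, num_epochs=10, lr=1e-4):
    """Training-loop sketch for step S403. The CTC loss, optimizer,
    and batch layout are assumptions, not the patent's specification.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CTCLoss(blank=0)
    for epoch in range(num_epochs):           # preset training rounds
        for feats, feat_lens, labels, label_lens in data_loader:
            log_probs = model(feats).log_softmax(-1)  # assumed (T, B, C)
            loss = criterion(log_probs, labels, feat_lens, label_lens)
            optimizer.zero_grad()
            loss.backward()                   # difference drives updates
            optimizer.step()
    return model
```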
Through the above process, the speech recognition model can be trained from sample speech. Furthermore, the generated speech recognition model can be verified using verification speech; the verification process may include the following steps S501 to S504:
Step S501: Acquire verification speech.

In this embodiment, to verify the speech recognition model, verification speech must first be obtained. Verification speech refers to audio data that can be used to verify the speech recognition model; after it is acquired, the subsequent step S502 can be performed.
Step S502: acoustic features of the verification speech are extracted.
After the verification speech is acquired in step S501, it cannot be used directly to verify the speech recognition model; its acoustic features must first be extracted, and the speech recognition model is then verified according to those acoustic features.
Step S503: Input the acoustic features of the verification speech into the speech recognition model to obtain a text recognition result of the verification speech.

After the acoustic features of the verification speech are extracted in step S502, they can be input into the speech recognition model to obtain a text recognition result of the verification speech, which is used in the subsequent step S504.
Step S504: When the text recognition result of the verification speech is inconsistent with the text annotation corresponding to the verification speech, use the verification speech as sample speech again and update the speech recognition model.

After the text recognition result of the verification speech is obtained in step S503, if it is inconsistent with the manually annotated text corresponding to the verification speech, the verification speech may be used as sample speech again and the parameters of the speech recognition model updated accordingly.
Through this process, the speech recognition model can be effectively verified with verification speech: whenever the text recognition result of a verification speech disagrees with its manually annotated text, the model can be adjusted and updated in time, improving the recognition precision and accuracy of the model.
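Putting steps S501 to S504 together, a verification pass might look like the sketch below, which reuses the extract_acoustic_features sketch above; the recognize_fn helper and the exact string-match consistency test are illustrative assumptions.

```python
def verify_and_collect(model, verification_set, recognize_fn):
    """Sketch of steps S501-S504: any verification utterance whose text
    recognition result disagrees with its annotation is collected so it
    can be re-used as sample speech when updating the model.
    """
    relabelled_samples = []
    for wav_path, annotated_text in verification_set:    # S501
        feats = extract_acoustic_features(wav_path)      # S502
        recognized_text = recognize_fn(model, feats)     # S503
        if recognized_text != annotated_text:            # S504
            relabelled_samples.append((wav_path, annotated_text))
    return relabelled_samples  # fed back into training to update model
```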
In summary, with the speech recognition model trained in this embodiment, the target phonemes corresponding to the target speech can be calculated according to the preset word boundary length and the preset window length, and these phonemes serve as a richer recognition basis for recognizing the text corresponding to the target speech, so that when the target speech is recognized, the delay of the recognition result is reduced and its accuracy is improved.
Third embodiment
In this embodiment, a speech recognition apparatus is described; for related content, please refer to the method embodiments above.
Referring to fig. 6, a schematic diagram of a voice recognition apparatus provided in this embodiment is shown, where the apparatus 600 includes:
a first acquisition unit 601 configured to acquire a target voice to be recognized;
a first extraction unit 602, configured to extract an acoustic feature of the target speech;
the recognition unit 603 is configured to perform first recognition on the target speech according to the acoustic feature of the target speech and a preset word boundary length, so as to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is larger than the preset word boundary length;
a determining unit 604, configured to determine a final recognition result of the target speech according to the first recognition result and the second recognition result.
In an implementation manner of this embodiment, the identifying unit 603 is specifically configured to:
calculating a first target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and the preset word boundary length; according to the first target phoneme, the target voice is recognized to obtain a first recognition result;
calculating a second target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and a preset window length; and recognizing the target voice according to the second target phoneme to obtain a second recognition result.
In an implementation manner of this embodiment, the determining unit 604 includes:
the calculating subunit is used for calculating the distance between the second recognition result and the first recognition result;
the determining subunit is configured to, when the distance is greater than a preset distance threshold, take the second recognition result as a final recognition result of the target speech; and when the distance is not larger than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
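As an illustration of the calculating and determining subunits above, the following minimal sketch compares the two recognition results and selects the final one; the patent does not fix the distance metric or threshold, so the Levenshtein (edit) distance over result strings used here is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two recognition results."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def final_result(first: str, second: str, threshold: int = 2) -> str:
    """Keep the low-latency first result unless the full-view second
    result differs from it by more than the preset distance threshold."""
    return second if edit_distance(second, first) > threshold else first
```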
In an implementation manner of this embodiment, the identifying unit 603 is specifically configured to:
inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result;
wherein the speech recognition model comprises an input layer, a depth convolution layer, a word boundary prediction network and a decoding layer.
In an implementation manner of this embodiment, the decoding layer comprises a real-time decoding layer and a full-view decoding layer; the recognition unit 603 includes:
the coding subunit is used for sequentially inputting the acoustic features of the target voice to the depth convolution layer through the input layer; coding the acoustic characteristics of the target voice by using the depth convolution layer to obtain a voice coding result;
the prediction subunit is configured to predict the speech coding result by using the word boundary prediction network, or predict an acoustic feature of the target speech by using the word boundary prediction network to obtain the preset word boundary length;
the decoding subunit is configured to decode, by using the real-time decoding layer and according to the preset word boundary length, the speech coding result to obtain a first recognition result of the target speech; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
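The data flow through the coding, prediction, and decoding subunits might be organized as in the structural sketch below; it is illustrative only, and the layer types, sizes, and the two linear decoders standing in for the real-time and full-view decoding layers are all assumptions.

```python
import torch
import torch.nn as nn

class SpeechRecognizerSketch(nn.Module):
    """Structural sketch: input layer -> deep convolutional encoder ->
    word-boundary prediction network -> real-time and full-view
    decoders. Layer choices are illustrative assumptions.
    """
    def __init__(self, feat_dim=80, hidden=256, vocab_size=5000):
        super().__init__()
        self.encoder = nn.Sequential(               # deep conv layers
            nn.Conv1d(feat_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.boundary_net = nn.Linear(hidden, 1)    # word-boundary net
        self.rt_decoder = nn.Linear(hidden, vocab_size)  # real-time
        self.fv_decoder = nn.Linear(hidden, vocab_size)  # full-view

    def forward(self, feats):                       # feats: (B, T, F)
        enc = self.encoder(feats.transpose(1, 2)).transpose(1, 2)
        boundary_prob = torch.sigmoid(self.boundary_net(enc))
        first = self.rt_decoder(enc)    # decoded within word boundary
        second = self.fv_decoder(enc)   # decoded with the full view
        return first, second, boundary_prob
```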
In an implementation manner of this embodiment, the apparatus further includes:
a second obtaining unit configured to obtain a sample voice;
a second extraction unit, configured to extract an acoustic feature of the sample speech;
and the training unit is used for training an initial voice recognition model according to the acoustic features of the sample voice and the text recognition label corresponding to the sample voice to generate the voice recognition model.
In an implementation manner of this embodiment, the apparatus further includes:
a third acquisition unit configured to acquire a verification voice;
a third extraction unit configured to extract an acoustic feature of the verification speech;
an obtaining unit, configured to input an acoustic feature of the verification speech into the speech recognition model, and obtain a text recognition result of the verification speech;
and the updating unit is used for taking the verification voice as the sample voice again and updating the voice recognition model when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice.
Further, an embodiment of the present application further provides a speech recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the voice recognition method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation method of the foregoing speech recognition method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above-mentioned speech recognition method.
From the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application may, in essence or in part, be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A speech recognition method, comprising:
acquiring target voice to be recognized;
extracting acoustic features of the target voice;
according to the acoustic characteristics of the target voice and the preset word boundary length, performing first recognition on the target voice to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is larger than the preset word boundary length;
and determining a final recognition result of the target voice according to the first recognition result and the second recognition result.
2. The method according to claim 1, wherein performing first recognition on the target voice according to the acoustic feature of the target voice and the preset word boundary length to obtain a first recognition result, and performing second recognition on the target voice according to a preset window length to obtain a second recognition result, comprises:
calculating a first target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and the preset word boundary length; according to the first target phoneme, the target voice is recognized to obtain a first recognition result;
calculating a second target phoneme corresponding to the target voice according to the acoustic characteristics of the target voice and a preset window length; and recognizing the target voice according to the second target phoneme to obtain a second recognition result.
3. The method according to claim 2, wherein the determining a final recognition result of the target speech according to the first recognition result and the second recognition result comprises:
calculating the distance between the second recognition result and the first recognition result;
when the distance is larger than a preset distance threshold value, taking the second recognition result as a final recognition result of the target voice; and when the distance is not larger than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
4. The method according to any one of claims 1 to 3, wherein performing first recognition on the target voice according to the acoustic feature of the target voice and the preset word boundary length to obtain a first recognition result, and performing second recognition on the target voice according to a preset window length to obtain a second recognition result, comprises:
inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result;
wherein the speech recognition model comprises an input layer, a depth convolution layer, a word boundary prediction network and a decoding layer.
5. The method of claim 4, wherein the decoding layer comprises a real-time decoding layer and a full-view decoding layer, and wherein inputting the acoustic features of the target voice into a pre-constructed voice recognition model and recognizing the target voice to obtain a first recognition result and a second recognition result comprises:
sequentially inputting the acoustic features of the target voice to the deep convolutional layer through the input layer; coding the acoustic characteristics of the target voice by using the depth convolution layer to obtain a voice coding result;
predicting the voice coding result by utilizing the word boundary prediction network, or predicting the acoustic characteristics of the target voice by utilizing the word boundary prediction network to obtain the preset word boundary length;
decoding the voice coding result by utilizing the real-time decoding layer according to the preset word boundary length to obtain a first recognition result of the target voice; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
6. The method of claim 5, wherein the speech recognition model is constructed as follows:
acquiring sample voice;
extracting acoustic features of the sample voice;
and training an initial voice recognition model according to the acoustic features of the sample voice and the text recognition label corresponding to the sample voice to generate the voice recognition model.
7. The method of claim 6, further comprising:
acquiring verification voice;
extracting acoustic features of the verification voice;
inputting the acoustic features of the verification voice into the voice recognition model to obtain a text recognition result of the verification voice;
and when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice, the verification voice is used as the sample voice again, and the voice recognition model is updated.
8. A speech recognition apparatus, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a voice recognition unit, wherein the first acquisition unit is used for acquiring target voice to be recognized;
a first extraction unit, configured to extract an acoustic feature of the target speech;
the recognition unit is used for carrying out first recognition on the target voice according to the acoustic characteristics of the target voice and the preset word boundary length to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is larger than the preset word boundary length;
and the determining unit is used for determining the final recognition result of the target voice according to the first recognition result and the second recognition result.
9. A speech recognition device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-7.
10. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-7.
11. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-7.
CN202110112058.5A 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment Pending CN112908301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110112058.5A CN112908301A (en) 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112908301A true CN112908301A (en) 2021-06-04

Family

ID=76118961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110112058.5A Pending CN112908301A (en) 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112908301A (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6423296A (en) * 1987-07-17 1989-01-25 Ricoh Kk Voice section detection system
EP0430615A2 (en) * 1989-11-28 1991-06-05 Kabushiki Kaisha Toshiba Speech recognition system
US20050021330A1 (en) * 2003-07-22 2005-01-27 Renesas Technology Corp. Speech recognition apparatus capable of improving recognition rate regardless of average duration of phonemes
KR20050054378A (en) * 2003-12-04 2005-06-10 한국전자통신연구원 Recognition system of connected digit using multi-stage decoding
CN103035241A (en) * 2012-12-07 2013-04-10 中国科学院自动化研究所 Model complementary Chinese rhythm interruption recognition system and method
CN106796788A (en) * 2014-08-28 2017-05-31 苹果公司 Automatic speech recognition is improved based on user feedback
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110689881A (en) * 2018-06-20 2020-01-14 深圳市北科瑞声科技股份有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN111312233A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Voice data identification method, device and system
CN111292729A (en) * 2020-02-06 2020-06-16 北京声智科技有限公司 Method and device for processing audio data stream
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU Xiaopeng, WU Ji, LIU Qingsheng, HUANG Wenhao: "Research and Improvement on the Performance of Isolated-Word Speech Recognition Algorithms", Computer Engineering and Applications, no. 21, 1 November 2001 (2001-11-01) *
LIANG Weiqian, XU Haiguo, CHEN Yining, LIU Jia, LIU Runsheng: "A Speech Detection Algorithm for a Speech Recognition System-on-Chip", Journal of Circuits and Systems, no. 02, 30 April 2003 (2003-04-30) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409792A (en) * 2021-06-22 2021-09-17 科大讯飞股份有限公司 Voice recognition method and related equipment thereof
CN113409792B (en) * 2021-06-22 2024-02-13 中国科学技术大学 Voice recognition method and related equipment thereof
CN114724544A (en) * 2022-04-13 2022-07-08 北京百度网讯科技有限公司 Voice chip, voice recognition method, device and equipment and intelligent automobile
CN114724544B (en) * 2022-04-13 2022-12-06 北京百度网讯科技有限公司 Voice chip, voice recognition method, device and equipment and intelligent automobile
CN114974228A (en) * 2022-05-24 2022-08-30 名日之梦(北京)科技有限公司 Rapid voice recognition method based on hierarchical recognition

Similar Documents

Publication Publication Date Title
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
CN109785824B (en) Training method and device of voice translation model
KR101183344B1 (en) Automatic speech recognition learning using user corrections
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
WO2017076222A1 (en) Speech recognition method and apparatus
CN112908301A (en) Voice recognition method, device, storage medium and equipment
CN108986830B (en) Audio corpus screening method and device
CN108520741A (en) A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
CN113192535B (en) Voice keyword retrieval method, system and electronic device
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN112509560B (en) Voice recognition self-adaption method and system based on cache language model
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN112735385A (en) Voice endpoint detection method and device, computer equipment and storage medium
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN107886940B (en) Voice translation processing method and device
CN113823323A (en) Audio processing method and device based on convolutional neural network and related equipment
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
CN108538292A (en) A kind of audio recognition method, device, equipment and readable storage medium storing program for executing
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
US20230130777A1 (en) Method and system for generating voice in an ongoing call session based on artificial intelligent techniques
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN116186258A (en) Text classification method, equipment and storage medium based on multi-mode knowledge graph

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination