CN112908301B - Voice recognition method, device, storage medium and equipment

Voice recognition method, device, storage medium and equipment

Info

Publication number
CN112908301B
CN112908301B
Authority
CN
China
Prior art keywords
voice
recognition
recognition result
target voice
target
Prior art date
Legal status
Active
Application number
CN202110112058.5A
Other languages
Chinese (zh)
Other versions
CN112908301A
Inventor
申凯 (Shen Kai)
高建清 (Gao Jianqing)
Current Assignee
Iflytek Shanghai Technology Co ltd
Original Assignee
Iflytek Shanghai Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Iflytek Shanghai Technology Co ltd filed Critical Iflytek Shanghai Technology Co ltd
Priority to CN202110112058.5A priority Critical patent/CN112908301B/en
Publication of CN112908301A publication Critical patent/CN112908301A/en
Application granted granted Critical
Publication of CN112908301B publication Critical patent/CN112908301B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/05 Word boundary detection
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Abstract

The application discloses a voice recognition method, device, storage medium and equipment. The method first acquires a target voice to be recognized and extracts its acoustic features. Then, according to the acoustic features of the target voice, a first recognition of the target voice is performed according to a preset word boundary length to obtain a first recognition result, and a second recognition is performed according to a preset window length to obtain a second recognition result, where the preset window length is longer than the preset word boundary length. A final recognition result of the target voice can then be determined according to the first recognition result and the second recognition result. Because the target voice is recognized both at the preset word boundary length and at the larger preset window length, and the final recognition result is determined by combining the two results, real-time recognition is preserved while the recognition basis of the target voice is enriched: the delay of the recognition result is reduced and its accuracy is improved.

Description

Voice recognition method, device, storage medium and equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, storage medium, and device.
Background
With continuous breakthroughs in artificial intelligence technology and the growing popularity of intelligent terminal devices, human-computer interaction occurs ever more frequently in people's daily work and life. Speech is one of the most convenient and rapid interaction modes, and speech recognition is a key link in human-computer interaction.
Existing speech recognition methods generally control the future field of view and the delay of recognition results by forced windowing: the recognition model is started once a short speech segment of fixed length has been collected or buffered, so that recognition results stream in real time as the user speaks. To achieve this streaming effect, however, the forced window is generally shorter than 600 ms, which directly causes a mismatch between how the model is trained and how it is used. The recognition model usually has a deep hierarchical structure, and during training it is fed complete sentences of finished speech (which can be regarded as full-view training): for each recognized speech frame, the model can in principle see the whole sentence before and after that frame. In the test stage, by contrast, the forced-windowing recognition mode is not full-view recognition: the model sees only the speech before the current speech frame and none of the speech after it, so its future field of view is limited, and this mismatch between the training and test stages of the model makes the recognition results inaccurate.
Therefore, how to improve the accuracy of the recognition result while reducing the delay of the speech recognition result is a technical problem that currently needs to be solved.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a voice recognition method, device, storage medium and equipment that can reduce the delay of the recognition result and improve the accuracy of the recognition result when performing voice recognition.
The embodiment of the application provides a voice recognition method, which comprises the following steps:
Acquiring a target voice to be recognized;
Extracting acoustic features of the target voice;
According to the acoustic features of the target voice, performing a first recognition of the target voice according to a preset word boundary length to obtain a first recognition result, and performing a second recognition of the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is longer than the preset word boundary length;
and determining a final recognition result of the target voice according to the first recognition result and the second recognition result.
In a possible implementation manner, performing the first recognition of the target voice according to the acoustic features of the target voice and the preset word boundary length to obtain the first recognition result, and performing the second recognition of the target voice according to the preset window length to obtain the second recognition result, includes:
calculating, according to the acoustic features of the target voice and the preset word boundary length, first target phonemes corresponding to the target voice, and recognizing the target voice according to the first target phonemes to obtain the first recognition result;
and calculating, according to the acoustic features of the target voice and the preset window length, second target phonemes corresponding to the target voice, and recognizing the target voice according to the second target phonemes to obtain the second recognition result.
In a possible implementation manner, determining the final recognition result of the target voice according to the first recognition result and the second recognition result includes:
calculating the distance between the second recognition result and the first recognition result;
when the distance is greater than a preset distance threshold, taking the second recognition result as the final recognition result of the target voice; and when the distance is not greater than the preset distance threshold, taking the first recognition result as the final recognition result of the target voice.
In a possible implementation manner, performing the first recognition of the target voice according to the acoustic features of the target voice and the preset word boundary length to obtain the first recognition result, and performing the second recognition of the target voice according to the preset window length to obtain the second recognition result, includes:
inputting the acoustic features of the target voice into a pre-constructed speech recognition model, which recognizes the target voice to obtain the first recognition result and the second recognition result;
The speech recognition model comprises an input layer, a deep convolution layer, a word boundary prediction network and a decoding layer.
In a possible implementation manner, the decoding layer includes a real-time decoding layer and a full-view decoding layer, and inputting the acoustic features of the target voice into the pre-constructed speech recognition model to recognize the target voice and obtain the first recognition result and the second recognition result includes:
sequentially inputting the acoustic features of the target voice to the deep convolution layer through the input layer, and encoding the acoustic features of the target voice with the deep convolution layer to obtain a speech coding result;
predicting the speech coding result with the word boundary prediction network, or predicting the acoustic features of the target voice with the word boundary prediction network, to obtain the preset word boundary length;
decoding the speech coding result with the real-time decoding layer according to the preset word boundary length to obtain the first recognition result of the target voice; and decoding the speech coding result with the full-view decoding layer according to the preset window length of the full view to obtain the second recognition result of the target voice.
In a possible implementation manner, the voice recognition model is constructed as follows:
Acquiring sample voice;
extracting acoustic features of the sample speech;
Training an initial voice recognition model according to the acoustic characteristics of the sample voice and the text recognition label corresponding to the sample voice, and generating the voice recognition model.
In a possible implementation manner, the method further includes:
Acquiring verification voice;
Extracting acoustic features of the verification speech;
inputting the acoustic characteristics of the verification voice into the voice recognition model to obtain a text recognition result of the verification voice;
And when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice, the verification voice is taken as the sample voice again, and the voice recognition model is updated.
The embodiment of the application also provides a voice recognition device, which comprises:
the first acquisition unit is used for acquiring target voice to be recognized;
a first extraction unit for extracting acoustic features of the target voice;
The recognition unit is used for performing a first recognition of the target voice according to the acoustic features of the target voice and the preset word boundary length to obtain a first recognition result, and performing a second recognition of the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is longer than the preset word boundary length;
And the determining unit is used for determining a final recognition result of the target voice according to the first recognition result and the second recognition result.
In a possible implementation manner, the recognition unit is specifically configured to:
calculate, according to the acoustic features of the target voice and the preset word boundary length, first target phonemes corresponding to the target voice, and recognize the target voice according to the first target phonemes to obtain the first recognition result;
and calculate, according to the acoustic features of the target voice and the preset window length, second target phonemes corresponding to the target voice, and recognize the target voice according to the second target phonemes to obtain the second recognition result.
In a possible implementation manner, the determining unit includes:
A calculating subunit, configured to calculate a distance between the second recognition result and the first recognition result;
A determining subunit, configured to take the second recognition result as a final recognition result of the target voice when the distance is greater than a preset distance threshold; and when the distance is not greater than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
In a possible implementation manner, the recognition unit is specifically configured to:
input the acoustic features of the target voice into a pre-constructed speech recognition model, which recognizes the target voice to obtain the first recognition result and the second recognition result;
wherein the speech recognition model comprises an input layer, a deep convolution layer, a word boundary prediction network and a decoding layer.
In a possible implementation manner, the decoding layer includes a real-time decoding layer and a full-view decoding layer, and the recognition unit includes:
an encoding subunit, configured to sequentially input the acoustic features of the target voice to the deep convolution layer through the input layer, and to encode the acoustic features of the target voice with the deep convolution layer to obtain a speech coding result;
a prediction subunit, configured to predict the speech coding result with the word boundary prediction network, or to predict the acoustic features of the target voice with the word boundary prediction network, to obtain the preset word boundary length;
a decoding subunit, configured to decode the speech coding result according to the preset word boundary length with the real-time decoding layer to obtain the first recognition result of the target voice, and to decode the speech coding result according to the preset window length of the full view with the full-view decoding layer to obtain the second recognition result of the target voice.
In a possible implementation manner, the apparatus further includes:
The second acquisition unit is used for acquiring sample voice;
A second extraction unit for extracting acoustic features of the sample speech;
the training unit is used for training the initial voice recognition model according to the acoustic characteristics of the sample voice and the text recognition labels corresponding to the sample voice, and generating the voice recognition model.
In a possible implementation manner, the apparatus further includes:
A third acquisition unit configured to acquire a verification voice;
a third extraction unit for extracting acoustic features of the verification speech;
the obtaining unit is used for inputting the acoustic characteristics of the verification voice into the voice recognition model to obtain a text recognition result of the verification voice;
And the updating unit is used for re-using the verification voice as the sample voice and updating the voice recognition model when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice.
The embodiment of the application also provides voice recognition equipment, comprising: a processor, a memory, and a system bus;
the processor and the memory are connected through the system bus;
The memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any of the implementations of the speech recognition method described above.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions which, when run on terminal equipment, cause the terminal equipment to execute any implementation mode of the voice recognition method.
The embodiment of the application also provides a computer program product which, when run on terminal equipment, causes the terminal equipment to execute any implementation mode of the voice recognition method.
The embodiments of the application provide a voice recognition method, device, storage medium and equipment. First, a target voice to be recognized is obtained and its acoustic features are extracted. Then, according to the acoustic features of the target voice, a first recognition of the target voice is performed according to the preset word boundary length to obtain a first recognition result, and a second recognition is performed according to a preset window length longer than the preset word boundary length to obtain a second recognition result. A final recognition result of the target voice can then be determined according to the first recognition result and the second recognition result. Since the target voice is recognized both at the preset word boundary length and at the larger preset window length, and the final recognition result is determined by combining the two recognition results, the embodiment of the application preserves real-time recognition while enriching the recognition basis of the target voice, reducing the delay of the recognition result and improving its accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a speech recognition model according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of the low-latency streaming branch in the real-time decoding layer according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of constructing a speech recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for verifying a speech recognition model according to an embodiment of the present application;
fig. 6 is a schematic diagram of a voice recognition device according to an embodiment of the present application.
Detailed Description
Existing speech recognition methods generally perform recognition by forced windowing: the recognition model is started once a short speech segment of fixed length has been collected or buffered, to achieve streaming presentation of recognition results. To achieve this streaming effect, however, the window length of the forced windowing actually used is generally less than 600 ms, which directly causes a mismatch between model training and model use. The recognition model usually has a deep hierarchical structure, and during training it is fed complete sentences of finished speech (which can be regarded as full-view training): for each recognized speech frame, the model can in principle see the whole sentence before and after that frame. In the test stage, by contrast, the forced-windowing recognition mode means recognition is not full-view: only the speech before the current speech frame is visible and the speech after it is not, so the future field of view is limited, and the mismatch between the training and test stages of the model makes the recognition results inaccurate.
At present, to address the inaccurate recognition results caused by the mismatch between the model training and test stages, matched training can be adopted. However, the fields of view of the model's convolution layers and attention layers accumulate linearly as the network deepens, so a network with many layers does not lend itself structurally to matched training; instead, each sentence has to be split into data blocks of several window lengths that are computed separately at every network layer, which multiplies the training computation and training time. The inaccuracy caused by the mismatch between model training and use therefore cannot yet be effectively resolved.
To overcome the above drawbacks, the present application provides a voice recognition method: first, a target voice to be recognized is obtained and its acoustic features are extracted; then, according to the acoustic features of the target voice, a first recognition of the target voice is performed according to a preset word boundary length to obtain a first recognition result, and a second recognition is performed according to a preset window length longer than the preset word boundary length to obtain a second recognition result; a final recognition result of the target voice can then be determined according to the first recognition result and the second recognition result. In this way, the embodiment of the application recognizes the target voice both at the preset word boundary length and at the larger preset window length and determines the final recognition result by combining the two recognition results, preserving real-time recognition while enriching the recognition basis, reducing the delay of the recognition result and improving its accuracy.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
First embodiment
Referring to fig. 1, a flow chart of a voice recognition method provided in this embodiment includes the following steps:
S101: acquiring the target voice to be recognized.
In this embodiment, any voice recognized using this embodiment is defined as the target voice. The language of the target voice is not limited: it may, for example, be Chinese or English speech. Likewise, the length of the target voice is not limited: it may be a single sentence or several sentences.
It can be understood that the target voice can be obtained by recording as actually needed; for example, a telephone call from daily life or a conference recording can serve as the target voice. After the target voice is obtained, it can be recognized using the scheme provided by this embodiment.
S102: extracting the acoustic features of the target voice.
In this embodiment, after the target voice to be recognized is obtained in step S101, in order to accurately recognize the text information corresponding to it, a feature extraction method is used to extract the acoustic features of the target voice; these features serve as the recognition basis for the effective recognition of the target voice in the subsequent steps S103-S104.
Specifically, when extracting the acoustic features of the target voice, the target voice is first divided into frames to obtain a corresponding sequence of speech frames, and pre-emphasis is applied to the framed sequence; the acoustic features of each speech frame are then extracted in turn. An acoustic feature here is feature data characterizing the acoustic information of the corresponding speech frame, for example Mel-Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Predictive (PLP) features.
It should be noted that the embodiment of the present application limits neither the method used to extract the acoustic features of the target voice nor the specific extraction process; an appropriate extraction method may be selected according to the actual situation and the corresponding feature extraction performed.
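For illustration only, the feature extraction described above can be sketched in a few lines of Python. The use of the librosa library, the 16 kHz sampling rate, the 25 ms frame length with 10 ms shift, and the 0.97 pre-emphasis coefficient are assumptions of the sketch, not parameters stated in this application:

import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    # Read the target voice and apply pre-emphasis (coefficient assumed 0.97)
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # Framing: 25 ms windows with a 10 ms shift, one MFCC vector per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T  # shape: (number of speech frames, n_mfcc)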
S103: according to the acoustic characteristics of the target voice, performing first recognition on the target voice according to the preset word boundary length to obtain a first recognition result; and performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is longer than the preset word boundary length.
In this embodiment, after the acoustic features of the target voice are extracted in step S102, in order to reduce the delay of the recognition result, the acoustic features may be processed according to a preset minimum window length, i.e., a word boundary length far shorter than the window length of the forced windows currently in use; the recognition result of the target voice determined from this processing is defined here as the first recognition result. Compared with the existing recognition mode that uses a forced window length (such as 200 ms, far longer than the word boundary length, which represents the length of roughly one word), the delay of the recognition result can thus be effectively reduced.
The word boundary length is a predetermined window length that keeps the delay of the recognition result small. It is preferably determined by the word boundary prediction network introduced in the subsequent steps, so that it better fits the target voice and reduces recognition delay, but it may also be determined in other ways according to the actual situation; the embodiment of the application is not limited in this respect. Moreover, the word boundary length set in the application is not restricted to the length of a single word: one recognition pass may cover several words.
Specifically, after the acoustic features of the target voice are extracted, to effectively reduce the delay of the recognition result, each minimum speech unit composing the target voice (i.e., each phoneme, the minimum unit or minimum speech segment forming each syllable of the target voice) may be calculated according to the preset word boundary length; these are defined here as the first target phonemes. The first target phonemes are then analyzed, and the target voice is recognized from the resulting processing result, yielding a first recognition result with small delay.
In a preferred implementation, after the first target phonemes of the target voice are calculated, an existing or future decoding method (such as the Viterbi algorithm) may further be used to decode the first target phonemes and determine the corresponding decoding result of the target voice, which serves as the first recognition result. The specific decoding process is consistent with existing methods and is not described in detail here.
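Since the application defers to existing decoding methods such as the Viterbi algorithm, the following is only a generic sketch of Viterbi decoding over per-frame phoneme scores; the state space and the log-probability matrices are assumptions introduced for the illustration:

import numpy as np

def viterbi(log_emit, log_trans, log_init):
    # log_emit: (T, S) per-frame log scores over S states (e.g. phonemes)
    # log_trans: (S, S) transition log-probabilities; log_init: (S,) priors
    T, S = log_emit.shape
    delta = log_init + log_emit[0]        # best score ending in each state
    back = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]          # backtrace the best state sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]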
Meanwhile, in order to improve the accuracy of the recognition result of the target voice, after the acoustic features of the target voice are extracted in step S102, recognition is also performed in the forced-windowing mode: the acoustic features of the target voice are processed according to a preset window length (for example, 1 s) greater than the word boundary length, and the recognition result determined from this processing is defined as the second recognition result. In this way, the acoustic feature information of the contextual speech within the longer window is fully considered during recognition; although the delay of this recognition result is higher, its accuracy can be effectively improved.
The preset window length is a predetermined window length that makes the decoding result highly accurate. It should be noted that, to obtain a more accurate recognition result, the preset window length is generally set longer than the word boundary length, for example to 1 s, meaning the target voice is recognized once per second, so that the richer recognition basis of the large field of view yields a more accurate recognition result.
Specifically, after the acoustic features of the target voice are extracted, to improve the accuracy of the recognition result, each minimum speech unit (i.e., phoneme) composing the target voice may be calculated according to the preset window length greater than the word boundary length; these are defined here as the second target phonemes. The second target phonemes are then analyzed, and the target voice is recognized from the resulting processing result, yielding a more accurate second recognition result.
In a preferred implementation, after the second target phonemes of the target voice are calculated, an existing or future decoding method (such as the Viterbi algorithm) may further be used to decode the second target phonemes and determine the corresponding decoding result of the target voice, which serves as the second recognition result. The specific decoding process is consistent with existing methods and is not described in detail here.
S104: determining a final recognition result of the target voice according to the first recognition result and the second recognition result.
In this embodiment, after the lower-delay first recognition result and the more accurate second recognition result corresponding to the target voice are obtained in step S103, the distance between the first recognition result and the second recognition result can further be calculated to determine how much the two differ. According to this degree of difference, the second recognition result is then used to compensate for the first recognition result's shortfall in accuracy, and the two recognition results are processed together to determine a recognition result with small delay and high accuracy as the final recognition result of the target voice.
Specifically, in an alternative implementation manner, the implementation of step S104 may include the following steps A-B:
Step A: and calculating the distance between the second recognition result and the first recognition result.
In this implementation, after the lower-delay first recognition result and the more accurate second recognition result corresponding to the target voice are calculated in step S103, the distance between the first recognition result and the second recognition result can be calculated so as to execute the subsequent step B. For example, the edit distance between the second recognition result and the first recognition result may be calculated according to the edit distance criterion.
For example: assuming the first recognition result is 'I love singing' and the second recognition result is 'I love dancing', then according to the edit distance criterion the number of edits required to make the second recognition result identical to the first is 2; that is, turning 'I love dancing' into 'I love singing' requires 2 edits.
It will be appreciated that if the number of edits required to make the second recognition result identical to the first recognition result is 0, the two recognition results are identical.
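The edit distance computation of step A can be sketched with the standard Levenshtein dynamic program; the code below is a generic formulation, not taken from the application, and the final line reproduces the 'I love singing' / 'I love dancing' example on the Chinese characters it presumably corresponds to (我爱唱歌 / 我爱跳舞, an assumption):

def edit_distance(a, b):
    # Levenshtein distance between two token sequences (characters or words)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x -> y
        prev = cur
    return prev[-1]

assert edit_distance("我爱唱歌", "我爱跳舞") == 2  # two substitutions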
Step B: when the distance is greater than a preset distance threshold, taking the second recognition result as the final recognition result of the target voice; and when the distance is not greater than the preset distance threshold, taking the first recognition result as the final recognition result of the target voice.
It should be noted that because the first recognition result is obtained by processing the acoustic features of the target voice according to the preset word boundary length, it has strong real-time performance and small delay. Correspondingly, because the second recognition result is obtained by processing the acoustic features of the target voice according to a preset window length greater than the word boundary length (i.e., a large field of view, for example 1 s), its recognition basis is richer: although its delay is greater than that of the first recognition result, its accuracy is higher.
Therefore, consider how to reduce the delay of the recognition result while improving its accuracy when recognizing the target voice. After the distance (such as the edit distance) between the second recognition result and the first recognition result is calculated in step A, it can be judged whether this distance exceeds the preset distance threshold. If it does, the recognition accuracy of the first recognition result is too low to meet the user's needs; in that case, even though its delay is smaller, the first recognition result cannot be used as the recognition result of the target voice, and to improve the user experience the more accurate second recognition result should instead be taken as the final recognition result of the target voice. Conversely, if the distance between the second recognition result and the first recognition result is not greater than the preset distance threshold, then although the first recognition result is no more accurate than the second, it can still meet the user's needs; to reduce the delay of the recognition result and achieve the streaming presentation effect, the lower-delay first recognition result can be taken as the final recognition result of the target voice.
It should be noted that the specific value of the preset distance threshold may be set according to the actual situation and is not limited in the embodiments of the application; for example, taking the distance as the edit distance, the preset distance threshold is a preset edit distance threshold and may be set to 3.
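Continuing the sketch above, the threshold decision of step B then reduces to one comparison; edit_distance is the helper from the previous sketch, and the threshold value 3 is the example value just mentioned:

def choose_final(first, second, threshold=3):
    # Keep the low-delay first result unless the more accurate second
    # result differs from it by more than the preset distance threshold.
    return second if edit_distance(first, second) > threshold else first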
It should be further noted that the specific implementation of steps S103 and S104 is not limited: it may be realized by the speech recognition model described below, or another implementation may be chosen according to the actual situation. The following takes the recognition process that implements steps S103 and S104 with a pre-constructed speech recognition model as an example:
Specifically, in step S103, 'performing a first recognition of the target voice according to the acoustic features of the target voice and the preset word boundary length to obtain a first recognition result, and performing a second recognition of the target voice according to the preset window length to obtain a second recognition result' may include: inputting the acoustic features of the target voice into a pre-constructed speech recognition model, which recognizes the target voice to obtain the first recognition result and the second recognition result. Referring to fig. 2, the speech recognition model may include an input layer, a deep convolution layer, a word boundary prediction network, and a decoding layer (comprising a real-time decoding layer and a full-view decoding layer, as shown in fig. 2). The specific process of recognizing the target voice with the speech recognition model may include the following steps C1-C3:
Step C1: sequentially inputting the acoustic features of the target voice to the deep convolution layer through the input layer; and the acoustic features of the target voice are encoded by utilizing the depth convolution layer, so that a voice encoding result is obtained.
In this implementation, after the acoustic feature (e.g., MFCC) vectors of the target voice are extracted, the input layer may segment them into independent data blocks according to a fixed window length and window shift and input them into the deep convolution layer (the deep convolutional network shown in fig. 2) sequentially in time order. For example, the acoustic features of the target voice 'I love Beijing Tiananmen' may be segmented with a 200 ms window length and window shift into independent data blocks that are input to the deep convolution layer in time order.
Further, the acoustic features of the target voice are input to the deep convolution layer in sequence, and each frame of the target voice is encoded to obtain the speech coding result (i.e., a higher-order speech feature representation vector corresponding to the target voice). The deep convolution layer consists of a deep convolutional network structure; it may, for example, incorporate a Long Short-Term Memory (LSTM) network, and the network structure may use a 3×2 convolution kernel to ensure that the recognition model's field of view covers only the historical speech frames of the current target voice and needs no information from future target voice frames, so that the deep convolution layer introduces no delay in encoding. The specific convolutional encoding process is the same as in the prior art and is not detailed here.
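The zero-look-ahead property of the encoder can be illustrated with a causal convolution in PyTorch. This sketch simplifies the 3×2 kernel described above to a one-dimensional convolution over time, and the channel sizes and layer count are assumptions; the only property it demonstrates is that each output frame depends on current and historical frames alone:

import torch
import torch.nn as nn

class CausalConvEncoder(nn.Module):
    # Left-pads in time so no output frame ever sees a future input frame.
    def __init__(self, feat_dim=13, hidden=256, layers=3, kernel_t=3):
        super().__init__()
        blocks, in_ch = [], feat_dim
        for _ in range(layers):
            blocks += [nn.ConstantPad1d((kernel_t - 1, 0), 0.0),  # history side only
                       nn.Conv1d(in_ch, hidden, kernel_size=kernel_t),
                       nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*blocks)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)             # -> (batch, feat_dim, time)
        return self.net(x).transpose(1, 2)    # speech coding result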
Step C2: and predicting a voice coding result by using a word boundary prediction network, or predicting the acoustic characteristics of the target voice by using the word boundary prediction network to obtain a preset word boundary length.
In this implementation, the acoustic features of the target voice are sequentially input to the deep convolution layer through the model's input layer and encoded by the deep convolution layer. After the speech coding result is obtained, it may further be input to the word boundary prediction network, which predicts the preset word boundary length from it, thereby determining the window length and window shift used by the real-time decoding layer during decoding.
In particular, the word boundary prediction network may include two LSTM layers and two Deep Neural Network (DNN) layers. After the speech coding result is obtained, it can be fed into the word boundary prediction network frame by frame, with several preceding and following frames spliced on so that some context information is included, to obtain for each speech frame the probability that a word boundary occurs there. This probability is defined as α and lies between 0 and 1: the higher the value (the closer to 1), the more likely a word-jump boundary exists at that frame's position. By defining a preset probability threshold for the judgment, the speech frames of the speech coding result lying between two word boundaries can be determined and used as the preset word boundary length in the following step C3.
For example, as shown in fig. 2, assume the preset probability threshold is 0.7. When the speech coding result is input to the word boundary prediction network frame by frame and the network outputs the probability α = 0.9 at the 3rd frame, which exceeds the preset probability threshold of 0.7, a turning point between 'I' and 'love' may exist there, and 3 frames are taken as the preset word boundary length.
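A minimal sketch of the word boundary prediction network as described (two LSTM layers followed by two DNN layers, producing a per-frame probability α); the hidden sizes and the sigmoid output head are assumptions:

import torch
import torch.nn as nn

class WordBoundaryPredictor(nn.Module):
    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, enc):                # enc: (batch, time, in_dim)
        h, _ = self.lstm(enc)
        return self.dnn(h).squeeze(-1)     # alpha in (0, 1) per frame

# Frames whose alpha exceeds the preset threshold (0.7 in the example)
# are treated as word boundaries:
#   boundaries = (alpha > 0.7).nonzero()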
Alternatively, the word boundary prediction network may directly predict the preset word boundary length from the acoustic features of the target voice, as indicated by the dashed arrow in fig. 2. The real-time decoding layer can then decode the speech coding result according to the preset word boundary length to obtain the first recognition result of the target voice; the specific implementation is similar to the process above, with the speech coding result replaced by the acoustic features of the target voice, and is not repeated here.
Step C3: decoding the speech coding result with the real-time decoding layer according to the preset word boundary length to obtain the first recognition result of the target voice; and decoding the speech coding result with the full-view decoding layer according to the preset window length of the full view to obtain the second recognition result of the target voice.
In this implementation, after the preset word boundary length is determined in step C2, the real-time decoding layer further decodes the speech coding result according to the preset word boundary length, and the resulting decoding result serves as the first recognition result of the target voice. For example, following the example in step C2, the speech coding result of each determined data block may be input into the real-time decoding layer (i.e., the low-latency streaming real-time branch shown in fig. 2) for phoneme recognition and decoded according to the recognition result, yielding the decoding result of the target voice 'I love Beijing Tiananmen' as the first recognition result.
It should be noted that if no word-jump boundary is determined for a long time, a maximum word boundary length may be preset, for example 500 ms, to keep the recognition process of the target voice running normally and decoding in real time: when no word-jump boundary is detected within 500 ms, the speech coding result corresponding to at most 500 ms of speech is forcibly input into the real-time decoding layer for phoneme recognition and decoded according to the recognition result, and the decoding result is used as the first recognition result of the target voice. This prevents incorrect recognition results in the abnormal case where the word boundary prediction network fails, for example in complex scenes where the speaker's pronunciation is accompanied by noise.
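The interplay between detected boundaries and the forced maximum word boundary length can be sketched as a segmentation loop; assuming 100 frames per second, max_frames=50 corresponds to the 500 ms example above:

def stream_segments(alphas, threshold=0.7, max_frames=50):
    # Yield (start, end) frame spans handed to the real-time decoding layer.
    # A span closes at a detected word boundary, or is forcibly flushed
    # after max_frames if no boundary has been detected.
    start = 0
    for t, alpha in enumerate(alphas):
        if alpha > threshold or (t - start + 1) >= max_frames:
            yield (start, t + 1)
            start = t + 1
    if start < len(alphas):
        yield (start, len(alphas))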
Meanwhile, whenever the buffered or accumulated speech coding results reach the preset large-field or full-field window length (for example, 1 s), the buffered speech coding result of that preset window length can be input into the full-view decoding layer (i.e., the large-field/full-field branch shown in fig. 2), which decodes the speech coding result according to the preset window length of the full view, for example recognizing each 1 s of speech coding results once, to obtain the decoding result of the target voice as its second recognition result.
In this way, because the real-time decoding layer decodes the speech coding result according to the preset word boundary length, the resulting first recognition result of the target voice is highly real-time and has small delay, while the full-view decoding layer decodes the speech coding result according to the preset window length of the full view, so the resulting second recognition result is more accurate. Therefore, to guarantee the streaming recognition effect of the voice, the first recognition result can serve as the preliminary recognition result of the target voice, ensuring real-time decoding; at the same time, according to the distance between the second recognition result and the first recognition result, the second recognition result compensates for the first's shortfall in accuracy, so that a recognition result with small delay and high accuracy is determined as the final recognition result of the target voice.
Take the edit distance between the second recognition result and the first recognition result (i.e., the number of edits required to make the second recognition result identical to the first) as an example. If the calculated number of edits is 0, the second recognition result coincides completely with the first, and the first recognition result may be taken as the final recognition result of the target voice. If the number of edits is not 0, the two results do not fully agree, and it must further be judged whether the number of edits exceeds a preset threshold. If it does, the first decoding result serving as the preliminary recognition result must be replaced at this point by the more accurate second recognition result as the final recognition result of the target voice, resolving the low accuracy that would result from keeping the first decoding result as the preliminary recognition result. If it does not, the first recognition result can be taken as the final recognition result of the target voice.
For example, as shown in fig. 2, assume the preset threshold on the number of edits is 2, the first recognition result is 'me is next to the heaven' and the second recognition result is 'me is loved to the beijing'; their edit distance (i.e., the number of edits required to make the second recognition result the same as the first) is 3, which exceeds the edit-count threshold of 2, so the second recognition result replaces the first as the recognition result of the target voice.
Furthermore, in an alternative implementation, the large-field/full-field branch in the full-view decoding layer of the speech recognition model described above may adopt a conventional multi-head attention structure. To address the inaccurate recognition results that the mismatch between the model's training and test stages may cause, that is, to control the accumulation of the field of view while increasing network depth so that the training and test stages are matched under a small field of view, the low-latency streaming branch in the real-time decoding layer adopts the structure shown in fig. 3: the future window is split off into an independent branch for computation. The future-window information of the first multi-head attention layer is cached and, after conversion by a fully connected layer, used as the future-window input of the next multi-head attention layer, so that the future-window field of view is truncated and the network can be stacked in multiple layers without its field of view growing linearly. By keeping the network's total field-of-view length consistent with the future-window field of view of the first multi-head attention layer, the fields of view during model training and use are kept fully matched.
Specifically, in a conventional multi-head attention structure, the n-th multi-head attention layer takes input X_n and generates three vectors Q_n (Query), K_n (Key) and V_n (Value) through fully connected layers. The physical meaning this process simulates is: taking the current moment's Q_n as a query, traverse the K_n of the neighboring moments to obtain the importance of each neighboring moment to the current moment, and weight the neighboring moments' vectors V_n by this importance to obtain a new representation vector of this moment that contains context information. In this embodiment, the inner product of each moment's Q_n with the neighboring moments' K_n within a certain window length is computed to obtain the importance of the information contained in the neighboring speech frames to the current moment, and this importance is normalized by a softmax layer into a set of softmax coefficients; the information V_n within the window length is weighted by these coefficients to obtain the new representation vector of the current moment. Typically the window is centered on the current frame and includes a certain length of historical-window and future-window speech frames, e.g. 31 frames of history and 31 frames of the future around the current window. However, future windows introduce model delay, and stacking multiple multi-head attention layers superimposes the future-window hard delay linearly, while removing the future-window field of view entirely would cause a loss of accuracy.
Therefore, in order to deepen the network without enlarging its field of view, while retaining future windows to minimize the loss of accuracy in model training, the application splits the future window into an independent branch for computation, as shown in fig. 3: the future-window information of the first multi-head attention layer, namely K_1 and V_1, is cached and, after conversion by the fully connected layer, used as the future-window input of the next multi-head attention layer. The future-window field of view is thereby truncated, and the network can be stacked in multiple layers without its field of view growing linearly. The conventional attention formula (1) is accordingly adjusted to formula (2):
Attention(Q_n, K_n, V_n) = softmax(Q_n K_n^T) V_n    (1)
Attention(Q_n, K_n, V_n) = softmax(Q_n [K_n, K_1]^T) [V_n, V_1]    (2)
where Q_n, K_n and V_n denote the Query, Key and Value representation vectors of the n-th multi-head attention layer, and K_1 and V_1 denote the Key and Value representation vectors of the first multi-head attention layer, respectively.
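For illustration, a single-head PyTorch sketch of equation (2): the current layer attends over its own (history-only) keys and values concatenated with the first layer's cached future-window K_1 and V_1 after a fully connected conversion. The single-head simplification, the dimensions, and the 1/sqrt(d) scaling are assumptions of the sketch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TruncatedFutureAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.fc_k1 = nn.Linear(dim, dim)  # conversion of cached K_1
        self.fc_v1 = nn.Linear(dim, dim)  # conversion of cached V_1

    def forward(self, x, k1_cache, v1_cache):
        # x: (batch, T_hist, dim) history window of the current layer;
        # k1_cache, v1_cache: (batch, T_future, dim) from the first layer.
        q, k, v = self.q(x), self.k(x), self.v(x)
        k = torch.cat([k, self.fc_k1(k1_cache)], dim=1)  # (K_n, K_1)
        v = torch.cat([v, self.fc_v1(v1_cache)], dim=1)  # (V_n, V_1)
        att = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return att @ v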
It should be further noted that, for a specific construction process of the speech recognition model, reference may be made to the description of the second embodiment.
In summary, the voice recognition method provided by this embodiment first obtains a target voice to be recognized and extracts its acoustic features; then, according to the acoustic features of the target voice, it performs a first recognition of the target voice according to a preset word boundary length to obtain a first recognition result, and a second recognition according to a preset window length longer than the preset word boundary length to obtain a second recognition result; a final recognition result of the target voice may then be determined according to the first recognition result and the second recognition result. Because the target voice is recognized both at the preset word boundary length and at the larger preset window length, and the final recognition result is determined by combining the two recognition results, the embodiment of the application preserves real-time recognition while enriching the recognition basis of the target voice, reducing the delay of the recognition result and improving its accuracy.
Second embodiment
This embodiment will explain the construction process of the speech recognition model mentioned in the above embodiment.
Referring to fig. 4, a schematic flow chart of constructing a speech recognition model according to the present embodiment is shown, where the flow includes the following steps:
S401: acquiring sample speech.
In this embodiment, considerable preparation is needed before the speech recognition model can be constructed. First, a large amount of speech data uttered by users while speaking must be collected, for example through a microphone array; the pickup device may be a tablet computer or an intelligent hardware device such as a smart speaker, a television or an air conditioner. Usually at least thousands of hours of speech data need to be collected and denoised, covering various application scenarios (such as vehicle-mounted and home devices). Each collected piece of user speech data can then serve as a sample voice, and the text information corresponding to each sample voice is labeled manually in advance for training the speech recognition model.
S402: acoustic features of the sample speech are extracted.
In this embodiment, after the sample speech is obtained through step S401, it cannot be used directly to train the speech recognition model. Instead, a method similar to that used for extracting the acoustic features of the target voice in step S102 of the first embodiment is applied, with the sample speech taking the place of the target voice, so that the acoustic features of each sample speech can be extracted.
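As a hypothetical illustration of this step, the following sketch extracts 80-dimensional log-Mel filterbank features with librosa; the feature type, dimensions, and frame parameters are assumptions, since the present application defers the concrete extraction method to step S102 of the first embodiment.

```python
import numpy as np
import librosa

# one second of stand-in audio at 16 kHz (a real system would load the
# collected sample speech here)
audio = np.random.randn(16000).astype(np.float32)

# 80-dim log-Mel filterbank features, 25 ms windows with 10 ms hops
mel = librosa.feature.melspectrogram(y=audio, sr=16000, n_fft=400,
                                     hop_length=160, n_mels=80)
feats = np.log(mel + 1e-6).T   # shape: (frames, 80)
```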
S403: training the initial speech recognition model according to the acoustic characteristics of the sample speech and the text recognition labels corresponding to the sample speech, and generating a speech recognition model.
During training, the target voice in the first embodiment can be replaced by the sample speech obtained in step S401, and the recognition result corresponding to the sample speech can be output by the current initial speech recognition model according to the execution process of the first embodiment.
Specifically, following steps C1 to C3 of the first embodiment, after the acoustic features of the sample speech are extracted, the recognition result corresponding to the sample speech can be determined by the initial speech recognition model. This recognition result is then compared with the manually labeled text information corresponding to the sample speech, and the model parameters are updated according to the difference between the two. When a preset condition is met, for example when the preset number of training rounds is reached, the updating of the model parameters is stopped, the training of the speech recognition model is completed, and a trained speech recognition model is generated.
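A minimal sketch of such a training loop is given below, assuming a toy network, cross-entropy loss, and synthetic data; the present application fixes only the stopping rule (update parameters until a preset condition, e.g. a preset number of training rounds, is met), so everything else here is illustrative.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a toy network, cross-entropy loss, and random
# data; the real model, loss, and data are not specified at this level.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 100))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 80)           # stand-in acoustic features
labels = torch.randint(0, 100, (32,))    # stand-in manually labeled text ids

max_steps = 200                          # the "preset number of training rounds"
for step in range(max_steps):
    logits = model(features)             # recognition result of the current model
    loss = criterion(logits, labels)     # difference vs. the manual labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # update the model parameters
# after max_steps, updating stops and the trained model is kept
```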
Through the above process, the speech recognition model can be generated by training on sample speech. Furthermore, the generated speech recognition model can be verified using verification speech. The specific verification process may include the following steps S501-S504:
step S501: and acquiring verification voice.
In this embodiment, in order to verify the speech recognition model, verification speech is first acquired. The verification speech refers to audio information that can be used to verify the speech recognition model; after it is acquired, the following step S502 can be performed.
Step S502: the acoustic features of the verification speech are extracted.
After the verification speech is obtained in step S501, it cannot be used directly to verify the speech recognition model; the acoustic features of the verification speech need to be extracted first, and the trained speech recognition model is then verified according to these acoustic features.
Step S503: and inputting the acoustic characteristics of the verification voice into a voice recognition model to obtain a text recognition result of the verification voice.
After the acoustic features of the verification voice are extracted in step S502, further, the acoustic features of the verification voice may be input into a voice recognition model to obtain a text recognition result of the verification voice, so as to execute the subsequent step S504.
Step S504: and when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice, the verification voice is taken as the sample voice again, and the voice recognition model is updated.
After the text recognition result of the verification speech is obtained in step S503, if it is inconsistent with the manually labeled text corresponding to the verification speech, the verification speech can be used as sample speech again and the parameters of the speech recognition model can be updated.
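The following sketch illustrates steps S501-S504 under stated assumptions: the extract_features and recognize helpers and the toy model are hypothetical stand-ins, and only the recycle-on-mismatch logic reflects this embodiment.

```python
def extract_features(audio):
    """Hypothetical stand-in for the feature extraction of step S502."""
    return audio

def recognize(model, feats):
    """Hypothetical stand-in for the model inference of step S503."""
    return model(feats)

def verify_and_update(model, verification_set, training_set):
    # Step S504: any utterance whose recognition result disagrees with its
    # manual transcript is recycled as a new training sample.
    for audio, reference_text in verification_set:
        feats = extract_features(audio)        # step S502
        hypothesis = recognize(model, feats)   # step S503
        if hypothesis != reference_text:       # step S504
            training_set.append((audio, reference_text))
    return training_set  # the model is then updated on the enlarged set

# toy usage: a "model" that uppercases its input text
model = str.upper
samples = verify_and_update(model, [("hello", "HELLO"), ("hi", "Hi")], [])
print(samples)  # [('hi', 'Hi')] -- misrecognized, so re-added as a sample
```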
Through the above process, the speech recognition model can be effectively verified using verification speech, and when the text recognition result of the verification speech is inconsistent with the manually labeled text, the model can be adjusted and updated in time, which improves the recognition precision and accuracy of the model.
In summary, the speech recognition model trained in this embodiment can calculate the target phonemes corresponding to the target voice according to both the preset word boundary length and the preset window length, and use them as a richer recognition basis for recognizing the text information corresponding to the target voice, so that when the target voice is recognized, the delay of the recognition result is reduced and its accuracy is improved.
Third embodiment
This embodiment describes a voice recognition apparatus; for related content, reference is made to the method embodiments above.
Referring to fig. 6, a schematic diagram of a voice recognition apparatus according to the present embodiment is provided, and the apparatus 600 includes:
a first obtaining unit 601, configured to obtain a target voice to be recognized;
a first extraction unit 602, configured to extract acoustic features of the target voice;
a recognition unit 603, configured to perform first recognition on the target voice according to the acoustic features of the target voice and a preset word boundary length, so as to obtain a first recognition result; and perform second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is longer than the preset word boundary length;
a determining unit 604, configured to determine a final recognition result of the target voice according to the first recognition result and the second recognition result.
In one implementation manner of this embodiment, the identifying unit 603 is specifically configured to:
According to the acoustic characteristics of the target voice, calculating a first target phoneme corresponding to the target voice according to a preset word boundary length; identifying the target voice according to the first target phonemes to obtain a first identification result;
according to the acoustic characteristics of the target voice, calculating a second target phoneme corresponding to the target voice according to a preset window length; and identifying the target voice according to the second target phoneme to obtain a second identification result.
In one implementation of the present embodiment, the determining unit 604 includes:
A calculating subunit, configured to calculate a distance between the second recognition result and the first recognition result;
A determining subunit, configured to take the second recognition result as a final recognition result of the target voice when the distance is greater than a preset distance threshold; and when the distance is not greater than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
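A minimal sketch of the calculating and determining subunits is given below; the present application does not fix the distance measure, so character-level edit (Levenshtein) distance and the threshold value are assumptions of the example.

```python
# Levenshtein distance and the threshold are assumptions of the example;
# the patent only requires "a distance" and "a preset distance threshold".
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def final_result(first: str, second: str, threshold: int = 2) -> str:
    # distance > threshold: prefer the wider-context second result;
    # otherwise keep the low-latency first result
    return second if edit_distance(second, first) > threshold else first

print(final_result("recognize speech", "wreck a nice beach"))  # second wins
```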
In one implementation manner of this embodiment, the identifying unit 603 is specifically configured to:
inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result;
The speech recognition model comprises an input layer, a deep convolution layer, a word boundary prediction network and a decoding layer.
In one implementation of this embodiment, the decoding layers include a real-time decoding layer and a full-view decoding layer; the identifying unit 603 includes:
An encoding subunit, configured to sequentially input, through the input layer, acoustic features of the target speech to the deep convolutional layer; the acoustic features of the target voice are encoded by utilizing the depth convolution layer, and a voice encoding result is obtained;
The prediction subunit is used for predicting the voice coding result by using the word boundary prediction network or predicting the acoustic characteristics of the target voice by using the word boundary prediction network to obtain the preset word boundary length;
A decoding subunit, configured to decode the speech coding result according to the preset word boundary length by using the real-time decoding layer, so as to obtain a first recognition result of the target speech; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
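For orientation, the following structural sketch wires the components in the order described above (input layer, deep convolution layer, word boundary prediction network, and real-time plus full-view decoding layers); all layer types, widths, and shapes are illustrative assumptions, not the architecture prescribed by the present application.

```python
import torch
import torch.nn as nn

class SpeechRecognitionSketch(nn.Module):
    """Wires the described components in order; all layer choices are
    illustrative assumptions, not the patented architecture."""
    def __init__(self, feat_dim=80, hidden=256, vocab=100):
        super().__init__()
        # stand-in for the deep convolution (encoding) layer
        self.encoder = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        # stand-in word boundary prediction network (per-frame score)
        self.boundary = nn.Linear(hidden, 1)
        # stand-ins for the real-time and full-view decoding layers
        self.rt_decoder = nn.Linear(hidden, vocab)
        self.fv_decoder = nn.Linear(hidden, vocab)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        enc = self.encoder(feats.transpose(1, 2)).transpose(1, 2)
        boundaries = torch.sigmoid(self.boundary(enc))  # word-boundary cue
        first = self.rt_decoder(enc)   # first recognition (limited view)
        second = self.fv_decoder(enc)  # second recognition (full view)
        return first, second, boundaries

model = SpeechRecognitionSketch()
first, second, boundaries = model(torch.randn(1, 50, 80))
```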
In one implementation of this embodiment, the apparatus further includes:
The second acquisition unit is used for acquiring sample voice;
A second extraction unit for extracting acoustic features of the sample speech;
the training unit is used for training the initial voice recognition model according to the acoustic characteristics of the sample voice and the text recognition labels corresponding to the sample voice, and generating the voice recognition model.
In one implementation of this embodiment, the apparatus further includes:
A third acquisition unit configured to acquire a verification voice;
a third extraction unit for extracting acoustic features of the verification speech;
the obtaining unit is used for inputting the acoustic characteristics of the verification voice into the voice recognition model to obtain a text recognition result of the verification voice;
And the updating unit is used for re-using the verification voice as the sample voice and updating the voice recognition model when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice.
Further, the embodiment of the application also provides a voice recognition device, which comprises: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
The memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any of the implementations of the speech recognition method described above.
Further, the embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores instructions, and when the instructions run on the terminal equipment, the instructions cause the terminal equipment to execute any implementation method of the voice recognition method.
Further, the embodiment of the application also provides a computer program product, which when being run on a terminal device, causes the terminal device to execute any implementation method of the voice recognition method.
From the above description of embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above described example methods may be implemented in software plus necessary general purpose hardware platforms. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present description, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
It is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method of speech recognition, comprising:
Acquiring target voice to be recognized;
Extracting acoustic features of the target voice;
According to the acoustic characteristics of the target voice, performing first recognition on the target voice according to the preset word boundary length to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is longer than the preset word boundary length;
And determining a final recognition result of the target voice according to the distance between the first recognition result and the second recognition result and a preset distance threshold.
2. The method according to claim 1, wherein the first recognition is performed on the target voice according to the acoustic characteristics of the target voice and the preset word boundary length to obtain a first recognition result; and performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the second recognition result comprises:
According to the acoustic characteristics of the target voice, calculating a first target phoneme corresponding to the target voice according to a preset word boundary length; identifying the target voice according to the first target phonemes to obtain a first identification result;
according to the acoustic characteristics of the target voice, calculating a second target phoneme corresponding to the target voice according to a preset window length; and identifying the target voice according to the second target phoneme to obtain a second identification result.
3. The method according to claim 2, wherein the determining the final recognition result of the target voice according to the distance between the first recognition result and the second recognition result and a preset distance threshold value includes:
Calculating the distance between the second recognition result and the first recognition result;
When the distance is larger than a preset distance threshold, the second recognition result is used as a final recognition result of the target voice; and when the distance is not greater than the preset distance threshold, taking the first recognition result as a final recognition result of the target voice.
4. A method according to any one of claims 1-3, wherein the first recognition is performed on the target speech according to the acoustic characteristics of the target speech and a preset word boundary length, so as to obtain a first recognition result; and performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the second recognition result comprises:
inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result;
The speech recognition model comprises an input layer, a deep convolution layer, a word boundary prediction network and a decoding layer.
5. The method of claim 4, wherein the decoding layers comprise a real-time decoding layer and a full-view decoding layer; inputting the acoustic characteristics of the target voice into a pre-constructed voice recognition model, and recognizing the target voice to obtain a first recognition result and a second recognition result, wherein the method comprises the following steps:
Sequentially inputting the acoustic features of the target voice to the deep convolution layer through the input layer; the acoustic features of the target voice are encoded by utilizing the depth convolution layer, and a voice encoding result is obtained;
Predicting the voice coding result by using the word boundary prediction network, or predicting the acoustic characteristics of the target voice by using the word boundary prediction network to obtain the preset word boundary length;
decoding the voice coding result by utilizing the real-time decoding layer according to the preset word boundary length to obtain a first recognition result of the target voice; and decoding the voice coding result by utilizing the full-view decoding layer according to the preset window length of the full view to obtain a second recognition result of the target voice.
6. The method of claim 5, wherein the speech recognition model is constructed as follows:
Acquiring sample voice;
extracting acoustic features of the sample speech;
Training an initial voice recognition model according to the acoustic characteristics of the sample voice and the text recognition label corresponding to the sample voice, and generating the voice recognition model.
7. The method of claim 6, wherein the method further comprises:
Acquiring verification voice;
Extracting acoustic features of the verification speech;
inputting the acoustic characteristics of the verification voice into the voice recognition model to obtain a text recognition result of the verification voice;
And when the text recognition result of the verification voice is inconsistent with the text marking result corresponding to the verification voice, the verification voice is taken as the sample voice again, and the voice recognition model is updated.
8. A speech recognition apparatus, comprising:
the first acquisition unit is used for acquiring target voice to be recognized;
a first extraction unit for extracting acoustic features of the target voice;
The recognition unit is used for carrying out first recognition on the target voice according to the acoustic characteristics of the target voice and the preset word boundary length to obtain a first recognition result; performing second recognition on the target voice according to a preset window length to obtain a second recognition result, wherein the preset window length is longer than the preset word boundary length;
and the determining unit is used for determining the final recognition result of the target voice according to the distance between the first recognition result and the second recognition result and a preset distance threshold value.
9. A speech recognition device, comprising: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
The memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to perform the method of any of claims 1-7.
11. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-7.
CN202110112058.5A 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment Active CN112908301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110112058.5A CN112908301B (en) 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110112058.5A CN112908301B (en) 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN112908301A CN112908301A (en) 2021-06-04
CN112908301B true CN112908301B (en) 2024-06-11

Family

ID=76118961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110112058.5A Active CN112908301B (en) 2021-01-27 2021-01-27 Voice recognition method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112908301B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113409792B (en) * 2021-06-22 2024-02-13 中国科学技术大学 Voice recognition method and related equipment thereof
CN114724544B (en) * 2022-04-13 2022-12-06 北京百度网讯科技有限公司 Voice chip, voice recognition method, device and equipment and intelligent automobile
CN114974228B (en) * 2022-05-24 2023-04-11 名日之梦(北京)科技有限公司 Rapid voice recognition method based on hierarchical recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005043666A (en) * 2003-07-22 2005-02-17 Renesas Technology Corp Voice recognition device

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6423296A (en) * 1987-07-17 1989-01-25 Ricoh Kk Voice section detection system
EP0430615A2 (en) * 1989-11-28 1991-06-05 Kabushiki Kaisha Toshiba Speech recognition system
KR20050054378A (en) * 2003-12-04 2005-06-10 한국전자통신연구원 Recognition system of connected digit using multi-stage decoding
CN103035241A (en) * 2012-12-07 2013-04-10 中国科学院自动化研究所 Model complementary Chinese rhythm interruption recognition system and method
CN106796788A (en) * 2014-08-28 2017-05-31 苹果公司 Automatic speech recognition is improved based on user feedback
CN106469552A (en) * 2015-08-20 2017-03-01 三星电子株式会社 Speech recognition apparatus and method
CN106652999A (en) * 2015-10-29 2017-05-10 三星Sds株式会社 System and method for voice recognition
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model
CN108932941A (en) * 2017-10-13 2018-12-04 北京猎户星空科技有限公司 Audio recognition method, device and computer equipment, storage medium and program product
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110689881A (en) * 2018-06-20 2020-01-14 深圳市北科瑞声科技股份有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN111312233A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Voice data identification method, device and system
CN111292729A (en) * 2020-02-06 2020-06-16 北京声智科技有限公司 Method and device for processing audio data stream
CN111667817A (en) * 2020-06-22 2020-09-15 平安资产管理有限责任公司 Voice recognition method, device, computer system and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research and improvement of isolated word speech recognition algorithm performance; Xu Xiaopeng, Wu Ji, Liu Qingsheng, Huang Wenhao; Computer Engineering and Applications; 2001-11-01 (21); full text *
Speech detection algorithm applied to a speech recognition system-on-chip; Liang Weiqian, Xu Haiguo, Chen Yining, Liu Jia, Liu Runsheng; Journal of Circuits and Systems; 2003-04-30 (02); full text *

Also Published As

Publication number Publication date
CN112908301A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
CN112908301B (en) Voice recognition method, device, storage medium and equipment
CN108564942B (en) Voice emotion recognition method and system based on adjustable sensitivity
JP5255769B2 (en) Topic-specific models for text formatting and speech recognition
US8280733B2 (en) Automatic speech recognition learning using categorization and selective incorporation of user-initiated corrections
CN111968679A (en) Emotion recognition method and device, electronic equipment and storage medium
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
CN112735385B (en) Voice endpoint detection method, device, computer equipment and storage medium
CN109584906B (en) Method, device and equipment for evaluating spoken language pronunciation and storage equipment
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN112786052A (en) Speech recognition method, electronic device and storage device
JP2004310098A (en) Method for speech recognition using variational inference with switching state spatial model
CN113192535B (en) Voice keyword retrieval method, system and electronic device
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN111640456A (en) Overlapped sound detection method, device and equipment
CN111639529A (en) Speech technology detection method and device based on multi-level logic and computer equipment
CN114530141A (en) Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof
CN107886940B (en) Voice translation processing method and device
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
JP3081108B2 (en) Speaker classification processing apparatus and method
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
Andra et al. Improved transcription and speaker identification system for concurrent speech in Bahasa Indonesia using recurrent neural network
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN112820281B (en) Voice recognition method, device and equipment
CN111833869B (en) Voice interaction method and system applied to urban brain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant