WO2020073509A1

WO2020073509A1 - Neural network-based speech recognition method, terminal device, and medium

Info

Publication number: WO2020073509A1
Application number: PCT/CN2018/124306
Authority: WO
Inventors: 王义文; 王健宗; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-10-11
Filing date: 2018-12-27
Publication date: 2020-04-16
Also published as: CN109559735A; CN109559735B

Abstract

The present application is applicable to the technical field of artificial intelligence, and provides a neural network-based speech recognition method, a terminal device, and a medium. The neural network-based speech recognition method comprises: obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments; performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment; determining a first probability vector of the speech segment on a probability calculation layer of a preset neural network model on the basis of the feature vector of the speech segment, the value of each element in the first probability vector being used for identifying a probability that the pronunciation of the speech segment is a preset phoneme corresponding to the element; and determining a text sequence corresponding to the speech sequence on a joint timing classification layer of the preset neural network model on the basis of the first probability vectors of all the speech segments. Therefore, time costs and labor costs of speech recognition are saved.

Description

Neural network-based speech recognition method, terminal device, and medium

This application claims priority to Chinese patent application No. 201811182186.1, filed on October 11, 2018 and titled "A Neural Network-based Speech Recognition Method, Terminal Equipment and Medium", the entire content of which is incorporated into this application by reference.

Technical field

The present application belongs to the field of artificial intelligence technology, and particularly relates to a neural network-based speech recognition method, terminal device, and non-volatile readable storage medium.

Background

Speech recognition is the process of converting a speech sequence into a text sequence. With the rapid development of artificial intelligence technology, speech recognition models based on machine learning are widely used in various speech recognition scenarios.

However, when training a traditional machine learning-based speech recognition model, for each frame of speech data in the speech sequence to be recognized, the corresponding pronunciation phonemes need to be known in advance in order to effectively train the speech recognition model. Before training the speech recognition model, the speech sequence and the text sequence should be frame aligned. The sample data used to train the model is relatively large. It takes a lot of manpower and time to perform frame alignment processing on the speech sequence and the text sequence contained in each sample data, and the labor cost and time cost are relatively high.

Technical problem

Embodiments of the present application provide a neural network-based speech recognition method, terminal device, and non-volatile readable storage medium, to solve the problem that existing speech recognition methods based on traditional speech recognition models incur high labor and time costs.

Technical solution

A first aspect of the embodiments of the present application provides a speech recognition method based on a neural network, including:

Obtaining a speech sequence to be recognized, and dividing the speech sequence into at least two frames of speech segments;

Performing acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;

Determining, in the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector identifies the probability that the pronunciation of the speech segment is the preset phoneme corresponding to that element;

Determining, in the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.

A second aspect of the embodiments of the present application provides a terminal device, including:

A first dividing unit, configured to obtain a speech sequence to be recognized and divide the speech sequence into at least two frames of speech segments;

A feature extraction unit, configured to perform acoustic feature extraction on each speech segment to obtain a feature vector of the speech segment;

A first determining unit, configured to determine, in the probability calculation layer of a preset neural network model, a first probability vector of the speech segment based on the feature vector of the speech segment, where the value of each element in the first probability vector identifies the probability that the pronunciation of the speech segment is the preset phoneme corresponding to that element;

A second determining unit, configured to determine, in the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments.

A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:

A fourth aspect of the embodiments of the present application provides a non-volatile readable storage medium storing computer-readable instructions, where the following steps are implemented when the computer-readable instructions are executed by a processor:

Beneficial effects

The neural network-based speech recognition method provided by the embodiments of the present application divides the speech sequence to be recognized into at least two frames of speech segments and extracts the feature vector of each frame of speech segment; determines, in the probability calculation layer of the preset neural network model, the first probability vector of each speech segment based on its feature vector; and determines, in the joint time-series classification layer of the preset neural network model, the text sequence corresponding to the speech sequence to be recognized based on the first probability vectors of all the speech segments. Because the joint time-series classification layer in the preset neural network model can directly determine the text sequence corresponding to the speech sequence to be recognized from the first probability vectors of all its speech segments, the speech sequences and text sequences in the sample data used for model training do not need to be frame-aligned when the preset neural network model is trained, which saves the time cost and labor cost of speech recognition.

Brief description of the drawings

FIG. 1 is an implementation flowchart of a neural network-based speech recognition method provided by the first embodiment of the present application;

FIG. 2 is a specific implementation flowchart of S13 in a neural network-based speech recognition method provided by a second embodiment of the present application;

FIG. 3 is a specific implementation flowchart of S14 in a neural network-based speech recognition method provided by a third embodiment of the present application;

FIG. 4 is an implementation flowchart of a neural network-based speech recognition method provided by a fourth embodiment of the present application;

FIG. 5 is a structural block diagram of a terminal device provided by an embodiment of the present application;

FIG. 6 is a structural block diagram of a terminal device provided by another embodiment of the present application.

Embodiments of the invention

In order to make the purpose, technical solutions and advantages of the present application more clear, the following describes the present application in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

Please refer to FIG. 1, which is an implementation flowchart of a neural network-based speech recognition method provided in the first embodiment of the present application. In this embodiment, the execution subject of the speech recognition method based on the neural network is a terminal device. Terminal devices include but are not limited to smartphones, tablets or desktop computers.

The neural network-based speech recognition method shown in FIG. 1 includes the following steps:

S11: Acquire a speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments.

The voice sequence refers to a piece of voice data whose duration is greater than a preset duration threshold, where the preset duration threshold is greater than zero. The speech sequence to be recognized is a speech sequence that needs to be translated into a text sequence.

When a voice sequence needs to be translated into a corresponding text sequence, a voice recognition request for the voice sequence can be triggered on the terminal device. The request carries the voice sequence to be recognized and asks the terminal device to translate it into a corresponding text sequence. For example, when a user chats with a contact through an instant messaging application installed on the terminal device and receives a voice sequence from the other party, the user can have the voice sequence translated into a corresponding text sequence for viewing when needed. Specifically, the user can long-press or right-click the voice display icon corresponding to the voice sequence to make the terminal device display a menu bar for that voice sequence, and then trigger the voice recognition request for the voice sequence from that menu bar.

When detecting a voice recognition request for a certain voice sequence, the terminal device extracts the voice sequence to be recognized from the voice recognition request, and divides the extracted voice sequence into at least two frames of voice segments.

As an embodiment of the present application, the terminal device may divide the voice sequence to be recognized into at least two frames of voice segments in the following manner, that is, S11 may specifically include the following steps:

Frame the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.

In this embodiment, the preset frame length identifies the duration of each frame of speech segment obtained by framing the speech sequence, and the preset frame shift amount identifies the time step used when the framing operation is performed on the speech sequence.

After obtaining the speech sequence to be recognized, the terminal device starts from the starting time point of the speech sequence and, at every interval of the preset frame shift amount, intercepts a piece of speech of the preset frame length, thereby dividing the speech sequence to be recognized into at least two frames of speech segments. The duration of each frame of speech segment obtained by the framing operation is the preset frame length, and the starting time points of every two adjacent frames of speech segments are separated by the preset frame shift amount.

It should be noted that, in the embodiment of the present application, the preset frame shift amount is smaller than the preset frame length. That is, every two adjacent frames of speech segments overlap in time, and the duration of the overlapping portion is the difference between the preset frame length and the preset frame shift amount. In practical applications, both the preset frame length and the preset frame shift amount can be set according to actual needs. For example, if the preset frame length is set to 25 milliseconds and the preset frame shift amount is set to 10 milliseconds, then after the terminal device frames the speech sequence based on these two values, every two adjacent frames of speech segments overlap by 25 - 10 = 15 milliseconds.
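The framing operation described above can be illustrated with a minimal Python sketch. The function name `frame_signal`, the 16 kHz sample rate, and the decision to drop a trailing partial frame are assumptions made for illustration; the application does not specify them.

```python
def frame_signal(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Split a sample list into overlapping frames of equal duration."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    # Assumption: a trailing piece shorter than one frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# 100 ms of dummy audio at a hypothetical 16 kHz sample rate.
signal = list(range(1600))
frames = frame_signal(signal, 16000)
overlap_samples = len(frames[0]) - (frames[1][0] - frames[0][0])
```

With these values each frame holds 400 samples, adjacent frames start 160 samples apart, and the shared 240 samples correspond to the 15 millisecond overlap in the example.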

S12: Acoustic feature extraction is performed on the speech segment to obtain a feature vector of the speech segment.

In the embodiment of the present application, because the time-domain form of each frame of speech segment obtained by framing the speech sequence to be recognized has almost no descriptive power for that segment, the terminal device, after dividing the speech sequence to be recognized into at least two frames of speech segments, performs acoustic feature extraction on each frame of speech segment based on a preset feature extraction network to obtain the feature vector of each frame of speech segment. The feature vector of a speech segment contains the acoustic feature information of that speech segment.

The preset feature extraction network can be set according to actual needs and is not limited here. For example, as an embodiment of this application, the preset feature extraction network may be a Mel Frequency Cepstral Coefficient (MFCC) feature extraction network. It should be noted that, since the MFCC feature extraction network is existing technology, its principle is not described in detail here.

S13: In the probability calculation layer of the preset neural network model, determine the first probability vector of the speech segment based on the feature vector of the speech segment; the value of each element in the first probability vector identifies the probability that the pronunciation of the speech segment is the preset phoneme corresponding to that element.

In the embodiment of the present application, after the terminal device determines the feature vector of each frame of speech segment obtained by framing the speech sequence to be recognized, it imports the feature vectors of all these speech segments into the preset neural network model. The preset neural network model is obtained by training a pre-built original neural network model with a machine learning algorithm on a preset number of sample data. Each piece of sample data is composed of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to that speech sequence.

The original neural network model includes a probability calculation layer and a joint time-series classification layer connected in sequence. Specifically:

The probability calculation layer is used to determine the first probability vector of the speech segment based on the feature vector of the speech segment. The value of each element in the first probability vector is used to identify the probability that the speech segment is pronounced as the preset phoneme corresponding to the element. The number of elements contained in the first probability vector is the same as the number of preset phonemes. Among them, phonemes are the smallest phonetic units divided according to the natural attributes of speech. Phonemes usually include two major categories of vowel phonemes and consonant phonemes. The preset phonemes can be set according to actual needs, without limitation here. In the embodiment of the present application, the preset phoneme includes at least one blank phoneme. For example, the preset phoneme may include a blank phoneme and all vowel phonemes and consonant phonemes in Chinese pinyin. The joint timing classification layer is used to determine the text sequence corresponding to the speech sequence based on the first probability vectors of all speech segments obtained by framing the speech sequence to be recognized.

When the original neural network model is trained, the feature vectors of all the speech segments obtained by framing the speech sequence contained in each piece of sample data are used as the input of the original neural network model, and the text sequence corresponding to the speech sequence contained in each piece of sample data is used as the output of the original neural network model. The original neural network model is trained in this way, and the trained model is determined as the preset neural network model in the embodiment of the present application. It should be noted that, during training, the terminal device learns in the probability calculation layer the probability of the feature vector of every speech segment appearing in the sample data relative to each preset phoneme.

As an embodiment of the present application, after the terminal device imports the feature vectors of all the speech segments obtained by framing the speech sequence to be recognized into the preset neural network model, it may determine the first probability vector of each frame of speech segment in the following manner:

S131: In the probability calculation layer, determine the probability of the feature vector of each of the at least two frames of speech segments relative to each preset phoneme, based on the pre-learned probability of the feature vector of each preset speech segment relative to each preset phoneme.

S132: Determine a first probability vector of the speech segment based on the probability of the feature vector of the speech segment relative to each preset phoneme.

In this embodiment, the preset speech segments include all the speech segments that have appeared in the sample data, and the pre-learned probability of the feature vector of each preset speech segment relative to each preset phoneme is the probability, learned in advance, of the feature vector of each speech segment that has appeared in the sample data relative to each preset phoneme.

After the terminal device imports the feature vectors of all the speech segments obtained by framing the speech sequence to be recognized into the preset neural network model, the probability calculation layer of the preset neural network model determines, based on the pre-learned probabilities of the feature vectors of all the preset speech segments relative to each preset phoneme, the probability of the feature vector of each frame of speech segment of the speech sequence to be recognized relative to each preset phoneme. The probabilities of the feature vector of a speech segment relative to all the preset phonemes constitute the first probability vector of that speech segment.
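To make the shape of a first probability vector concrete, the following Python sketch turns one raw score per preset phoneme into a probability vector with a softmax. The application does not state how the probability calculation layer computes its probabilities, so the softmax, the phoneme list, and the scores are all assumptions chosen for illustration.

```python
import math

# Hypothetical preset phonemes; "-" stands for the blank phoneme.
PHONEMES = ["a", "i", "o", "-"]


def first_probability_vector(scores):
    """Softmax (an assumption): map one score per phoneme to probabilities."""
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores produced for one frame of speech segment.
scores = [2.0, 0.5, 0.1, 1.0]
probs = first_probability_vector(scores)
```

The resulting vector has one element per preset phoneme, including the blank phoneme, and its elements sum to 1.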

S14: Determine the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments in the joint time-series classification layer of the preset neural network model.

In the embodiment of the present application, after the terminal device determines the first probability vector of each frame of speech segment obtained by framing the speech sequence to be recognized, it determines, in the joint time-series classification layer of the preset neural network model, the text sequence corresponding to the speech sequence to be recognized based on the first probability vectors of all the speech segments.

In practical applications, assume that the total number of preset phonemes is N and that T frames of speech segments are obtained after framing the speech sequence to be recognized. Because the pronunciation phoneme corresponding to each frame of speech segment may be any one of the N preset phonemes, there are N^T possible pronunciation phoneme sequences for the speech sequence to be recognized. In this embodiment, these N^T pronunciation phoneme sequences are taken as the preset pronunciation phoneme sequences, and each of them is a sequence of length T composed of at least one of the preset phonemes.

Specifically, as an embodiment of the present application, S14 may be implemented through S141 to S143 shown in FIG. 3, and the details are as follows:

S141: In the joint time-series classification layer of the preset neural network model, calculate the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula; the value of each element in the pronunciation phoneme probability vector identifies the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element. The preset probability calculation formula is as follows:

$$P_i = \prod_{t=1}^{T} y_{it}$$

where P_i represents the value of the i-th element in the pronunciation phoneme probability vector, i ∈ [1, N^T]; T represents the total number of speech segments obtained by framing the speech sequence; N represents the total number of preset phonemes; N^T represents the total number of preset pronunciation phoneme sequences of length T formed by combining at least one of the N preset phonemes; and y_{it} represents the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, t ∈ [1, T]. The value of the prior probability corresponding to the t-th pronunciation phoneme is determined according to the first probability vector of the t-th speech segment.

In this embodiment, after the terminal device determines the first probability vector of each frame of speech segment obtained by framing the speech sequence to be recognized, it calculates, in the joint time-series classification layer of the preset neural network model, the pronunciation phoneme probability vector of the speech sequence to be recognized based on the first probability vectors of all these speech segments and the above preset probability calculation formula. The value of each element in the pronunciation phoneme probability vector identifies the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element.

For example, suppose the preset phonemes include the following 4 phonemes: a, i, o, and the blank phoneme "-", and 5 frames of speech segments are obtained after framing the speech sequence to be recognized, that is, T = 5. Because the pronunciation phoneme corresponding to each of the 5 frames may be any of the preset phonemes, there are 4^5 = 1024 possible pronunciation phoneme sequences for the speech sequence to be recognized. Among these 1024 preset pronunciation phoneme sequences, assume that the first one is [a, a, i, -, -]. Then the prior probabilities corresponding to its five pronunciation phonemes are, in order: the probability, determined at the probability calculation layer, of the feature vector of the first frame of speech segment relative to the preset phoneme a; the probability of the feature vector of the second frame relative to the preset phoneme a; the probability of the feature vector of the third frame relative to the preset phoneme i; the probability of the feature vector of the fourth frame relative to the blank phoneme "-"; and the probability of the feature vector of the fifth frame relative to the blank phoneme "-". The terminal device multiplies the prior probabilities corresponding to all the elements in the first preset pronunciation phoneme sequence to obtain the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is the first preset pronunciation phoneme sequence. In the same way, it obtains the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is each of the other preset pronunciation phoneme sequences. These probabilities together constitute the pronunciation phoneme probability vector of the speech sequence.
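The multiplication described in this example can be sketched in Python as follows. The per-frame probabilities in `FRAME_PROBS` are invented values used only to make the arithmetic visible, and the names `path_probability` and `pron_probs` are hypothetical.

```python
import itertools
import math

PHONEMES = ["a", "i", "o", "-"]   # "-" is the blank phoneme

# Hypothetical first probability vectors for T = 5 frames.
# Columns follow PHONEMES; each row sums to 1.
FRAME_PROBS = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.7],
]


def path_probability(path, frame_probs):
    """Multiply the per-frame probability of each phoneme along the path."""
    return math.prod(
        frame_probs[t][PHONEMES.index(p)] for t, p in enumerate(path)
    )


# All N^T = 4^5 preset pronunciation phoneme sequences and their probabilities.
paths = list(itertools.product(PHONEMES, repeat=len(FRAME_PROBS)))
pron_probs = [path_probability(p, FRAME_PROBS) for p in paths]

# Probability of the sequence [a, a, i, -, -]: 0.7 * 0.6 * 0.7 * 0.6 * 0.7.
example = path_probability(("a", "a", "i", "-", "-"), FRAME_PROBS)
```

Because every row of `FRAME_PROBS` sums to 1, the 1024 entries of `pron_probs` also sum to 1, so the pronunciation phoneme probability vector is itself a probability distribution over the preset sequences.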

S142: Determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, where the preset text sequences are obtained by compressing the preset pronunciation phoneme sequences.

In practical applications, a preset pronunciation phoneme sequence usually contains some blank phonemes, or some of its adjacent elements correspond to the same phoneme. Therefore, after determining the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is each preset pronunciation phoneme sequence, the terminal device compresses each preset pronunciation phoneme sequence to obtain its corresponding text sequence. It then converts the probability that the pronunciation phoneme sequence corresponding to the speech sequence to be recognized is each preset pronunciation phoneme sequence into the probability that the text sequence corresponding to the speech sequence to be recognized is the text sequence corresponding to that preset pronunciation phoneme sequence. That is, the probability of each preset pronunciation phoneme sequence is taken as the probability of the text sequence obtained by compressing it.

In this embodiment, the terminal device compressing the preset pronunciation phoneme sequence may specifically include: excluding blank phonemes in the preset pronunciation phoneme sequence, and at the same time, only one of the consecutive elements with the same value may be retained. For example, if a preset pronunciation phoneme sequence is [a, a, i,-,-], the text sequence obtained after compression processing is [a, i].
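A minimal Python sketch of this compression step is given below. The text says the two operations happen "at the same time" without fixing an order; the sketch assumes consecutive repeats are merged first and blank phonemes are removed afterwards, which is the conventional order for this kind of classification layer. The name `compress` is hypothetical.

```python
BLANK = "-"


def compress(path):
    """Merge consecutive repeats, then drop blanks (assumed order)."""
    merged = []
    previous = None
    for phoneme in path:
        if phoneme != previous:      # keep only the first of a run
            merged.append(phoneme)
        previous = phoneme
    return [p for p in merged if p != BLANK]


text = compress(["a", "a", "i", "-", "-"])   # [a, i], as in the example above
```

Under the assumed order, a blank between two identical phonemes keeps them distinct, so `["a", "-", "a"]` compresses to `["a", "a"]` rather than `["a"]`.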

In practical applications, different pronunciation phoneme sequences may yield the same text sequence after compression. For example, the text sequence obtained by compressing the pronunciation phoneme sequence [a, a, i, -, -] is [a, i], and the text sequence obtained by compressing the pronunciation phoneme sequence [a, -, i, i, -] is also [a, i]. Therefore, in the embodiment of the present application, when at least two of the preset pronunciation phoneme sequences correspond to the same text sequence, the terminal device sums the probabilities of those preset pronunciation phoneme sequences to obtain the probability of that text sequence. In this way it obtains the probability that the text sequence corresponding to the speech sequence to be recognized is each preset text sequence. The preset text sequences are all the distinct text sequences obtained by compressing the preset pronunciation phoneme sequences. The probabilities that the text sequence corresponding to the speech sequence to be recognized is each preset text sequence constitute the text sequence probability vector of the speech sequence to be recognized.

S143: Determine the preset text sequence corresponding to the element with the largest value in the probability vector of the text sequence as the text sequence corresponding to the speech sequence.

The greater the value of an element in the text sequence probability vector, the greater the probability that the text sequence corresponding to the speech sequence to be recognized is the preset text sequence corresponding to that element. Therefore, in this embodiment, after determining the text sequence probability vector of the speech sequence to be recognized, the terminal device determines the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence to be recognized.
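Steps S141 to S143 can be combined into one brute-force Python sketch: enumerate every preset pronunciation phoneme sequence, multiply its per-frame probabilities, compress it, sum the probabilities of sequences that compress to the same text, and pick the largest. The three-phoneme set, the probabilities in `frame_probs`, the function names, and the merge-then-remove-blanks order inside `compress` are assumptions for illustration. The enumeration grows as N^T, so this only shows the logic on a tiny input.

```python
import itertools
import math
from collections import defaultdict

PHONEMES = ["a", "i", "-"]   # hypothetical preset phonemes, "-" is blank
BLANK = "-"


def compress(path):
    """Merge consecutive repeats, then drop blanks (assumed order)."""
    merged, previous = [], None
    for phoneme in path:
        if phoneme != previous:
            merged.append(phoneme)
        previous = phoneme
    return tuple(p for p in merged if p != BLANK)


def decode(frame_probs):
    """Return the most probable text sequence and all text probabilities."""
    text_probs = defaultdict(float)
    indices = range(len(PHONEMES))
    for path in itertools.product(indices, repeat=len(frame_probs)):
        # S141: probability of this pronunciation phoneme sequence.
        prob = math.prod(frame_probs[t][k] for t, k in enumerate(path))
        # S142: sequences with the same compressed text share one entry.
        text_probs[compress(PHONEMES[k] for k in path)] += prob
    # S143: the text sequence with the largest summed probability.
    best = max(text_probs, key=text_probs.get)
    return best, dict(text_probs)


# Hypothetical first probability vectors for T = 3 frames.
frame_probs = [
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.8, 0.1],
]
best, text_probs = decode(frame_probs)
```

Here five different pronunciation phoneme sequences compress to the text sequence (a, i), and their probabilities add up to 0.656, which makes it the selected output.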

It can be seen from the above that the method divides the speech sequence to be recognized into at least two frames of speech segments and extracts the feature vector of each speech segment; determines, in the probability calculation layer of the preset neural network model, the first probability vector of each speech segment based on its feature vector; and determines, in the joint time-series classification layer of the preset neural network model, the text sequence corresponding to the speech sequence to be recognized based on the first probability vectors of all the speech segments. Because the joint time-series classification layer in the preset neural network model can directly determine the text sequence corresponding to the speech sequence to be recognized from the first probability vectors of all its speech segments, the speech sequences and text sequences in the sample data used for model training do not need to be frame-aligned when the preset neural network model in this embodiment is trained, which saves the time cost and labor cost of speech recognition.

Please refer to FIG. 4. FIG. 4 is an implementation flowchart of a neural network-based speech recognition method according to a fourth embodiment of the present application. Compared with the embodiment corresponding to FIG. 1, a neural network-based speech recognition method provided in this embodiment may include S01 to S04 before S11, as described in detail as follows:

S01: Obtain a preset sample data set, and divide the sample data set into a training set and a test set; each piece of sample data in the sample data set is composed of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to that speech sequence.

Before the speech sequence to be recognized is translated into the corresponding text sequence, the original neural network model needs to be constructed first. The original neural network model includes a probability calculation layer and a joint time-series classification layer connected in sequence. For the specific structure and principle of the probability calculation layer and the joint time-series classification layer, please refer to the relevant description in S13 of the first embodiment, which is not repeated here.

After constructing the original neural network model, the terminal device obtains a preset sample data set. Each sample data in the sample data set is composed of the feature vectors of all the speech fragments obtained by framing a speech sequence and the text sequence corresponding to the speech sequence.

After acquiring the preset sample data set, the terminal device may divide the sample data set into a training set and a test set based on a preset allocation ratio. The training set is used to train the original neural network model, and the test set is used to verify the accuracy of the trained original neural network model. The preset allocation ratio can be set according to actual needs, and is not limited here. For example, the preset allocation ratio may be training set : test set = 3 : 1; that is, 3/4 of the sample data in the sample data set is used to train the original neural network model, and 1/4 of the sample data is used to verify the accuracy of the trained original neural network model.
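The 3 : 1 division can be sketched as follows. The function name `split_samples`, the shuffling step with a fixed seed, and the toy sample layout are assumptions for illustration; the application only specifies that the set is divided according to a preset ratio.

```python
import random


def split_samples(samples, train_parts=3, test_parts=1, seed=0):
    """Shuffle the samples and split them by the ratio train_parts:test_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumed extra step
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]


# Each sample pairs the per-frame feature vectors with the reference text.
samples = [([[0.0, 0.1], [0.2, 0.3]], f"text {i}") for i in range(8)]
train_set, test_set = split_samples(samples)
```

With eight samples and the default ratio, six samples go to the training set and two to the test set.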

S02: Train a pre-built original neural network model based on the training set, and determine values of each preset parameter included in the feature extraction layer and the joint time series classification layer of the original neural network model.

In this embodiment, the terminal device trains the pre-built original neural network model based on the training set. When training the original neural network model, the feature vectors of all speech fragments obtained by framing the speech sequence contained in each sample data are used as the input of the original neural network model, and the text sequence corresponding to the speech sequence contained in each sample data is used as the output of the original neural network model. The probability calculation layer learns the probabilities of the feature vectors of all the speech fragments appearing in the sample data relative to each preset phoneme, whereupon the training of the original neural network model is completed.

S03: Verify the original neural network model that has completed training based on the test set.

In this embodiment, after the terminal device completes the training of the original neural network model based on the training set, the original neural network model that has completed the training is verified based on the test set.

Specifically, when verifying the trained original neural network model based on the test set, the terminal device uses the feature vectors of all speech fragments obtained by framing the speech sequence contained in each sample data as the input of the trained original neural network model, and determines, through the trained original neural network model, the predicted value of the text sequence corresponding to the speech sequence in each sample data in the test set.

The terminal device calculates the prediction error of the trained original neural network model based on the text sequence corresponding to the speech sequence in each sample data in the test set and the predicted value of the text sequence corresponding to the speech sequence in each sample data. The prediction error of the trained original neural network model is used to identify the speech recognition accuracy of the trained original neural network model.

In this embodiment, after obtaining the prediction error of the trained original neural network model, the terminal device compares the prediction error with a preset error threshold, and determines the verification result of the trained original neural network model based on the comparison result. The preset error threshold is the allowable error value of speech recognition accuracy in practical applications.

If the comparison result is that the prediction error of the trained original neural network model is less than or equal to the preset error threshold, the speech recognition accuracy of the trained original neural network model is within the allowable error range, and the terminal device determines the verification result of the trained original neural network model as verification passed. If the comparison result is that the prediction error of the trained original neural network model is greater than the preset error threshold, the speech recognition accuracy of the trained original neural network model exceeds the allowable error range, and the terminal device determines the verification result of the trained original neural network model as verification failed.
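The threshold comparison can be sketched as follows. The application does not state how the prediction error is computed, so this sketch assumes it is the fraction of test samples whose predicted text sequence differs from the reference text sequence; the function names, the threshold value, and the sample texts are likewise hypothetical.

```python
def prediction_error(references, predictions):
    """Fraction of test samples whose predicted text differs from the reference."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    wrong = sum(1 for ref, hyp in zip(references, predictions) if ref != hyp)
    return wrong / len(references)


def verify(references, predictions, error_threshold):
    """Return True (verification passed) when the error is within the threshold."""
    return prediction_error(references, predictions) <= error_threshold


references = ["ni hao", "zai jian", "xie xie", "qing jin"]
predictions = ["ni hao", "zai jian", "xie xie", "qing jing"]
passed = verify(references, predictions, error_threshold=0.3)
```

Here one of four predictions is wrong, so the error is 0.25 and the model passes under a threshold of 0.3 but would fail under a threshold of 0.2.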

S04: If the verification is passed, the original neural network model that has completed training is determined to be the preset neural network model.

In this embodiment, if the terminal device detects that the trained original neural network model has passed the verification, it determines the trained original neural network model as the preset neural network model.

As can be seen from the above, this embodiment trains the pre-built original neural network model through a training set containing a certain number of sample data, and verifies the speech recognition accuracy of the trained original neural network model through a test set containing a certain number of sample data. After the verification is passed, the trained original neural network model is used as the preset neural network model for subsequent speech recognition, thereby improving the accuracy of speech recognition.

Please refer to FIG. 5, which is a structural block diagram of a terminal device according to an embodiment of the present application. Each unit included in the terminal device is used to execute each step in the embodiments corresponding to FIGS. 1 to 4. For details, please refer to the related descriptions in FIGS. 1 to 4 and the embodiments corresponding to FIGS. 1 to 4. For ease of explanation, only parts related to this embodiment are shown. Referring to FIG. 5, the terminal device 500 includes: a first dividing unit 51, a feature extraction unit 52, a first determining unit 53, and a second determining unit 54, wherein:

The first dividing unit 51 is used to obtain a speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments.

The feature extraction unit 52 is configured to perform acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment.

The first determining unit 53 is used to determine the first probability vector of the speech segment based on the feature vector of the speech segment in the probability calculation layer of the preset neural network model; the value of each element in the first probability vector is used to identify the probability that the pronunciation of the speech segment is the preset phoneme corresponding to the element.

The second determining unit 54 is used to determine the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments in the joint time-series classification layer of the preset neural network model.

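As an embodiment of the present application, the first dividing unit 51 is specifically used for: framing the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.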
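Framing a speech sequence by a preset frame length and a preset frame shift amount can be sketched as follows. The function name `frame_signal`, the sample rate, the 25 ms / 10 ms values, and the choice to drop a trailing incomplete frame (so that every segment has exactly the preset frame length) are assumptions for illustration.

```python
def frame_signal(samples, frame_length, frame_shift):
    """Split samples into frames of frame_length, advancing by frame_shift."""
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    # A trailing piece shorter than frame_length is dropped (assumed behavior).
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


# Assumed values: 16 kHz audio, 25 ms frames (400 samples), 10 ms shift (160 samples).
signal = list(range(1000))
frames = frame_signal(signal, frame_length=400, frame_shift=160)
```

Because the frame shift is smaller than the frame length, adjacent frames overlap; in this example each frame shares 240 samples with its predecessor.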

As an embodiment of the present application, the first determining unit 53 includes a first probability determination unit and a second probability determination unit, wherein:

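The first probability determination unit is used to determine, in the probability calculation layer, the probabilities of the feature vectors of the at least two frames of speech segments relative to each of the preset phonemes, based on the pre-learned probabilities of the feature vectors of each preset speech segment relative to each preset phoneme.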

The second probability determination unit is used to determine the first probability vector of the speech segment based on the probability of the feature vector of the speech segment relative to each of the preset phonemes.

As an embodiment of the present application, the second determining unit 54 includes: a first calculation unit, a third probability determination unit, and a text sequence determination unit, wherein:

The first calculation unit is used to calculate the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula in the joint time-series classification layer of the preset neural network model; the value of each element in the pronunciation phoneme probability vector is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to the element. The preset probability calculation formula is as follows:

$$p_i = \prod_{t=1}^{T} y_{it}$$

where $p_i$ represents the value of the i-th element in the pronunciation phoneme probability vector, $i \in [1, N^T]$, T represents the total number of speech segments obtained by framing the speech sequence, N represents the total number of preset phonemes, $N^T$ represents the total number of preset pronunciation phoneme sequences of length T formed by combining at least one of the N preset phonemes, and $y_{it}$ represents the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$; the value of the prior probability corresponding to the t-th pronunciation phoneme is determined according to the first probability vector of the t-th speech segment.

The third probability determination unit is used to determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector is used to identify the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to the element, and the preset text sequence is obtained by compressing the preset pronunciation phoneme sequence.

The text sequence determining unit is configured to determine the preset text sequence corresponding to the element with the largest value in the probability vector of the text sequence as the text sequence corresponding to the speech sequence.
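The three units above can be sketched end to end on a toy input. The sketch assumes that the probability of each preset pronunciation phoneme sequence is the product of the per-frame probabilities of its phonemes, and that "compressing" follows the standard connectionist temporal classification (CTC) rule of merging consecutive repeats and then removing a blank symbol. The blank symbol, the phoneme set, the probabilities, and the function names are hypothetical.

```python
import itertools
import math
from collections import defaultdict

BLANK = "_"  # assumed blank symbol


def compress(path):
    """Merge consecutive repeats, then drop blanks (assumed compression rule)."""
    merged = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return tuple(p for p in merged if p != BLANK)


def decode(frame_probs, phonemes):
    """Return (best_text, text_probs) by enumerating all N**T phoneme sequences."""
    text_probs = defaultdict(float)
    for path in itertools.product(range(len(phonemes)), repeat=len(frame_probs)):
        # One element of the pronunciation phoneme probability vector:
        # the product over frames of the probability of the chosen phoneme.
        p_path = math.prod(frame_probs[t][n] for t, n in enumerate(path))
        # Sequences that compress to the same text share one element
        # of the text sequence probability vector.
        text_probs[compress([phonemes[n] for n in path])] += p_path
    best_text = max(text_probs, key=text_probs.get)
    return best_text, dict(text_probs)


phonemes = [BLANK, "n", "i"]
frame_probs = [  # toy first probability vectors for two frames
    [0.1, 0.7, 0.2],
    [0.1, 0.3, 0.6],
]
best_text, text_probs = decode(frame_probs, phonemes)
```

Enumerating all N**T sequences is only practical for toy sizes; it is used here because it mirrors the description directly. Practical CTC decoders typically rely on dynamic programming or beam search instead.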

As an embodiment of the present application, the terminal device 500 may further include: a first acquisition unit, a model training unit, a model verification unit, and a model generation unit, wherein:

The first obtaining unit is used to obtain a preset sample data set, and divide the sample data set into a training set and a test set; each sample data in the sample data set is composed of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to the speech sequence.

The model training unit is used to train the pre-built original neural network model based on the training set, and determine the values of each preset parameter included in the feature extraction layer and the joint time series classification layer of the original neural network model.

The model verification unit is used to verify the original neural network model that has completed training based on the test set.

The model generating unit is configured to determine the original neural network model that has completed training as the preset neural network model if the verification is passed.

FIG. 6 is a structural block diagram of a terminal device according to another embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes: a processor 60, a memory 61, and computer-readable instructions 62 stored in the memory 61 and executable on the processor 60, for example, a program of the neural network-based speech recognition method. When the processor 60 executes the computer-readable instructions 62, the steps in the embodiments of the above neural network-based speech recognition method are implemented, for example, S11 to S14 shown in FIG. 1. Alternatively, when the processor 60 executes the computer-readable instructions 62, the functions of the units in the embodiment corresponding to FIG. 5 described above are implemented, for example, the functions of the units 51 to 54 shown in FIG. 5. For details, please refer to the relevant descriptions in the corresponding embodiments, which will not be repeated here.

Exemplarily, the computer-readable instructions 62 may be divided into one or more units, and the one or more units are stored in the memory 61 and executed by the processor 60 to implement the present application. The one or more units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6. For example, the computer-readable instructions 62 may be divided into a first dividing unit, a feature extraction unit, a first determining unit, and a second determining unit, and the specific functions of each unit are as described above.

The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or a memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 6. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store the computer-readable instructions and other programs and data required by the terminal device. The memory 61 can also be used to temporarily store data that has been output or will be output.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments, or equivalently replace some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the scope of protection of this application.

Claims

A neural network-based speech recognition method, characterized by comprising:

Obtain the speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments;

Performing acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

In the probability calculation layer of the preset neural network model, a first probability vector of the speech segment is determined based on the feature vector of the speech segment; the value of each element in the first probability vector is used to identify the probability that the pronunciation of the speech segment is the preset phoneme corresponding to the element;

In the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence is determined based on the first probability vectors of all the speech segments.
The neural network-based speech recognition method according to claim 1, wherein the acquiring the speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments includes:

Frame the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.
The neural network-based speech recognition method according to claim 1, wherein the determining, in the probability calculation layer of the preset neural network model, the first probability vector of the speech segment based on the feature vector of the speech segment includes:

In the probability calculation layer, based on the pre-learned probabilities of the feature vectors of each preset speech segment relative to each preset phoneme, the probabilities of the feature vectors of the at least two frames of speech segments relative to each of the preset phonemes are respectively determined;

The first probability vector of the speech segment is determined based on the probability of the feature vector of the speech segment relative to each of the preset phonemes.
The neural network-based speech recognition method according to claim 1, wherein the determining, in the joint time-series classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments includes:

In the joint time-series classification layer of the preset neural network model, based on the first probability vectors of all the speech segments and a preset probability calculation formula, calculate the pronunciation phoneme probability vector of the speech sequence; the value of each element in the pronunciation phoneme probability vector is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to the element, and the preset probability calculation formula is as follows:

$$p_i = \prod_{t=1}^{T} y_{it}$$

where $p_i$ represents the value of the i-th element in the pronunciation phoneme probability vector, $i \in [1, N^T]$, T represents the total number of speech segments obtained by framing the speech sequence, N represents the total number of preset phonemes, $N^T$ represents the total number of preset pronunciation phoneme sequences of length T formed by combining at least one of the N preset phonemes, and $y_{it}$ represents the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$; the value of the prior probability corresponding to the t-th pronunciation phoneme is determined according to the first probability vector of the t-th speech segment;

Determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector is used to identify the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to the element, and the preset text sequence is obtained by compressing the preset pronunciation phoneme sequence;

Determining the preset text sequence corresponding to the element with the largest value in the probability vector of the text sequence as the text sequence corresponding to the speech sequence.
The neural network-based speech recognition method according to any one of claims 1 to 4, wherein before the acquiring the speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments, the method further includes:

Obtain a preset sample data set, and divide the sample data set into a training set and a test set; each sample data in the sample data set is composed of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to the speech sequence;

Train the pre-built original neural network model based on the training set, and determine the values of each preset parameter included in the feature extraction layer and the joint time series classification layer of the original neural network model;

Verify the trained original neural network model based on the test set;

If the verification is passed, the original neural network model that has completed training is determined as the preset neural network model.
A terminal device is characterized by comprising:

The first dividing unit is used to obtain a speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments;

A feature extraction unit, configured to perform acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

A first determining unit, used to determine the first probability vector of the speech segment based on the feature vector of the speech segment in the probability calculation layer of the preset neural network model; the value of each element in the first probability vector is used to identify the probability that the pronunciation of the speech segment is the preset phoneme corresponding to the element;

The second determining unit is used to determine the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments in the joint time-series classification layer of the preset neural network model.
The terminal device according to claim 6, wherein the first dividing unit is specifically configured to:

Frame the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.
The terminal device according to claim 6, wherein the first determining unit comprises:

A first probability determining unit, configured to determine, in the probability calculation layer, the probabilities of the feature vectors of the at least two frames of speech segments relative to each of the preset phonemes, based on the pre-learned probabilities of the feature vectors of each preset speech segment relative to each preset phoneme;

The second probability determining unit is configured to determine a first probability vector of the speech segment based on the probability of the feature vector of the speech segment relative to each of the preset phonemes.
The terminal device according to claim 6, wherein the second determining unit comprises:

A first calculation unit, used to calculate the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula in the joint time-series classification layer of the preset neural network model; the value of each element in the pronunciation phoneme probability vector is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to the element, and the preset probability calculation formula is as follows:

$$p_i = \prod_{t=1}^{T} y_{it}$$

where $p_i$ represents the value of the i-th element in the pronunciation phoneme probability vector, $i \in [1, N^T]$, T represents the total number of speech segments obtained by framing the speech sequence, N represents the total number of preset phonemes, $N^T$ represents the total number of preset pronunciation phoneme sequences of length T formed by combining at least one of the N preset phonemes, and $y_{it}$ represents the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$; the value of the prior probability corresponding to the t-th pronunciation phoneme is determined according to the first probability vector of the t-th speech segment;

A third probability determining unit, used to determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector is used to identify the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to the element, and the preset text sequence is obtained by compressing the preset pronunciation phoneme sequence;

The text sequence determining unit is configured to determine the preset text sequence corresponding to the element with the largest value in the probability vector of the text sequence as the text sequence corresponding to the speech sequence.
The terminal device according to any one of claims 6-9, further comprising:

A first obtaining unit, configured to obtain a preset sample data set and divide the sample data set into a training set and a test set; each sample data in the sample data set is composed of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to the speech sequence;

A model training unit, configured to train the pre-built original neural network model based on the training set, and determine the values of each preset parameter included in the feature extraction layer and the joint time series classification layer of the original neural network model;

A model verification unit for verifying the original neural network model that has completed training based on the test set;

The model generating unit is configured to determine the original neural network model that has completed training as the preset neural network model if the verification is passed.
A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:

Obtain the speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments;

Performing acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

In the probability calculation layer of the preset neural network model, a first probability vector of the speech segment is determined based on the feature vector of the speech segment; the value of each element in the first probability vector is used to identify the probability that the pronunciation of the speech segment is the preset phoneme corresponding to the element;

In the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence is determined based on the first probability vectors of all the speech segments.
The terminal device according to claim 11, wherein the acquiring the speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments includes:

Frame the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.
The terminal device according to claim 11, wherein the determining the first probability vector of the speech segment based on the feature vector of the speech segment in the probability calculation layer of the preset neural network model includes:

In the probability calculation layer, based on the pre-learned probabilities of the feature vectors of each preset speech segment relative to each preset phoneme, the probabilities of the feature vectors of the at least two frames of speech segments relative to each of the preset phonemes are respectively determined;

The first probability vector of the speech segment is determined based on the probability of the feature vector of the speech segment relative to each of the preset phonemes.
The terminal device according to claim 11, wherein the determining, in the joint timing classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments includes:

In the joint time-series classification layer of the preset neural network model, based on the first probability vectors of all the speech segments and a preset probability calculation formula, calculate the pronunciation phoneme probability vector of the speech sequence; the value of each element in the pronunciation phoneme probability vector is used to identify the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to the element, and the preset probability calculation formula is as follows:

$$p_i = \prod_{t=1}^{T} y_{it}$$

where $p_i$ represents the value of the i-th element in the pronunciation phoneme probability vector, $i \in [1, N^T]$, T represents the total number of speech segments obtained by framing the speech sequence, N represents the total number of preset phonemes, $N^T$ represents the total number of preset pronunciation phoneme sequences of length T formed by combining at least one of the N preset phonemes, and $y_{it}$ represents the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$; the value of the prior probability corresponding to the t-th pronunciation phoneme is determined according to the first probability vector of the t-th speech segment;

Determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector is used to identify the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to the element, and the preset text sequence is obtained by compressing the preset pronunciation phoneme sequence;

Determining the preset text sequence corresponding to the element with the largest value in the probability vector of the text sequence as the text sequence corresponding to the speech sequence.
The terminal device according to any one of claims 11-14, characterized in that before acquiring the speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments, the method further includes:

Obtain a preset sample data set, and divide the sample data set into a training set and a test set; each sample data in the sample data set is composed of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to the speech sequence;

Train the pre-built original neural network model based on the training set, and determine the values of each preset parameter included in the feature extraction layer and the joint time series classification layer of the original neural network model;

Verify the trained original neural network model based on the test set;

If the verification is passed, the original neural network model that has completed training is determined as the preset neural network model.
A non-volatile readable storage medium that stores computer-readable instructions, characterized in that, when the computer-readable instructions are executed by a processor, the following steps are implemented:

Obtain the speech sequence to be recognized, and divide the speech sequence into at least two frames of speech segments;

Performing acoustic feature extraction on the speech segment to obtain a feature vector of the speech segment;

At the probability calculation layer of the preset neural network model, determine the first probability vector of the speech segment based on the feature vector of the speech segment; the value of each element in the first probability vector identifies the probability that the pronunciation of the speech segment is the preset phoneme corresponding to that element;

In the joint time-series classification layer of the preset neural network model, a text sequence corresponding to the speech sequence is determined based on the first probability vectors of all the speech segments.
The non-volatile readable storage medium according to claim 16, wherein acquiring the speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments includes:

Frame the speech sequence based on a preset frame length and a preset frame shift amount to obtain at least two frames of speech segments; the duration of each speech segment is the preset frame length.
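The framing step can be sketched as follows. This is a minimal illustration, assuming the speech sequence is a list of samples and that trailing samples too short to fill a whole frame are dropped; the function name `frame_signal` and the 25 ms / 10 ms values are illustrative assumptions, not values taken from the claims.

```python
def frame_signal(samples, frame_length, frame_shift):
    """Split a sample sequence into fixed-length frames taken every frame_shift samples."""
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    # Keep only full frames so that every segment lasts exactly frame_length samples.
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


# Illustrative values: 25 ms frames with a 10 ms shift at a 16 kHz sampling rate.
sample_rate = 16000
frame_length = sample_rate * 25 // 1000  # 400 samples
frame_shift = sample_rate * 10 // 1000   # 160 samples
signal = [0.0] * sample_rate             # one second of silence as dummy input
frames = frame_signal(signal, frame_length, frame_shift)
```

Because the frame shift is smaller than the frame length here, adjacent segments overlap, while each segment still has the preset frame length.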
The non-volatile readable storage medium according to claim 16, wherein determining, at the probability calculation layer of the preset neural network model, the first probability vector of the speech segment based on the feature vector of the speech segment includes:

At the probability calculation layer, based on the pre-learned probabilities of the feature vector of each preset speech segment relative to each preset phoneme, determine the probability of the feature vector of each of the at least two frames of speech segments relative to each preset phoneme;

The first probability vector of the speech segment is determined based on the probability of the feature vector of the speech segment relative to each of the preset phonemes.
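One common way to turn a feature vector into a per-phoneme probability vector is a learned linear scoring followed by a softmax. The claim does not specify how the pre-learned probabilities are represented, so the sketch below is only an assumption: `weights` and `biases` stand in for the learned parameters, and `first_probability_vector` is a hypothetical name.

```python
import math


def softmax(scores):
    """Normalize arbitrary scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def first_probability_vector(feature, weights, biases):
    """Score the feature vector against each preset phoneme and normalize.

    weights has one row per preset phoneme (assumed learned during training).
    """
    scores = [
        sum(w * x for w, x in zip(row, feature)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Dummy example with three preset phonemes and a three-dimensional feature vector.
feature = [0.5, -1.2, 0.3]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probs = first_probability_vector(feature, weights, biases)
```

Each element of `probs` plays the role of the probability that the segment is pronounced as the corresponding preset phoneme.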
The non-volatile readable storage medium according to claim 16, wherein determining, at the joint time-series classification layer of the preset neural network model, the text sequence corresponding to the speech sequence based on the first probability vectors of all the speech segments includes:

In the joint time-series classification layer of the preset neural network model, calculate the pronunciation phoneme probability vector of the speech sequence based on the first probability vectors of all the speech segments and a preset probability calculation formula; the value of each element in the pronunciation phoneme probability vector identifies the probability that the pronunciation phoneme sequence corresponding to the speech sequence is the preset pronunciation phoneme sequence corresponding to that element, and the preset probability calculation formula is as follows:

$$p_i = \prod_{t=1}^{T} y_{it}$$

where $p_i$ represents the value of the i-th element of the pronunciation phoneme probability vector, $i \in [1, N^T]$; T represents the total number of speech segments obtained by framing the speech sequence; N represents the total number of preset phonemes; $N^T$ represents the total number of preset pronunciation phoneme sequences of length T formed by combining at least one of the N preset phonemes; and $y_{it}$ represents the prior probability corresponding to the t-th pronunciation phoneme contained in the i-th preset pronunciation phoneme sequence, $t \in [1, T]$, the value of which is determined according to the first probability vector of the t-th speech segment;

Determine the text sequence probability vector of the speech sequence based on the pronunciation phoneme probability vector of the speech sequence; the value of each element in the text sequence probability vector identifies the probability that the text sequence corresponding to the speech sequence is the preset text sequence corresponding to that element, and the preset text sequence is obtained by compressing the preset pronunciation phoneme sequence;

Determine the preset text sequence corresponding to the element with the largest value in the text sequence probability vector as the text sequence corresponding to the speech sequence.
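The three steps of this claim can be traced on a toy input with a brute-force Python sketch. It enumerates all phoneme sequences of length T, multiplies the per-frame probabilities, compresses each sequence, sums the probabilities of sequences that compress to the same result, and picks the maximum. The compression rule used here (merge consecutive repeats, then drop a blank symbol `"_"`) is the usual connectionist temporal classification convention and is an assumption, since the claim only says the sequence is "compressed"; the function names are also hypothetical. Exhaustive enumeration is exponential in T and is shown only for clarity.

```python
from itertools import product


def compress(path, blank="_"):
    """Merge consecutive repeated phonemes, then drop blanks (assumed CTC rule)."""
    out = []
    prev = None
    for phoneme in path:
        if phoneme != prev and phoneme != blank:
            out.append(phoneme)
        prev = phoneme
    return tuple(out)


def decode(first_prob_vectors, phonemes, blank="_"):
    """Return the most probable compressed sequence and all sequence probabilities."""
    num_frames = len(first_prob_vectors)
    text_probs = {}
    # Enumerate every preset phoneme sequence of length T (N**T of them).
    for index_path in product(range(len(phonemes)), repeat=num_frames):
        path_prob = 1.0
        for t, k in enumerate(index_path):
            path_prob *= first_prob_vectors[t][k]  # product over frames
        text = compress([phonemes[k] for k in index_path], blank)
        text_probs[text] = text_probs.get(text, 0.0) + path_prob
    best = max(text_probs, key=text_probs.get)
    return best, text_probs


# Two frames, three preset phonemes (the first one is the assumed blank).
phonemes = ["_", "a", "b"]
first_prob_vectors = [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
best, text_probs = decode(first_prob_vectors, phonemes)
```

In this toy case the sequences `_a`, `a_` and `aa` all compress to `a`, so their probabilities 0.05, 0.12 and 0.30 add up to 0.47, which is the largest entry. Practical systems avoid the enumeration with dynamic programming or beam search.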
The non-volatile readable storage medium according to any one of claims 16 to 19, wherein, before acquiring the speech sequence to be recognized and dividing the speech sequence into at least two frames of speech segments, the steps further include:

Obtain a preset sample data set, and divide the sample data set into a training set and a test set; each sample in the sample data set consists of the feature vectors of all the speech segments obtained by framing a speech sequence and the text sequence corresponding to that speech sequence;

Train the pre-built original neural network model based on the training set, and determine the value of each preset parameter contained in the feature extraction layer and the joint time-series classification layer of the original neural network model;

Verify the trained original neural network model based on the test set;

If the verification is passed, the original neural network model that has completed training is determined as the preset neural network model.