CN114299937A - DNN model training method and voice recognition method and device

DNN model training method and voice recognition method and device

Info

Publication number
CN114299937A
Authority
CN
China
Prior art keywords
loss function
dnn model
training
label
peak
Prior art date
Legal status
Granted
Application number
CN202111664614.6A
Other languages
Chinese (zh)
Other versions
CN114299937B (en)
Inventor
于丽美
梁微
Current Assignee
Mgjia Beijing Technology Co ltd
Original Assignee
Mgjia Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Mgjia Beijing Technology Co ltd
Priority to CN202111664614.6A
Publication of CN114299937A
Application granted
Publication of CN114299937B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a DNN model training method, a voice recognition method and a device, wherein the DNN model training method comprises the following steps: acquiring voice characteristic data and a label sequence corresponding to a training audio sample; inputting the voice characteristic data into a DNN model for label prediction, and determining label prediction paths; calculating a model loss function value of the DNN model; calculating a peak loss function value of the DNN model; determining a total loss function of the DNN model; and training the DNN model based on the total loss function until the DNN model meets a preset training requirement. The trained DNN model is obtained by performing convergence training on the total loss function, which combines the model loss function and the peak loss function of the DNN model; by solving the alignment delay between the audio sequence and the label sequence in the voice recognition process, the decoding speed and the recognition speed are greatly improved.

Description

DNN model training method and voice recognition method and device
Technical Field
The invention relates to the technical field of voice recognition, in particular to a DNN model training method, a voice recognition method and a device.
Background
With the continuous development of artificial intelligence, speech recognition is becoming one of the indispensable functions of various intelligent devices. To ensure the accuracy of speech recognition, the input sound sequence usually needs to be aligned with the output tag sequence in length, but this alignment process requires a lot of time and labor. To solve this problem, the CTC (Connectionist Temporal Classification) algorithm was proposed: based on a Deep Neural Network (DNN) model, the tag corresponding to each frame of audio can be calculated by the CTC algorithm, and the blank tags corresponding to silent frames are removed at output time to obtain the final tag sequence, so that the audio sequence and the tag sequence are aligned directly. However, because the CTC algorithm assigns blank tags to silent frames, during the training of a DNN model that includes the CTC algorithm there is more than one path from the start tag to the end tag, and the number and positions of the blank tags differ between tag paths. The CTC algorithm does not constrain the position at which the tag with the highest probability appears, so the position of the probability peak (Peak) is not fixed and does not necessarily appear at the corresponding frame; this causes alignment delay, which in turn leads to the problems of slow decoding speed and slow recognition speed.
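For illustration only (this sketch is not part of the patented method), the CTC collapse rule just described — merge repeated tags, then remove blank tags <b> — can be written in a few lines of Python; it also shows why several frame-level paths map to the same final tag sequence:

```python
# Illustrative sketch of the CTC collapse mapping: merge consecutive
# duplicate tags, then drop blank tags ("<b>").
def ctc_collapse(path, blank="<b>"):
    collapsed = []
    previous = None
    for label in path:
        if label != previous:          # merge repeated tags
            collapsed.append(label)
        previous = label
    return [label for label in collapsed if label != blank]  # drop blanks

# Two different frame-level paths collapse to the same tag sequence:
print(ctc_collapse(["<b>", "t", "t", "<b>", "o", "o", "<b>"]))     # ['t', 'o']
print(ctc_collapse(["t", "<b>", "<b>", "o", "<b>", "<b>", "<b>"]))  # ['t', 'o']
```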
Disclosure of Invention
Therefore, the technical problem to be solved by the present invention is to overcome the defects of slow decoding speed and slow recognition speed caused by alignment delay of the trained DNN model in the speech recognition process in the prior art, thereby providing a DNN model training method, a speech recognition method and a speech recognition device.
According to a first aspect, an embodiment of the present invention provides a DNN model training method, including:
acquiring voice characteristic data and a label sequence corresponding to a training audio sample;
inputting the voice characteristic data into a DNN model for label prediction, and determining a label prediction path;
calculating a model loss function value for the DNN model based on the probability of the label prediction path;
respectively calculating the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point thereof based on the label prediction path;
calculating a peak loss function value of the DNN model based on a distance between a position of each audio frame in the training audio sample and a corresponding tag sequence center point thereof;
determining a total loss function for the DNN model based on the model loss function values and the peak loss function values;
training the DNN model based on the total loss function until the DNN model meets preset training requirements.
Optionally, the calculating a model loss function value for the DNN model based on the probability of the label predicted path comprises:
obtaining the probability of each label prediction path corresponding to the current audio frame;
determining the probability of the total path corresponding to the current audio frame based on the probability of each label prediction path;
calculating a model loss function value of the DNN model based on the probability of the total path corresponding to each audio frame.
Optionally, the calculating, based on the label prediction path, a distance between a position of each audio frame in the training audio sample and a center point of a corresponding label sequence thereof includes:
calculating a peak value of the current audio frame based on the voice characteristic data of the current audio frame and acquiring the position of the peak value;
calculating the position of the central point of the label sequence corresponding to the current audio frame;
and calculating the distance between the peak position corresponding to the current audio frame and the central point of the label sequence corresponding to the peak position based on the position of the peak and the central point of the label sequence.
Optionally, the calculating a peak loss function value of the DNN model based on a distance between a position of each audio frame in the training audio sample and a center point of a corresponding tag sequence thereof includes:
calculating the average distance of each label path based on the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point;
calculating the average value of distance loss of all paths based on the average distance of each label path;
determining a peak loss function value for the DNN model based on an average of all path distance losses.
Optionally, the determining a total loss function for the DNN model based on the peak loss function value and the model loss function value comprises:
respectively setting loss influence parameters of the peak loss function and the model loss function;
determining a total loss function for the DNN model based on the loss impact parameter, the peak loss function value, and the model loss function value.
Optionally, the method further comprises:
calculating the probability of outputting non-empty labels under the condition of empty labels based on the relation between the labels corresponding to the first voice features corresponding to each audio frame in the voice feature data in the label sequence and the predicted labels for performing label prediction by the DNN model;
calculating a cross-entropy loss for the DNN model based on a probability of outputting a non-null label in the case of a null label.
Optionally, the determining a total loss function for the DNN model based on the peak loss function value and the model loss function value further comprises:
setting a cross entropy loss function, the peak loss function and a loss influence parameter of the DNN model;
determining a total loss function for the DNN model based on the loss impact parameters, a cross-entropy loss function value for the DNN model, the peak loss function value, and the model loss function value.
According to a second aspect, an embodiment of the present invention provides a speech recognition method, including:
acquiring audio data to be identified;
performing DNN model training by using the DNN model training method as described in the first aspect, or any one of the optional embodiments of the first aspect, to obtain a trained DNN model;
and inputting the audio data to be recognized into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized.
According to a third aspect, an embodiment of the present invention provides a DNN model training apparatus, including:
the first acquisition module is used for acquiring the voice characteristic data and the label sequence corresponding to the training audio sample;
the first processing module is used for inputting the voice characteristic data into a DNN model for label prediction and determining a label prediction path;
a second processing module to calculate a model loss function value for the DNN model based on the probability of the label prediction path;
the third processing module is used for respectively calculating the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point thereof based on the label prediction path;
the fourth processing module is used for calculating a peak loss function value of the DNN model based on the distance between the position of each audio frame in the training audio sample and the center point of the corresponding label sequence;
a calculation module to determine a total loss function for the DNN model based on the model loss function values and the peak loss function values;
and the training module is used for training the DNN model based on the total loss function until the DNN model meets the preset training requirement.
According to a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
the second acquisition module is used for acquiring the audio data to be identified;
an execution module, configured to perform DNN model training by using the DNN model training apparatus according to the third aspect, to obtain a trained DNN model;
and the recognition module is used for inputting the audio data to be recognized into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized.
According to a fifth aspect, an embodiment of the present invention provides an electronic device, including:
a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the method of the first or second aspect, or of any optional implementation manner of the first or second aspect.
According to a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a computer to perform the method of the first or second aspect, or of any optional implementation manner of the first or second aspect.
The technical scheme of the invention has the following advantages:
according to the DNN model training method and device provided by the embodiment of the invention, the voice characteristic data and the label sequence corresponding to the training audio sample are obtained; inputting the voice characteristic data into a DNN model for label prediction, and determining a label prediction path; calculating a model loss function value for the DNN model based on the probability of the label prediction path; respectively calculating the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point thereof based on the label prediction path; calculating a peak loss function value of the DNN model based on a distance between a position of each audio frame in the training audio sample and a corresponding tag sequence center point thereof; determining a total loss function for the DNN model based on the model loss function values and the peak loss function values; training the DNN model based on the total loss function until the DNN model meets preset training requirements. The trained DNN model is finally obtained by carrying out convergence training on the model loss function, the peak loss function and the total loss function obtained by combining of the DNN model, and the decoding speed and the recognition speed are greatly improved by solving the problem of alignment delay of an audio sequence and a tag sequence in the voice recognition process.
According to the voice recognition method and device provided by the embodiment of the invention, the audio data to be recognized is acquired; DNN model training is carried out by adopting the DNN model training method provided by another embodiment of the invention to obtain a trained DNN model; and the audio data to be recognized is input into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized. The trained DNN model is obtained by performing convergence training on the total loss function, which combines the model loss function and the peak loss function of the DNN model; by solving the alignment delay between the audio sequence and the label sequence in the voice recognition process, the decoding speed and the recognition speed are greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a DNN model training method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of path computation in a DNN model training method according to an embodiment of the present invention;
FIG. 3 is an exemplary diagram of an audio frame probability peak of a DNN model training method according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of an audio tag of a DNN model training method according to an embodiment of the present invention;
FIG. 5 is a training model structure diagram of a DNN model training method according to an embodiment of the present invention;
FIG. 6 is an exemplary diagram of audio frame alignment of a DNN model training method according to an embodiment of the present invention;
FIG. 7 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a DNN model training apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; the two elements may be directly connected or indirectly connected through an intermediate medium, or may be communicated with each other inside the two elements, or may be wirelessly connected or wired connected. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
An embodiment of the present invention provides a DNN model training method, and as shown in fig. 1, the DNN model training method specifically includes the following steps:
step S101: and acquiring voice characteristic data and a label sequence corresponding to the training audio sample.
Specifically, in practical application, the FILTERBANK features extracted from the audio data are first used as the input of the whole training model, where the FILTERBANK features are obtained by pre-emphasis, framing, windowing and other computations on the audio data; the 75-dimensional FILTERBANK features are then processed by the DNN model to obtain the deep speech features. The practical situation is not limited to this: changing the feature type or the audio feature dimension in order to obtain accurate deep speech features also falls within the protection scope of the DNN model training method provided by the present invention.
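As a hedged illustration of such a front end (not taken from the patent): assuming a library such as torchaudio is available, 75-dimensional filter-bank features with pre-emphasis, framing and windowing could be computed as below. The file name and the 25 ms / 10 ms framing are assumptions; only the 75-dimensional feature size comes from the text.

```python
# Sketch of a 75-dimensional FILTERBANK front end, assuming torchaudio.
import torchaudio

waveform, sample_rate = torchaudio.load("sample.wav")  # hypothetical input file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=75,               # 75-dimensional features, as in the text
    frame_length=25.0,             # assumed 25 ms frames
    frame_shift=10.0,              # 10 ms shift matches the 10 ms/frame delay arithmetic below
    preemphasis_coefficient=0.97,  # pre-emphasis step
)
print(fbank.shape)  # (num_frames, 75)
```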
Step S102: and inputting the voice characteristic data into a DNN model for label prediction, and determining a label prediction path.
Specifically, in practical application, the DNN model of the embodiment of the present invention includes the CTC algorithm, and performs tag prediction and determines tag prediction paths based on the CTC algorithm. As an end-to-end model, the CTC algorithm can align an audio sequence and a tag sequence directly, but it has some defects, so it needs to be optimized during DNN model training. There are two common problems with the CTC algorithm: (1) training diverges easily; (2) there is alignment delay. Both problems arise because, during the CTC operation, a blank tag, denoted <b>, is assigned to each meaningless frame. Blank tags carry no meaning, but since they occupy a very large proportion of an audio sequence, optimizing with the CTC loss alone would cause the model to learn too many blank tags and the DNN model training to diverge. In addition, in the computation process based on the CTC algorithm, there may be more than one path from the start tag to the end tag; for example, for the sequence "exit navigation", the path may be "<b> exit <b> <b> navigation <b> <b>" or "<b> <b> exit <b> navigation <b>", where the number and positions of the blank tags differ, so there is more than one tag prediction path.
Step S103: model loss function values for the DNN model are calculated based on the probabilities of the label prediction paths. In practical application, the data characteristics input by the DNN model are calculated through a CTC algorithm, the label corresponding to each frame of audio is predicted, and the model loss function can be expressed as CTC loss.
Specifically, in an embodiment, the step S103 specifically includes the following steps:
step S301: and obtaining the probability of each label prediction path corresponding to the current audio frame.
Step S302: and determining the probability of the total path corresponding to the current audio frame based on the probability of each label prediction path.
Step S303: and calculating a model loss function value of the DNN model based on the probability of the total path corresponding to each audio frame.
Specifically, in practical applications, the calculation process of the model loss function (CTC loss) value of the DNN model is as shown in equations (1) to (3).
p(π|x) = ∏_{t=1}^{T} y^t_{π_t}, π ∈ L'^T (1)

P(l|x) = Σ_{π ∈ B^{-1}(l)} p(π|x) (2)

Lctc(x) = -log(P(l|X)) (3)
Firstly, label prediction is performed on each frame of audio; the calculation process is shown in formula (1), where p(π|x) denotes the probability of obtaining the label sequence π for an input audio sequence x. The label sequence π is composed of the output y_t for the audio frame at each time t, and the probability of one label path is the product of the label probabilities of each frame, where L' is the set of all labels including the empty label. p(π|x) is the probability of a single label path; the probability of all paths is the accumulation of the probabilities of the individual label paths, calculated as in formula (2), where π ranges over the set B^{-1}(l) of paths from which removing the blanks yields l. Finally, taking the negative logarithm gives the final result of the CTC loss, as in formula (3).
Illustratively, as shown in fig. 2, for each frame of the audio sequence the CTC algorithm first predicts its label; the figure shows two different CTC paths. The first path is <b> <b> <t-w-ey4> <b> <ch-u1> <ch-u1> <b> <d-aw3> <h-a2ng>, and the second path is <b> <t-w-ey4> <b> <ch-u1> <b> <d-aw3> <b> <h-a2ng> <h-a2ng> <b>. The probability of one path is the product of the probabilities of all the labels in the path, and if there are only these two paths, the final probability is the sum of the probabilities of the two paths. To ensure the accuracy of the final speech recognition, this probability needs to be maximized; since the CTC loss function takes the negative logarithm, the loss is minimized during training. In addition, the third path in the figure is a wrong path that the CTC algorithm will not take, because it has three error points: (1) a label in the path moves up and to the right, contrary to the rule that paths only move down and to the right; (2) it skips over a non-empty label, contrary to the rule that non-empty labels cannot be skipped; (3) the end of the path is a repeated pronunciation label, contrary to the rule that the end must be the last empty label or pronunciation label.
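Formulas (1) to (3) can be checked by brute force on a toy example (illustrative probabilities, not from the patent): enumerate every frame-level path in L'^T, keep those that collapse to the target sequence (the set B^{-1}(l)), multiply per-frame label probabilities along each path, sum over paths, and take the negative logarithm. This is only feasible for very short sequences:

```python
import itertools
import math

def collapse(path, blank="<b>"):
    # merge consecutive duplicates, then drop blanks
    merged = [l for i, l in enumerate(path) if i == 0 or l != path[i - 1]]
    return [l for l in merged if l != blank]

def ctc_loss_brute_force(frame_probs, labels, target, blank="<b>"):
    total = 0.0
    T = len(frame_probs)
    for path in itertools.product(labels, repeat=T):   # all paths in L'^T
        if collapse(list(path), blank) == target:      # pi in B^{-1}(l)
            p = 1.0
            for t, label in enumerate(path):           # formula (1): product over frames
                p *= frame_probs[t][label]
            total += p                                 # formula (2): sum over paths
    return -math.log(total)                            # formula (3)

labels = ["<b>", "t", "o"]
frame_probs = [{"<b>": 0.6, "t": 0.3, "o": 0.1},
               {"<b>": 0.2, "t": 0.7, "o": 0.1},
               {"<b>": 0.3, "t": 0.1, "o": 0.6}]
print(ctc_loss_brute_force(frame_probs, labels, ["t", "o"]))
```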
Step S104: and respectively calculating the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point based on the label prediction path.
Specifically, in practical application, while reducing the divergence of the CTC algorithm, the alignment delay of the CTC algorithm also needs to be handled. For example, as shown in fig. 3, the pinyin sequence corresponding to "exit navigation" is "t-w-ey4 ch-u1 d-aw3 h-a2ng": the curves numbered 1 to 4 correspond to "t-w-ey4", "ch-u1", "d-aw3" and "h-a2ng", number 5 is the blank label, and number 6 is the mute frame. The positions of the audio frames corresponding to each pinyin label are the corresponding numbers on the coordinate axis, so the positions of the audio frames and of the probability peaks corresponding to the pinyin labels can be observed directly from the figure. For example, the pinyin "t-w-ey4" corresponds to the 9th to 13th frames, meaning that the labels of audio frames 9 to 13 are all "t-w-ey4"; the alignment interval (align interval) of t-w-ey4 is therefore [9,13]. During DNN model training, for the purpose of probability maximization, the CTC algorithm searches for a probability peak; as can be seen from fig. 3, the probability peak of t-w-ey4 appears at the 17th frame, which differs from the start position of the align interval, the 9th frame, by 8 frames. With 10 ms of audio per frame, the DNN delay is then 80 ms, which eventually results in a slow decoding speed. By controlling the position of the peak to lie within the align interval, the slow decoding caused by alignment delay can be effectively alleviated.
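The delay arithmetic of this example can be made explicit with a small helper (a sketch assuming, as in the text, 10 ms of audio per frame and delay measured from the start of the align interval):

```python
# Delay in milliseconds between a probability peak and its align interval.
def alignment_delay_ms(peak_frame, align_start, align_end, frame_ms=10):
    if align_start <= peak_frame <= align_end:
        return 0  # peak already inside the align interval
    return abs(peak_frame - align_start) * frame_ms  # measured from the interval start

print(alignment_delay_ms(17, 9, 13))  # 80 ms, matching the t-w-ey4 example
```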
Specifically, in an embodiment, the step S104 specifically includes the following steps:
step S401: and calculating the peak value of the current audio frame based on the voice characteristic data of the current audio frame and acquiring the position of the peak value.
Step S402: and calculating the position of the central point of the label sequence corresponding to the current audio frame.
Step S403: and calculating the distance between the peak position corresponding to the current audio frame and the central point of the corresponding label sequence based on the position of the peak and the central point position of the label sequence.
Specifically, in practical applications, an audio sequence contains many frames. Fig. 4 exemplarily shows a 19-frame audio sequence: each frame of audio has a corresponding label, and the unit of the label may be a character, a pinyin, a phoneme, and so on. The unit used in this patent is the pinyin of a Chinese character, with 3015 pinyins in total, so the label sequence is a per-frame sequence of pinyin indices (here beginning, for example, <b>, <b>, <b>, 2387, 2387, 2387, 2387, 2387, …). As can be seen, the audio of the 1st to 3rd frames corresponds to the empty label, and the data of the 4th to 8th frames corresponds to the pinyin represented by 2387. The so-called align interval is the interval formed by the audio frame positions corresponding to each label; the alignment interval of the label 2387 of the audio in the figure is [4,8]. Controlling the Peak to be in the audio align interval means training the model so that the peak of the model's probability distribution appears within the align interval as much as possible.
Specifically, in practical application, the distance between the peak position corresponding to the current audio frame and the center point of the corresponding tag sequence is calculated according to formula (4), where the formula is as follows:
dist = position(x_t) - position(a_t^m) (4)
After the peak value of the audio frame at time t is calculated, the position of the current audio frame is taken as the position of the probability peak and recorded as position(x_t), where x_t represents the audio features at time t; then the position position(a_t^m) of the center point of the align label sequence corresponding to the current time is calculated, where a_t^m represents the center point m of the align label a corresponding to the audio frame at time t. The distance difference is then obtained by formula (4). Adding this distance loss better avoids the problem of model training divergence; at the same time, the distance loss drives the model to maximize the probability of the label paths with low alignment delay, effectively alleviating the slow decoding speed and slow recognition speed of the DNN model caused by the alignment delay of the CTC.
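A minimal sketch of formula (4); the names below are illustrative, and in practice the peak positions and align intervals come from the CTC alignment:

```python
import numpy as np

def peak_distances(peak_positions, align_intervals):
    """peak_positions[t]: frame index taken as position(x_t);
    align_intervals[t]: (start, end) align interval of frame t's label."""
    dists = []
    for pos, (start, end) in zip(peak_positions, align_intervals):
        center = (start + end) / 2.0  # position(a_t^m), the interval midpoint
        dists.append(pos - center)    # formula (4)
    return np.array(dists)

# e.g. label 2387 spans frames 4-8, so its center is frame 6
print(peak_distances([7], [(4, 8)]))  # [1.]
```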
Step S105: and calculating a peak loss function value of the DNN model based on the distance between the position of each audio frame in the training audio sample and the center point of the corresponding label sequence.
Specifically, in an embodiment, steps S104 and S105 perform the Peak loss calculation for the DNN training model: a peak value is first calculated from the obtained CTC algorithm path, then the distance between the peak position and the center point of the tag sequence corresponding to the current audio frame is calculated, and the Peak loss is computed from this distance. In addition, since the aligned sequence paths calculated by the CTC differ, the optimal path may contain many empty tags between two non-empty tags; empty tags do not correspond to any content and therefore cannot be decoded into characters in the decoding stage, which lengthens the decoding time. In the streaming recognition process, a long decoding time results in a long recognition time. For example, in a vehicle-mounted voice system, when a user issues a command to play songs, the CTC alignment delay may cause the DNN model to execute the command one or two seconds after it is issued, so the user waits too long and the experience is poor. Shortening the alignment delay by training the model with the Peak loss effectively addresses this problem of poor user experience caused by alignment delay.
Specifically, in an embodiment, the step S105 specifically includes the following steps:
step S501: and calculating the average distance of each label path based on the distance between the position of each audio frame in the training audio sample and the center point of the corresponding label sequence.
Step S502: based on the average distance of each label path, the average of all path distance losses is calculated.
Step S503: the peak loss function value of the DNN model is determined based on the average of all path distance losses.
Specifically, in practical applications, the peak loss value is calculated by the following equations (4) to (6), and the equations (5) and (6) are as follows:
dist_avg^(n) = (1/T) Σ_{t=1}^{T} dist_t (5)

Lpeak = (1/N) Σ_{n=1}^{N} dist_avg^(n) (6)
where dist is the distance difference calculated by formula (4); the loss of each label path is represented by the average distance calculated by formula (5), where T represents the total number of frames of the current audio sequence; finally, the average of the distance losses of all paths is taken as the calculation result of the distance loss, as in formula (6), where N represents the number of label paths. Since computing all paths would be too expensive, the embodiment of the present invention takes the three paths with the largest probability, i.e., N = 3; the practical situation is not limited to this, and changing the number of paths to reduce the loss of the training model also falls within the protection scope of the DNN model training method provided by the present invention. Compared with directly limiting the delay time to delete candidate CTC paths, joint training with the added distance loss (Peak loss) is softer: it reduces the alignment delay while ensuring accuracy, and the loss lets the training model automatically learn the critical point between recognition accuracy and delay time.
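A sketch of formulas (5) and (6) over the N = 3 most probable paths; taking the absolute value of each per-frame distance is an assumption, since the original formula images are not reproduced here:

```python
import numpy as np

def peak_loss(per_path_distances):
    """per_path_distances: N arrays, one formula-(4) distance per frame of each path."""
    path_losses = [np.mean(np.abs(d)) for d in per_path_distances]  # formula (5)
    return float(np.mean(path_losses))                              # formula (6)

# three most probable paths of a toy 3-frame audio sequence
top3 = [np.array([8.0, 6.0, 5.0]),
        np.array([7.0, 6.0, 4.0]),
        np.array([9.0, 5.0, 5.0])]
print(peak_loss(top3))  # ~6.11
```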
Step S106: based on the model loss function value and the peak loss function value, the total loss function of the DNN model is determined. Adding the peak loss function value into the total loss function greatly reduces the delay of the DNN model, solves the alignment delay between the audio sequence and the label sequence in the speech recognition process, and improves the decoding speed and the recognition speed.
Specifically, in an embodiment, in order to further improve the convergence speed of the DNN model, the DNN model training method provided in the embodiment of the present invention further includes: calculating the probability of outputting non-empty labels under the condition of empty labels based on the relation between the labels corresponding to the first voice features corresponding to each audio frame in the voice feature data in the label sequence and the predicted labels for performing label prediction by the DNN model; based on the probability of outputting a non-null label in the case of a null label, the cross-entropy loss of the DNN model is calculated.
In practical application, in order to avoid divergence of a training model, the embodiment of the present invention calculates a Cross Entropy loss function (CE loss for short) value of a DNN model through an output of the DNN network model and an align label.
Specifically, the formula for calculating CE loss is as follows:
p(y_i = k|X) = softmax of the DNN output for frame x_i over the K label categories (7)

Lce = -Σ_{i=1}^{j} Σ_{k=1}^{K} 1(f_i = k) · log p(y_i = k|X) (8)
where x_i represents the i-th frame of the input data, j represents the total number of frames, y_i represents the label corresponding to the i-th frame of the input data features, p(y_1|X) represents the probability of the empty label, f_i represents the frame-level non-empty target label, p(y_i|X) represents the probability of outputting a non-empty label in the case of an empty label, and K represents the number of label categories.
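A hedged sketch consistent with these variable definitions — the exact formulas (7) and (8) appear only as images in the original, so this is the standard frame-level cross entropy implied by the description:

```python
import numpy as np

def ce_loss(logits, align_labels):
    """logits: (j, K) DNN outputs for j frames; align_labels: (j,) target label ids f_i."""
    z = logits - logits.max(axis=1, keepdims=True)              # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # p(y_i = k | X), formula (7)
    j = len(align_labels)
    return -np.mean(np.log(probs[np.arange(j), align_labels]))  # formula (8)

logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.8, 0.5]])
print(ce_loss(logits, np.array([0, 1])))
```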
Further, the step S106 includes: setting a cross entropy loss function, a peak loss function and a loss influence parameter of a model loss function of the DNN model; a total loss function of the DNN model is determined based on the loss impact parameters, the cross entropy loss function values, the peak loss function values, and the model loss function values of the DNN model.
In practical application, since the three loss results combine linearly, in the joint training the CTC loss, the CE loss and the Peak loss are linearly summed through parameters, and the result of the linear summation is used as the loss calculated by the final model. Specifically, the formula of the total loss function of the DNN model is as follows:
Loss=Lctc+Lce+βLpeak (9)
where Loss represents the total loss function; Lctc represents the model loss function value; Lpeak represents the peak loss function value; Lce represents the cross entropy loss function value; and β represents the loss influence parameter.
Specifically, in practical application, β is a parameter that controls the degree of influence of the distance loss on the model, with value range [0, 1]; the larger β is, the greater the influence of the corresponding loss on the training model. In the embodiment of the invention, for better convergence of the model and to avoid the increase in error rate caused by reducing the alignment delay, Lce and Lctc are trained with the same weight while Lpeak takes a relatively small proportion, so the value β = 0.5 is used. The practical situation is not limited to this: changing the loss influence parameter value to ensure better convergence of the training model and avoid errors caused by alignment delay also falls within the protection scope of the DNN model training method provided by the present invention.
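Putting the pieces together, formula (9) with β = 0.5 could look as follows in PyTorch; this is a sketch, and the peak-loss term stands in for the custom distance loss above — it is not a built-in of the library:

```python
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)  # L_ctc
ce_criterion = nn.CrossEntropyLoss()                     # L_ce
beta = 0.5                                               # loss influence parameter

def total_loss(log_probs, targets, input_lens, target_lens,
               frame_logits, align_labels, peak_loss_value):
    """log_probs: (T, batch, K) log-softmax outputs; frame_logits: (frames, K)."""
    l_ctc = ctc_criterion(log_probs, targets, input_lens, target_lens)
    l_ce = ce_criterion(frame_logits, align_labels)
    return l_ctc + l_ce + beta * peak_loss_value         # formula (9)
```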
In practical application, the embodiment of the present invention adopts a training method that combines the CTC loss with the CE loss and the Peak loss: the Peak loss controls the probability distribution learned by the model so that the position where the peak appears falls within the align interval as much as possible, reducing the CTC alignment delay while ensuring accuracy. Joint training with the Peak loss, CE loss and CTC loss also ultimately prevents the DNN model training from diverging.
Step S107: and training the DNN model based on the total loss function until the DNN model meets the preset training requirement.
If only the CTC loss is optimized when training the speech recognition model, divergence easily occurs because the proportion of empty labels in an audio sequence is large: the model trains for a long time without converging, no good training result is obtained, and much time is wasted. Assisting the CTC loss with the Peak loss and CE loss during training allows the DNN model to converge as soon as possible and reduces the time required for training.
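A minimal training-loop sketch around the joint loss; modeling the "preset training requirement" as a loss threshold is an assumption, since the text does not fix a concrete stopping criterion:

```python
def train(model, optimizer, batches, joint_loss_fn, target_loss=0.1, max_epochs=50):
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in batches:
            optimizer.zero_grad()
            loss = joint_loss_fn(model, batch)  # formula (9): CTC + CE + beta * Peak
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(batches) < target_loss:
            break  # preset training requirement met (assumed criterion)
    return model
```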
By executing the above steps, the DNN model training method provided by the embodiment of the present invention acquires the voice characteristic data and the label sequence corresponding to the training audio sample; inputs the voice characteristic data into a DNN model for label prediction and determines label prediction paths; calculates a model loss function value of the DNN model based on the probabilities of the label prediction paths; calculates, based on the label prediction paths, the distance between the position of each audio frame in the training audio sample and its corresponding label sequence center point; calculates a peak loss function value of the DNN model based on these distances; determines a total loss function of the DNN model based on the model loss function value and the peak loss function value; and trains the DNN model based on the total loss function until the DNN model meets the preset training requirement. The trained DNN model is obtained by performing convergence training on the total loss function, which combines the model loss function and the peak loss function of the DNN model; by solving the alignment delay between the audio sequence and the label sequence in the voice recognition process, the decoding speed and the recognition speed are greatly improved.
A detailed description will be given below of a DNN model training method provided in an embodiment of the present invention with reference to a specific application example.
With reference to fig. 3 and 5, when the Peak position is not constrained, there is a significant delay in DNN decoding. The curve numbered 1 corresponds to the align label t-w-ey4, whose align interval is the segment numbered 1 on the horizontal axis, i.e., [9,13]; in the DNN decoding process, however, the probability peak of t-w-ey4 appears at the 17th frame, clearly outside the align interval [9,13]. The distance to the interval start point is 8 frames, and with 10 ms of audio per frame the decoding delay is 80 ms. In the model training process of the embodiment of the present invention, the distance between the Peak and the midpoint of the align interval is calculated; when the difference is 6, the Peak loss of that audio frame is 6, which completes the loss calculation for one frame. An audio path contains many frames, so the total loss of each path is the sum of the Peak losses of all frames divided by the number of frames; the loss of all paths is the sum of the losses of the n paths, and finally the average over the number of paths gives the final loss. However, in the process of pulling the peak toward the midpoint of the align interval, the label corresponding to the peak may change: as shown in fig. 5 and fig. 6, when the label ch-u1 is forced into its align interval, the probability of ch-u1 becomes smaller than the probability of d-aw3; the peak then changes, causing decoding errors, so reducing the delay can increase the error rate of the model.
An embodiment of the present invention provides a speech recognition method, as shown in fig. 7, the speech recognition method specifically includes the following steps:
step S201: and acquiring audio data to be identified. In practical application, after audio data to be recognized are obtained, the audio data need to be processed, the FILTERBANK characteristics are obtained by performing computing processing such as pre-emphasis, framing and windowing on the audio data, the FILTERBANK characteristics are 75-dimensional, and computing is performed based on a DNN (deep speech model) to obtain deep speech characteristics of the audio data to be recognized.
Step S202: and (3) carrying out DNN model training by adopting the DNN model training method to obtain the trained DNN model.
Step S203: and inputting the audio data to be recognized into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized.
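For illustration, the recognition step could be a greedy CTC decode of the trained model's per-frame posteriors; treating label id 0 as the blank is an assumption:

```python
import numpy as np

BLANK_ID = 0  # assumed id of the blank label <b>

def greedy_ctc_decode(posteriors):
    """posteriors: (T, K) per-frame label probabilities from the trained DNN."""
    best = posteriors.argmax(axis=1)  # most probable label per frame
    output, previous = [], None
    for label in best:
        if label != previous and label != BLANK_ID:  # merge repeats, drop blanks
            output.append(int(label))
        previous = label
    return output
```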
By executing the above steps, the voice recognition method provided by the embodiment of the invention acquires the audio data to be recognized; carries out DNN model training by adopting the DNN model training method provided by another embodiment of the invention to obtain a trained DNN model; and inputs the audio data to be recognized into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized. The trained DNN model is obtained by performing convergence training on the total loss function, which combines the model loss function and the peak loss function of the DNN model; by solving the alignment delay between the audio sequence and the label sequence in the voice recognition process, the decoding speed and the recognition speed are greatly improved.
An embodiment of the present invention provides a DNN model training apparatus, as shown in fig. 8, the DNN model training apparatus includes:
the first obtaining module 101 is configured to obtain speech feature data and a tag sequence corresponding to a training audio sample. For details, refer to the related description of step S101 in the above method embodiment, and no further description is provided here.
The first processing module 102 is configured to input the speech feature data into a DNN model for tag prediction, and determine a tag prediction path. For details, refer to the related description of step S102 in the above method embodiment, and no further description is provided here.
A second processing module 103, configured to calculate a model loss function value of the DNN model based on the probability of the label prediction path. For details, refer to the related description of step S103 in the above method embodiment, and no further description is provided here.
And the third processing module 104 is configured to calculate, based on the label prediction path, a distance between a position of each audio frame in the training audio sample and a center point of a corresponding label sequence thereof, respectively. For details, refer to the related description of step S104 in the above method embodiment, and no further description is provided here.
The fourth processing module 105 is configured to calculate a peak loss function value of the DNN model based on a distance between a position of each audio frame in the training audio sample and a center point of a corresponding tag sequence thereof. For details, refer to the related description of step S105 in the above method embodiment, and no further description is provided here.
A calculation module 106 configured to determine a total loss function of the DNN model based on the model loss function value and the peak loss function value. For details, refer to the related description of step S106 in the above method embodiment, and no further description is provided here.
The training module 107 is configured to train the DNN model based on the total loss function until the DNN model meets a preset training requirement. For details, refer to the related description of step S107 in the above method embodiment, and no further description is provided here.
For further description of the DNN model training apparatus, reference is made to the related description of the DNN model training method embodiment, which is not repeated herein.
Through the cooperative operation of the above components, the DNN model training apparatus provided in the embodiment of the present invention performs convergence training on the total loss function, which combines the model loss function and the peak loss function of the DNN model, to finally obtain the trained DNN model; by solving the alignment delay between the audio sequence and the tag sequence in the speech recognition process, the decoding speed and the recognition speed are greatly improved.
An embodiment of the present invention provides a speech recognition apparatus, as shown in fig. 9, the speech recognition apparatus includes:
and a second obtaining module 201, configured to obtain audio data to be identified. For details, refer to the related description of step S201 in the above method embodiment, and no further description is provided here.
The executing module 202 is configured to perform DNN model training by using the DNN model training apparatus as described above, so as to obtain a trained DNN model. For details, refer to the related description of step S202 in the above method embodiment, and no further description is provided herein.
And the identification module 203 is configured to input the audio data to be identified into the trained DNN model for identification, so as to obtain an identification result of the audio data to be identified. For details, refer to the related description of step S203 in the above method embodiment, and no further description is provided here.
For further description of the voice recognition apparatus, reference is made to the related description of the embodiment of the voice recognition method, and details are not repeated herein.
Through the cooperative operation of the above components, the speech recognition device provided in the embodiment of the present invention performs convergence training on the total loss function, which combines the model loss function and the peak loss function of the DNN model, to finally obtain the trained DNN model; by solving the alignment delay between the audio sequence and the tag sequence in the speech recognition process, the decoding speed and the recognition speed are greatly improved.
An embodiment of the present invention provides an electronic device, as shown in fig. 10, the electronic device includes a processor 901 and a memory 902, and the memory 902 and the processor 901 are communicatively connected to each other, where the processor 901 and the memory 902 may be connected by a bus or in another manner, and fig. 10 takes the connection by the bus as an example.
Processor 901 may be a Central Processing Unit (CPU). The Processor 901 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof.
The memory 902, which is a non-transitory computer readable storage medium, may be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the methods of the embodiments of the present invention. The processor 901 executes various functional applications and data processing of the processor by executing non-transitory software programs, instructions and modules stored in the memory 902, that is, implements the methods in the above-described method embodiments.
The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 901, and the like. Further, the memory 902 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 902 may optionally include memory located remotely from the processor 901, which may be connected to the processor 901 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 902, which when executed by the processor 901 performs the methods in the above-described method embodiments.
The specific details of the electronic device may be understood by referring to the corresponding related descriptions and effects in the above method embodiments, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, and the implemented program can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (12)

1. A DNN model training method is characterized by comprising the following steps:
acquiring voice characteristic data and a label sequence corresponding to a training audio sample;
inputting the voice characteristic data into a DNN model for label prediction, and determining a label prediction path;
calculating a model loss function value for the DNN model based on the probability of the label prediction path;
respectively calculating the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point thereof based on the label prediction path;
calculating a peak loss function value of the DNN model based on a distance between a position of each audio frame in the training audio sample and a corresponding tag sequence center point thereof;
determining a total loss function for the DNN model based on the model loss function values and the peak loss function values;
training the DNN model based on the total loss function until the DNN model meets preset training requirements.
2. The method of claim 1, wherein the calculating a model loss function value for the DNN model based on the probability of the label predicted path comprises:
obtaining the probability of each label prediction path corresponding to the current audio frame;
determining the probability of the total path corresponding to the current audio frame based on the probability of each label prediction path;
calculating a model loss function value of the DNN model based on the probability of the total path corresponding to each audio frame.
3. The method of claim 2, wherein the calculating the distance between the position of each audio frame in the training audio sample and the corresponding tag sequence center point based on the tag prediction path comprises:
calculating a peak value of the current audio frame based on the voice characteristic data of the current audio frame and acquiring the position of the peak value;
calculating the position of the central point of the label sequence corresponding to the current audio frame;
and calculating the distance between the peak position corresponding to the current audio frame and the central point of the label sequence corresponding to the peak position based on the position of the peak and the central point of the label sequence.
4. The method of claim 1, wherein calculating the peak loss function value for the DNN model based on a distance between a location of each audio frame in the training audio sample and its corresponding tag sequence center point comprises:
calculating the average distance of each label path based on the distance between the position of each audio frame in the training audio sample and the corresponding label sequence center point;
calculating the average value of distance loss of all paths based on the average distance of each label path;
determining a peak loss function value for the DNN model based on an average of all path distance losses.
5. The method of claim 1, wherein the determining a total loss function for the DNN model based on the peak loss function value and the model loss function value comprises:
respectively setting loss influence parameters of the peak loss function and the model loss function;
determining a total loss function for the DNN model based on the loss impact parameter, the peak loss function value, and the model loss function value.
6. The method of claim 1, further comprising:
calculating the probability of outputting non-empty labels under the condition of empty labels based on the relation between the labels corresponding to the first voice features corresponding to each audio frame in the voice feature data in the label sequence and the predicted labels for performing label prediction by the DNN model;
calculating a cross-entropy loss for the DNN model based on a probability of outputting a non-null label in the case of a null label.
7. The method of claim 6, wherein the determining a total loss function for the DNN model based on the peak loss function values and the model loss function values, further comprises:
setting a cross entropy loss function, the peak loss function and a loss influence parameter of the DNN model;
determining a total loss function for the DNN model based on the loss impact parameters, a cross-entropy loss function value for the DNN model, the peak loss function value, and the model loss function value.
8. A speech recognition method, comprising:
acquiring audio data to be identified;
carrying out DNN model training by adopting the DNN model training method of any one of claims 1 to 7 to obtain a trained DNN model;
and inputting the audio data to be recognized into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized.
9. A DNN model training device, comprising:
a first acquisition module, configured to acquire voice characteristic data and a label sequence corresponding to a training audio sample;
a first processing module, configured to input the voice characteristic data into a DNN model for label prediction and determine a label prediction path;
a second processing module, configured to calculate a model loss function value of the DNN model based on the probability of the label prediction path;
a third processing module, configured to calculate the distance between the position of each audio frame in the training audio sample and its corresponding label sequence center point based on the label prediction path;
a fourth processing module, configured to calculate a peak loss function value of the DNN model based on the distance between the position of each audio frame in the training audio sample and its corresponding label sequence center point;
a calculation module, configured to determine a total loss function of the DNN model based on the model loss function value and the peak loss function value;
and a training module, configured to train the DNN model based on the total loss function until the DNN model meets a preset training requirement.
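One hypothetical training step tying the modules of claim 9 together; it reuses the `peak_to_center_distances` and `peak_loss` sketches above, uses `torch.nn.CTCLoss` as the model (path-probability) loss, and, for brevity, computes the peak term on the first utterance of the batch only:

```python
import torch
import torch.nn as nn

def train_step(model, feats, targets, in_lens, tgt_lens,
               optimizer, label_spans, w_peak=0.1):
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # model (path) loss
    log_probs = model(feats).log_softmax(-1)       # (T, N, V)
    model_loss = ctc(log_probs, targets, in_lens, tgt_lens)
    # Peak term on the first utterance only (a simplification).
    dists = peak_to_center_distances(log_probs[:, 0].exp(), label_spans)
    total = model_loss + w_peak * peak_loss([dists])  # claim 5 combination
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```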
10. A speech recognition apparatus, comprising:
a second acquisition module, configured to acquire audio data to be recognized;
an execution module, configured to perform DNN model training by using the DNN model training device of claim 9 to obtain a trained DNN model;
and a recognition module, configured to input the audio data to be recognized into the trained DNN model for recognition to obtain a recognition result of the audio data to be recognized.
11. An electronic device, comprising:
a memory and a processor communicatively coupled to each other, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the method of any one of claims 1 to 8.
12. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 8.
CN202111664614.6A 2021-12-31 2021-12-31 DNN model training method and voice recognition method and device Active CN114299937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111664614.6A CN114299937B (en) 2021-12-31 2021-12-31 DNN model training method and voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111664614.6A CN114299937B (en) 2021-12-31 2021-12-31 DNN model training method and voice recognition method and device

Publications (2)

Publication Number Publication Date
CN114299937A true CN114299937A (en) 2022-04-08
CN114299937B CN114299937B (en) 2022-07-01

Family

ID=80973331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111664614.6A Active CN114299937B (en) 2021-12-31 2021-12-31 DNN model training method and voice recognition method and device

Country Status (1)

Country Link
CN (1) CN114299937B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190179905A1 (en) * 2016-08-19 2019-06-13 Huawei Technologies Co.,Ltd. Sequence conversion method and apparatus in natural language processing
CN111415681A (en) * 2020-03-17 2020-07-14 北京奇艺世纪科技有限公司 Method and device for determining musical notes based on audio data
CN112037798A (en) * 2020-09-18 2020-12-04 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN113470626A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Training method, device and equipment of voice recognition model
CN113505611A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining better speech translation model in generation of confrontation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
P. Zolfaghari et al., "Bayesian modelling of the speech spectrum using mixture of Gaussians," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115910044A (en) * 2023-01-10 2023-04-04 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle
CN115910044B (en) * 2023-01-10 2023-06-30 广州小鹏汽车科技有限公司 Voice recognition method and device and vehicle

Also Published As

Publication number Publication date
CN114299937B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
US11848008B2 (en) Artificial intelligence-based wakeup word detection method and apparatus, device, and medium
CN107908803B (en) Question-answer interaction response method and device, storage medium and terminal
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
JP3581401B2 (en) Voice recognition method
CN113505205B (en) Man-machine dialogue system and method
EP3413305A1 (en) Dual mode speech recognition
KR102313028B1 (en) System and method for voice recognition
US8676583B2 (en) Belief tracking and action selection in spoken dialog systems
WO2017101450A1 (en) Voice recognition method and device
KR20200086214A (en) Real-time speech recognition method and apparatus based on truncated attention, equipment and computer-readable storage medium
CN111344779A (en) Training and/or determining responsive actions for natural language input using coder models
CN103903619B (en) A kind of method and system improving speech recognition accuracy
EP3605537A1 (en) Speech emotion detection method and apparatus, computer device, and storage medium
US10714076B2 (en) Initialization of CTC speech recognition with standard HMM
CN107123417A (en) Optimization method and system are waken up based on the customized voice that distinctive is trained
WO2022134894A1 (en) Speech recognition method and apparatus, computer device, and storage medium
CN111241814B (en) Error correction method and device for voice recognition text, electronic equipment and storage medium
US20170061958A1 (en) Method and apparatus for improving a neural network language model, and speech recognition method and apparatus
JP2002268675A (en) Voice recognition device
CN110970031B (en) Speech recognition system and method
CN111554276B (en) Speech recognition method, device, equipment and computer readable storage medium
CN114299937B (en) DNN model training method and voice recognition method and device
EP3444806A1 (en) Voice recognition-based decoding method and device
CN114067786A (en) Voice recognition method and device, electronic equipment and storage medium
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant