CN116631379B - Speech recognition method, device, equipment and storage medium - Google Patents

Speech recognition method, device, equipment and storage medium Download PDF

Info

Publication number
CN116631379B
CN116631379B (application CN202310889848.3A)
Authority
CN
China
Prior art keywords
model
loss
aed
training
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310889848.3A
Other languages
Chinese (zh)
Other versions
CN116631379A (en)
Inventor
朱威
王琅
潘伟
钟佳
陈盛福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Post Consumer Finance Co ltd
Original Assignee
China Post Consumer Finance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Post Consumer Finance Co ltd filed Critical China Post Consumer Finance Co ltd
Priority to CN202310889848.3A priority Critical patent/CN116631379B/en
Publication of CN116631379A publication Critical patent/CN116631379A/en
Application granted granted Critical
Publication of CN116631379B publication Critical patent/CN116631379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and discloses a voice recognition method, device, equipment and storage medium, wherein the method comprises the following steps: collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic feature sequence; inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss; and completing voice recognition of the user's original voice based on the text sequence. The invention completes voice recognition with the preset voice recognition model, which is trained based on the CTC loss and the AED loss and is constructed by selecting the models with the smallest local losses during training and averaging them over multiple batches, so that voice recognition can be performed accurately even when the training sample data are few.

Description

Speech recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech recognition.
Background
With the continuing development of artificial intelligence technology, speech recognition, as one of its subfields, is being applied in more and more scenarios.
Conventional speech recognition is typically implemented by inputting speech data into an existing speech recognition model and outputting text data. However, the speech recognition models used in such conventional methods usually rely on a large amount of training sample data, and are therefore suitable only for scenarios that offer abundant training data and training time. When such conventional speech recognition is applied to a scenario with only a small amount of training sample data (or where the device configuration allows training on only a few samples), the accuracy of the recognition result is low. There is therefore a need in the industry for a method that can perform speech recognition accurately with little training sample data.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main object of the present invention is to provide a voice recognition method, apparatus, device and storage medium, aiming to solve the technical problem in the prior art that voice recognition cannot be performed accurately when training sample data are scarce.
To achieve the above object, the present invention provides a voice recognition method comprising the steps of:
collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
completing speech recognition of the original speech of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
Optionally, before the step of collecting the original voice of the user and preprocessing the original voice of the user to obtain the acoustic feature sequence, the method further includes:
screening training sample data from the historical prior data, and performing data cleaning on the training sample data to obtain training sample data after data cleaning;
dividing the training sample data after the data cleaning into a plurality of batches of samples based on batch size, wherein the batch size is the number size of the training sample data contained in the batch of samples;
training the initial large voice model based on the batch of samples to obtain a preset voice recognition model.
Optionally, the step of training the initial large voice model based on the batch of samples to obtain a preset voice recognition model includes:
inputting the batch of samples into an initial large voice model in sequence for training to obtain CTC loss and AED loss;
obtaining joint loss according to the CTC loss and the AED loss, and carrying out model average based on the joint loss to obtain a preset voice recognition model;
the calculation formula of the joint loss is as follows:
L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y);
wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch of samples, y represents the labels corresponding to the acoustic features, and λ represents the hyper-parameter that balances the CTC loss and the AED loss.
Optionally, the step of sequentially inputting the batch of samples into an initial large voice model for training to obtain CTC loss includes:
inputting the batch of samples into an initial large voice model for training, and obtaining the CTC loss by maximizing the probability of the correctly aligned labels over all possible alignments during training, wherein the calculation formula of the CTC loss is as follows:
Loss_CTC=-log(ΣP(Y’|X,A));
wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, and Σ represents a summation operation.
Optionally, the step of sequentially inputting the batch of samples into an initial large voice model for training to obtain AED losses includes:
inputting the batch of samples into an initial large voice model for training, and summing the label loss and the attention loss in the training process to obtain AED loss, wherein the calculation formula of the AED loss is as follows:
Loss_AED=Loss_Label+Loss_Attention;
Loss_Label=-Σ(log(P(y_i|Y)));
Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|;
wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively.
Optionally, the step of obtaining a preset speech recognition model by performing model averaging based on the joint loss includes:
model sampling is carried out on the initial large voice model at intervals of preset batch quantity, and the sampled current model is stored;
obtaining a plurality of joint losses based on a sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model;
and carrying out model average based on the epoch-a model and the epoch-b model to obtain a preset voice recognition model.
Optionally, the step of obtaining a preset speech recognition model based on model average of the epoch-a model and the epoch-b model includes:
respectively calculating the average value of the sample points within the p samples before and after the corresponding sampling point in the epoch-a model and the epoch-b model, so as to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)];
Model average is carried out based on the first average value and the second average value, and a preset voice recognition model is obtained;
wherein m represents that the epoch-a model is subjected to an mth sampling, n represents that the epoch-b model is subjected to an nth sampling, and p represents the number of batch samples corresponding to each sampling in the epoch-a model and the epoch-b model.
In addition, to achieve the above object, the present invention also proposes a voice recognition apparatus including:
the voice processing module is used for collecting original voice of a user and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
the model output module is used for inputting the acoustic feature sequence into a preset voice recognition model so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, and the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
the voice recognition module is used for completing voice recognition of the original voice of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
In addition, to achieve the above object, the present invention also proposes a voice recognition apparatus including: a memory, a processor, and a speech recognition program stored on the memory and executable on the processor, the speech recognition program configured to implement the steps of the speech recognition method as described above.
In addition, to achieve the above object, the present invention also proposes a computer-readable storage medium having stored thereon a speech recognition program which, when executed by a processor, implements the steps of the speech recognition method as described above.
The method comprises: collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic feature sequence; inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss; and completing voice recognition of the original voice of the user based on the text sequence. The CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights. Unlike the prior art, which performs voice recognition with a conventional voice recognition model, the acoustic feature sequence obtained by preprocessing the user's original voice is here input into the preset voice recognition model, which is built by training based on the CTC loss and the AED loss, so that a text sequence is obtained to complete voice recognition. Towards the end of training, the model parameter sampling points float near the optimal point; by averaging these sampling points that float around the optimal point, a model with lower noise (randomness), i.e., a model closer to the optimal point, can be obtained. Such a lower-noise model overcomes the problem that the averaging strategy of existing voice recognition models cannot effectively improve recognition accuracy in a small-sample fine-tuning scenario, so that voice recognition can be performed accurately even when the training sample data are few.
Drawings
FIG. 1 is a schematic diagram of a speech recognition device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of a speech recognition method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of a speech recognition method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of a speech recognition method according to the present invention;
fig. 5 is a block diagram of a first embodiment of a speech recognition device according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a speech recognition device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the voice recognition apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 is not limiting of the speech recognition device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a voice recognition program may be included in the memory 1005, which is a computer-readable storage medium.
In the speech recognition device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the voice recognition apparatus of the present invention may be provided in a voice recognition apparatus that invokes a voice recognition program stored in the memory 1005 through the processor 1001 and performs the voice recognition method provided by the embodiment of the present invention.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of a speech recognition method according to the present invention.
In this embodiment, the voice recognition method includes the following steps:
step S10: and collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic characteristic sequence.
It should be noted that, the execution body of the method of the present embodiment may be a computing service device with functions of data processing, network communication and program running, for example, a mobile phone, a tablet computer, a personal computer, etc., or may be other electronic devices capable of implementing the same or similar functions, which is not limited in this embodiment. Various embodiments of the speech recognition method of the present invention will be described herein by taking a speech recognition apparatus as an example.
It is understood that the user original voice may be voice data collected by a microphone or any other device capable of implementing a voice data collection function without any processing.
It should be appreciated that the acoustic feature sequence may be a series of digital representations extracted from a speech signal corresponding to the original speech of the user, for describing time-domain and frequency-domain features of the speech, such as mel-frequency cepstral coefficients, linear predictive coding, short-term energy, and zero-crossing rate, which is not limited in this embodiment.
In a specific implementation, the preprocessing may include a series of operations that can improve accuracy and robustness of speech recognition, such as denoising, audio gain adjustment, volume normalization, speech endpoint detection, speech enhancement, and the like, which is not limited in this embodiment.
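As a minimal illustrative sketch of such a preprocessing step (the library, the choice of 13-dimensional MFCCs, and the frame parameters are assumptions for illustration; this embodiment does not prescribe a specific feature type):

```python
import torch
import torchaudio

def extract_acoustic_features(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Preprocess raw user speech into an acoustic feature sequence (here: MFCC frames)."""
    # Volume normalisation, one of the preprocessing steps mentioned above.
    waveform = waveform / (waveform.abs().max() + 1e-8)
    # 13-dimensional MFCCs over 25 ms windows with a 10 ms hop (at 16 kHz).
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=13,
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
    )
    features = mfcc(waveform)                   # (channel, n_mfcc, frames)
    return features.squeeze(0).transpose(0, 1)  # (frames, n_mfcc) feature sequence
```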
Step S20: inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss.
Note that the CTC (Connectionist Temporal Classification, a neural-network-based temporal classification) loss is used for training tasks without aligned label sequences; the AED (Attention-based Encoder-Decoder) loss includes a label loss, used to measure the difference between the label sequence predicted by the initial large speech model during training and the ground-truth labels, and an attention loss, used to measure the difference between the attention weights generated by the initial large speech model during training and the expected attention weights.
It should be appreciated that the initial large speech model may be a hidden Markov model (Hidden Markov Model, HMM), a Gaussian mixture model (Gaussian Mixture Model, GMM), or other model capable of decoding acoustic feature sequences to achieve speech recognition, as this embodiment is not limited in this regard.
It should be appreciated that the text sequence described above may be the result of a local minimum model average based on training of small samples, which may be a sequence of characters, words, or other discrete units of text in a certain order.
Step S30: and completing the voice recognition of the original voice of the user based on the text sequence.
In a specific implementation, in a text sequence obtained after the acoustic feature sequence is decoded by the preset speech recognition model, the problems of repeated characters, wrong characters and the like may still exist, and at this time, a more accurate recognition result can be obtained by performing post-processing operation on the text sequence, so as to complete speech recognition of the original speech of the user.
Further, in this embodiment, in order to obtain a speech recognition model with lower noise (or randomness) and thus obtain an optimal speech recognition result, before step S10, the method may further include:
step S1: and screening training sample data from the historical prior data, and performing data cleaning on the training sample data to obtain the training sample data after data cleaning.
It should be noted that the historical prior data may be a public speech data set available on the Internet (such as LibriSpeech, Mozilla Common Voice, etc.), or may be other correct historical speech recognition cases, which are not described in detail herein.
In a specific implementation, the data cleaning can be implemented by removing duplicate data, processing missing values, solving data inconsistencies (such as misspellings, data format inconsistencies, etc.), and the like, so as to obtain training sample data after the data cleaning.
Step S2: the data-cleaned training sample data is divided into a plurality of batches of samples based on a batch size, the batch size being the number size of training sample data contained in the batch of samples.
It should be appreciated that, since the present embodiment is directed to a speech recognition scenario with few training samples, the batch size (batch_size) may also be set relatively small, e.g., batch_size = 4.
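A possible sketch of grouping the cleaned samples into such small batches (the dataset class, function name and collate behaviour are illustrative assumptions, not part of this embodiment; batch_size = 4 matches the example above):

```python
from torch.utils.data import DataLoader, Dataset

class CleanedSpeechDataset(Dataset):
    """Wraps the cleaned (acoustic_features, label_sequence) pairs."""
    def __init__(self, samples):
        self.samples = samples              # list of (features, labels) tuples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def make_batches(cleaned_samples, batch_size: int = 4):
    """Divide the cleaned training data into small batches of samples."""
    return DataLoader(
        CleanedSpeechDataset(cleaned_samples),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=lambda batch: batch,     # keep variable-length samples as a plain list
    )
```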
Step S3: training the initial large voice model based on the batch of samples to obtain a preset voice recognition model.
In a specific implementation, the training process of the initial large voice model may be divided into a plurality of rounds (epochs) based on the batches of samples, and the stepwise change of the model loss corresponding to each epoch is observed, so that the model optimization direction for the next round is determined based on the model loss of the current round. Late in training, the model parameter sampling points float near the optimal point. By averaging these sampling points floating near the loss optimum in parameter space, a model with lower noise (randomness), i.e., a model closer to the loss optimum, can be obtained. Therefore, each local loss minimum is selected as a sampling point for model averaging, and the result of the model averaging is the preset speech recognition model.
According to this embodiment, training sample data are screened from historical prior data, and data cleaning is carried out on the training sample data to obtain cleaned training sample data; the cleaned training sample data are divided into a plurality of batches of samples based on the batch size, the batch size being the number of training samples contained in a batch; an initial large voice model is trained based on the batches of samples to obtain a preset voice recognition model; original voice of a user is collected and preprocessed to obtain an acoustic feature sequence; the acoustic feature sequence is input into the preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, the preset voice recognition model being a model obtained by training the initial large voice model based on CTC loss and AED loss; and voice recognition of the user's original voice is completed based on the text sequence. The CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights. Unlike the prior art, which performs voice recognition with a conventional voice recognition model, this embodiment inputs the acoustic feature sequence obtained by preprocessing the user's original voice into the preset voice recognition model, which is built by training based on the CTC loss and the AED loss, so as to obtain the text sequence and complete voice recognition; this solves the technical problem that existing voice recognition methods must rely on a large amount of training sample data and training time, so that voice recognition can be performed accurately even when the training sample data are few.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a speech recognition method according to the present invention.
Based on the above-mentioned first embodiment, in this embodiment, in order to continuously adjust the initial large speech model in the training process, so as to obtain a preset speech recognition model with higher speech recognition accuracy, the step S3 may include:
step S31: and sequentially inputting the batch of samples into an initial large voice model for training to obtain CTC loss and AED loss.
It will be appreciated that the CTC loss described above can be used for sequence-to-sequence tasks without alignment labels. In this embodiment it is trained by maximizing the probability of the correctly aligned label; the calculation of the CTC loss is based on the conditional independence assumption that each output position depends only on part of the input features, and it takes all possible alignments into account, so the problem of length mismatch between input and output can be addressed.
It is understood that the AED loss described above may comprise two parts: one is the loss on the output label sequence, and the other is the loss on the attention weights. Label loss (Label Loss): the label loss is used to measure the difference between the model's prediction of the output label sequence at the decoder stage and the ground-truth labels; a cross-entropy loss function can typically be used to calculate it. Attention loss (Attention Loss): the attention loss is used to measure the difference between the attention weights generated by the model at the decoder stage and the expected attention weights.
Step S32: and obtaining joint loss according to the CTC loss and the AED loss, and carrying out model average based on the joint loss to obtain a preset voice recognition model.
In a specific implementation, the calculation formula of the joint loss may be:
L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y);
wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch, y represents the label corresponding to the acoustic features, and λ represents the hyper-parameter balancing the CTC loss and the AED loss.
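A minimal sketch of how such a joint loss could be combined in code (the value lam = 0.3 is an arbitrary illustrative default; this embodiment does not fix the hyper-parameter):

```python
import torch

def joint_loss(loss_ctc: torch.Tensor, loss_aed: torch.Tensor, lam: float = 0.3) -> torch.Tensor:
    # L_combined(x,y) = lam * L_CTC(x,y) + (1 - lam) * L_AED(x,y)
    # lam is the hyper-parameter that balances the two loss terms.
    return lam * loss_ctc + (1.0 - lam) * loss_aed
```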
Further, in this embodiment, in order to obtain CTC loss and AED loss more accurately, the step S31 may include:
step S311: and inputting the batch of samples into an initial large voice model for training, and obtaining CTC loss by calculating the probability of maximizing alignment of correct labels in the training process.
In a specific implementation, the calculation formula of CTC loss may be:
Loss_CTC=-log(ΣP(Y’|X,A));
wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, Σ represents a summation operation, and P(Y'|X, A) represents the probability of the correctly aligned label for the input acoustic features under alignment A.
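As an illustrative sketch, this objective corresponds to what an off-the-shelf CTC criterion computes; the use of torch.nn.CTCLoss below is an assumption, since this embodiment only specifies the formula itself:

```python
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def compute_ctc_loss(log_probs, targets, input_lengths, target_lengths):
    """Loss_CTC = -log(Σ P(Y'|X, A)): the criterion sums the probability of the correct
    label sequence over all valid alignments A and returns its negative log.

    log_probs      : (T, N, C) log-softmax outputs of the encoder
    targets        : concatenated label ids of all utterances in the batch
    input_lengths  : (N,) lengths of the encoder output sequences
    target_lengths : (N,) lengths of the label sequences
    """
    return ctc_criterion(log_probs, targets, input_lengths, target_lengths)
```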
Step S312: the batch of samples is input into an initial large voice model for training, during which AED losses are obtained by summing the label losses and the attention losses.
In a specific implementation, the above-mentioned AED loss calculation formula is:
Loss_AED=Loss_Label+Loss_Attention;
Loss_Label=-Σ(log(P(y_i|Y)));
Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|;
wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively.
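A hedged sketch of the AED loss computation following the formulas above (how gradient_penalty and the prior attention weights are obtained is not specified here, so they are passed in as given tensors; the function and argument names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def aed_loss(decoder_logits, target_ids, attention_weights, prior_weights,
             gradient_penalty, lam: float = 0.01, eps: float = 0.1) -> torch.Tensor:
    """Loss_AED = Loss_Label + Loss_Attention.

    decoder_logits    : (batch, seq_len, vocab) raw decoder outputs
    target_ids        : (batch, seq_len) ground-truth label ids
    attention_weights : attention weights produced by the decoder
    prior_weights     : expected (prior) attention weights of the same shape
    gradient_penalty  : precomputed gradient penalty term (scalar tensor)
    """
    # Label loss: -sum(log P(y_i | Y)), i.e. summed cross entropy over the decoder outputs.
    loss_label = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        target_ids.reshape(-1),
        reduction="sum",
    )
    # Attention loss: weighted gradient penalty plus the deviation from the prior weights.
    loss_attention = lam * gradient_penalty + eps * (attention_weights - prior_weights).abs().sum()
    return loss_label + loss_attention
```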
In this embodiment, the batch samples are input into an initial large voice model for training, and the CTC loss is obtained by maximizing the probability of the correctly aligned labels during training; the batch samples are likewise input into the initial large voice model for training, and the AED loss is obtained by summing the label loss and the attention loss during training; the joint loss is then obtained from the CTC loss and the AED loss, and model averaging is performed based on the joint loss to obtain the preset voice recognition model. Compared with existing speech recognition models, the method of this embodiment obtains the joint loss from the CTC loss and the AED loss and adjusts the model parameters of the initial large speech model based on the joint loss, so that a preset speech recognition model with higher recognition accuracy can be obtained, which further improves the reliability of the speech recognition result.
Referring to fig. 4, fig. 4 is a flowchart illustrating a third embodiment of a speech recognition method according to the present invention.
Based on the above embodiments, in this embodiment, in order to better save and compare models in different training phases, so as to select a model with smaller loss and better result from the models, the step S32 may include:
step S321: and carrying out model sampling on the initial large voice model at intervals of preset batch numbers, and storing the sampled current model.
It should be noted that the preset number of batches may be any non-zero natural number.
Step S322: and obtaining a plurality of joint losses based on the sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model.
In a specific implementation, the model parameter sampling points float near the optimal point at the later stage of model training in this embodiment. By averaging these sampling points floating around the loss optimum in parameter space, a model with lower noise (randomness), i.e. a model closer to the loss optimum, can be obtained. Thus, the present embodiment may select each local loss value minimum to be used as a sampling point for model averaging.
Step S323: and carrying out model average based on the epoch-a model and the epoch-b model to obtain a preset voice recognition model.
In a specific implementation, model averaging may be performed based on the above-described epoch-a model and the above-described epoch-b model.
Further, in this embodiment, in order to obtain a preset speech recognition model with lower noise (or randomness), so as to improve the recognition accuracy of the speech recognition in this embodiment, the step S323 may include:
step S3231: respectively calculating the average value of sample points between the front and back i p samples in the epoch-a model and the epoch-b model to obtain a first average model_avg [p×(m+1,m-1)] And a second average model_avg [p×(n+1,n-1)]
In the expression of the first average value and the second average value, m represents that the epoch-a model is sampled m times, n represents that the epoch-b model is sampled n times, and p represents the number of samples of the batch corresponding to each of the epoch-a model and the epoch-b model.
Step S3232: and carrying out model average based on the first average value and the second average value to obtain a preset voice recognition model.
In a specific implementation, the gradient may be calculated by a back propagation algorithm and the parameters of the initial large speech model may be updated to perform optimization adjustment, so as to obtain the preset speech recognition model.
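A compact sketch of the sampling-and-averaging strategy described in this embodiment (the checkpoint bookkeeping, variable names and the way joint losses are recorded are assumptions for illustration; only the averaging rule follows the text):

```python
import copy
import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of several checkpoints' parameters."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def build_final_model(model, checkpoints, joint_losses):
    """Average the checkpoints around the two sampling points with the smallest joint loss.

    checkpoints[k]  : list of state dicts saved within the p batches before and after
                      the k-th sampling point (assumed bookkeeping, not from the text)
    joint_losses[k] : joint loss recorded at the k-th sampling point
    """
    # Pick the two sampling points (the epoch-a and epoch-b models) with the smallest joint loss.
    a, b = sorted(range(len(joint_losses)), key=lambda k: joint_losses[k])[:2]
    first_average = average_state_dicts(checkpoints[a])    # model_avg[p×(m+1,m-1)]
    second_average = average_state_dicts(checkpoints[b])   # model_avg[p×(n+1,n-1)]
    # Final preset speech recognition model: mean of the two local averages.
    model.load_state_dict(average_state_dicts([first_average, second_average]))
    return model
```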
In this embodiment, model sampling is carried out on the initial large voice model at intervals of a preset number of batches, and the sampled current model is stored; a plurality of joint losses are obtained based on the sampling results, and the two joint losses with the smallest loss values are screened out from the plurality of joint losses, corresponding to an epoch-a model and an epoch-b model respectively; the average value of the sample points within the p samples before and after the corresponding sampling point is calculated respectively for the epoch-a model and the epoch-b model, so as to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)]; model averaging is then carried out based on the first average and the second average to obtain the preset voice recognition model; wherein m represents that the epoch-a model corresponds to the m-th sampling, n represents that the epoch-b model corresponds to the n-th sampling, and p represents the number of batches of samples corresponding to each sampling in the epoch-a model and the epoch-b model. Compared with existing voice recognition methods, the method in this embodiment obtains a preset voice recognition model with lower noise (i.e., randomness) by averaging the models with the smallest joint losses (i.e., the sampling points floating near the optimal point), thereby improving the recognition accuracy of the voice recognition in this embodiment.
Furthermore, an embodiment of the present invention also proposes a computer-readable storage medium, on which a speech recognition program is stored, which, when executed by a processor, implements the steps of the speech recognition method as described above.
Referring to fig. 5, fig. 5 is a block diagram showing the structure of a first embodiment of the speech recognition apparatus according to the present invention.
As shown in fig. 5, a voice recognition apparatus according to an embodiment of the present invention includes:
the voice processing module 501 is used for collecting original voice of a user and preprocessing the original voice of the user to obtain an acoustic feature sequence;
the model output module 502 is configured to input the acoustic feature sequence into a preset speech recognition model, so that the preset speech recognition model decodes the acoustic feature sequence to obtain a text sequence, where the preset speech recognition model is a model obtained by training an initial large speech model based on CTC loss and AED loss;
a speech recognition module 503, configured to complete speech recognition of the original speech of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
In this embodiment, the original voice of a user is collected and preprocessed to obtain an acoustic feature sequence; the acoustic feature sequence is input into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, the preset voice recognition model being a model obtained by training an initial large voice model based on CTC loss and AED loss; and voice recognition of the user's original voice is completed based on the text sequence. The CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights. Unlike the prior art, which performs voice recognition with a conventional voice recognition model, this embodiment inputs the acoustic feature sequence obtained by preprocessing the user's original voice into the preset voice recognition model, which is built by training based on the CTC loss and the AED loss, so as to obtain the text sequence and complete voice recognition; this solves the technical problem that existing voice recognition methods must rely on a large amount of training sample data and training time, so that voice recognition can be performed accurately even when the training sample data are few.
Based on the above-described first embodiment of the speech recognition device of the present invention, a second embodiment of the speech recognition device of the present invention is presented.
In this embodiment, the speech processing module 502 is further configured to screen training sample data from historical prior data, and perform data cleaning on the training sample data to obtain training sample data after data cleaning; dividing the training sample data after the data cleaning into a plurality of batches of samples based on batch size, wherein the batch size is the number size of the training sample data contained in the batch of samples; training the initial large voice model based on the batch of samples to obtain a preset voice recognition model.
Further, the speech processing module 502 is further configured to sequentially input the batch of samples into an initial large speech model for training, so as to obtain the CTC loss and the AED loss; and to obtain a joint loss from the CTC loss and the AED loss and perform model averaging based on the joint loss to obtain a preset voice recognition model; the calculation formula of the joint loss is: L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y); wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch of samples, y represents the label corresponding to the acoustic features, and λ represents the hyper-parameter that balances the CTC loss and the AED loss.
Further, the speech processing module 502 is further configured to input the batch of samples into an initial large speech model for training, obtaining the CTC loss by maximizing the probability of the correctly aligned label over all possible alignments during training, wherein the calculation formula of the CTC loss is: Loss_CTC=-log(ΣP(Y'|X,A)); wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, and Σ represents a summation operation.
Further, the speech processing module 502 is further configured to input the batch of samples into an initial large speech model for training, obtaining the AED loss by summing the label loss and the attention loss during training, wherein the calculation formula of the AED loss is: Loss_AED=Loss_Label+Loss_Attention; Loss_Label=-Σ(log(P(y_i|Y))); Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|; wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively.
Further, the speech processing module 502 is further configured to sample the initial large speech model at intervals of a preset batch, and store the sampled current model; obtaining a plurality of joint losses based on a sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model; and carrying out model average based on the epoch-a model and the epoch-b model to obtain a preset voice recognition model.
Further, the speech processing module 502 is further configured to calculate, for the epoch-a model and the epoch-b model respectively, the average value of the sample points within the p samples before and after the corresponding sampling point, so as to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)]; and to perform model averaging based on the first average and the second average to obtain a preset voice recognition model; wherein m represents that the epoch-a model corresponds to the m-th sampling, n represents that the epoch-b model corresponds to the n-th sampling, and p represents the number of batches of samples corresponding to each sampling in the epoch-a model and the epoch-b model.
Other embodiments or specific implementations of the speech recognition device of the present invention may refer to the above method embodiments, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of embodiments, it will be clear to a person skilled in the art that the above embodiment method may be implemented by means of software plus a necessary general hardware platform, but may of course also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a computer readable storage medium (e.g. read only memory/random access memory, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (4)

1. A method of speech recognition, the method comprising the steps of:
collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
completing speech recognition of the original speech of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights;
the step of collecting the original voice of the user and preprocessing the original voice of the user to obtain an acoustic characteristic sequence is preceded by the following steps:
screening training sample data from the historical prior data, and performing data cleaning on the training sample data to obtain training sample data after data cleaning;
dividing the training sample data after the data cleaning into a plurality of batches of samples based on batch size, wherein the batch size is the number size of the training sample data contained in the batch of samples;
inputting the batch of samples into an initial large voice model for training, and obtaining the CTC loss by maximizing the probability of the correctly aligned labels over all possible alignments during training, wherein the calculation formula of the CTC loss is as follows:
Loss_CTC=-log(ΣP(Y’|X,A));
wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, and Σ represents a summation operation;
inputting the batch of samples into an initial large voice model for training, and summing the label loss and the attention loss in the training process to obtain AED loss, wherein the calculation formula of the AED loss is as follows:
Loss_AED=Loss_Label+Loss_Attention;
Loss_Label=-Σ(log(P(y_i|Y)));
Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|;
wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively;
obtaining joint loss according to the CTC loss and the AED loss, and carrying out model average based on the joint loss to obtain a preset voice recognition model;
the calculation formula of the joint loss is as follows:
L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y);
wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch of samples, y represents the labels corresponding to the acoustic features, and λ represents the hyper-parameter balancing the CTC loss and the AED loss;
the step of obtaining a preset speech recognition model by performing model average based on the joint loss comprises the following steps:
model sampling is carried out on the initial large voice model at intervals of preset batch quantity, and the sampled current model is stored;
obtaining a plurality of joint losses based on a sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model;
respectively calculating the average value of the sample points within the p samples before and after the corresponding sampling point in the epoch-a model and the epoch-b model to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)];
Model average is carried out based on the first average value and the second average value, and a preset voice recognition model is obtained;
wherein m represents that the epoch-a model is subjected to an mth sampling, n represents that the epoch-b model is subjected to an nth sampling, and p represents the number of batch samples corresponding to each sampling in the epoch-a model and the epoch-b model.
2. A speech recognition apparatus based on the speech recognition method of claim 1, characterized in that the speech recognition apparatus comprises:
the voice processing module is used for collecting original voice of a user and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
the model output module is used for inputting the acoustic feature sequence into a preset voice recognition model so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, and the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
the voice recognition module is used for completing voice recognition of the original voice of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
3. A speech recognition device, the device comprising: a memory, a processor, and a speech recognition program stored on the memory and executable on the processor, the speech recognition program configured to implement the steps of the speech recognition method of claim 1.
4. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a speech recognition program which, when executed by a processor, implements the steps of the speech recognition method according to claim 1.
CN202310889848.3A 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium Active CN116631379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310889848.3A CN116631379B (en) 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310889848.3A CN116631379B (en) 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116631379A CN116631379A (en) 2023-08-22
CN116631379B (en) 2023-09-26

Family

ID=87621580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310889848.3A Active CN116631379B (en) 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116631379B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112652295A (en) * 2020-12-22 2021-04-13 平安国际智慧城市科技股份有限公司 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN113362812A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113870846A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Speech recognition method, device and storage medium based on artificial intelligence
CN114023316A (en) * 2021-11-04 2022-02-08 匀熵科技(无锡)有限公司 TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN114255744A (en) * 2021-12-15 2022-03-29 山东新一代信息产业技术研究院有限公司 Online end-to-end automatic voice recognition method
CN114420107A (en) * 2022-01-12 2022-04-29 平安科技(深圳)有限公司 Speech recognition method based on non-autoregressive model and related equipment
CN114882874A (en) * 2022-05-30 2022-08-09 平安科技(深圳)有限公司 End-to-end model training method and device, computer equipment and storage medium
CN115249479A (en) * 2022-01-24 2022-10-28 长江大学 BRNN-based power grid dispatching complex speech recognition method, system and terminal
US11580957B1 (en) * 2021-12-17 2023-02-14 Institute Of Automation, Chinese Academy Of Sciences Method for training speech recognition model, method and system for speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11756551B2 (en) * 2020-10-07 2023-09-12 Mitsubishi Electric Research Laboratories, Inc. System and method for producing metadata of an audio signal

Also Published As

Publication number Publication date
CN116631379A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
WO2019204547A1 (en) Systems and methods for automatic speech recognition using domain adaptation techniques
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
Cheng et al. A call-independent and automatic acoustic system for the individual recognition of animals: A novel model using four passerines
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111554270B (en) Training sample screening method and electronic equipment
CN110930975A (en) Method and apparatus for outputting information
CN106448660B (en) It is a kind of introduce big data analysis natural language smeared out boundary determine method
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
CN113838462A (en) Voice wake-up method and device, electronic equipment and computer readable storage medium
CN116631379B (en) Speech recognition method, device, equipment and storage medium
CN113160823A (en) Voice awakening method and device based on pulse neural network and electronic equipment
CN113345410A (en) Training method of general speech and target speech synthesis model and related device
JP5091202B2 (en) Identification method that can identify any language without using samples
Vinay et al. Dysfluent Speech Classification Using Variational Mode Decomposition and Complete Ensemble Empirical Mode Decomposition Techniques with NGCU based RNN
CN113889085B (en) Speech recognition method, apparatus, device, storage medium, and program product
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
Räsänen et al. A noise robust method for pattern discovery in quantized time series: the concept matrix approach

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant