CN116631379B - Speech recognition method, device, equipment and storage medium - Google Patents

Speech recognition method, device, equipment and storage medium Download PDF

Info

Publication number
CN116631379B
CN116631379B (application CN202310889848.3A)
Authority
CN
China
Prior art keywords
model
loss
aed
training
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310889848.3A
Other languages
Chinese (zh)
Other versions
CN116631379A (en)
Inventor
朱威
王琅
潘伟
钟佳
陈盛福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Post Consumer Finance Co ltd
Original Assignee
China Post Consumer Finance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Post Consumer Finance Co ltd filed Critical China Post Consumer Finance Co ltd
Priority to CN202310889848.3A priority Critical patent/CN116631379B/en
Publication of CN116631379A publication Critical patent/CN116631379A/en
Application granted granted Critical
Publication of CN116631379B publication Critical patent/CN116631379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and discloses a voice recognition method, device, equipment and storage medium, wherein the method comprises the following steps: collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic feature sequence; inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss; and completing voice recognition of the user's original voice based on the text sequence. The invention completes voice recognition with the preset voice recognition model, which is trained based on the CTC loss and the AED loss and is constructed by selecting the models with the smallest local losses during training and averaging them over multiple batches, so that voice recognition can be performed accurately even when the training sample data are few.

Description

Speech recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech recognition.
Background
With the continuing development of artificial intelligence technology, speech recognition, as one of its subfields, is being applied in more and more scenarios.
Conventional speech recognition is typically implemented by inputting speech data into an existing speech recognition model and outputting text data. However, the speech recognition models used in such conventional methods usually rely on a large amount of training sample data, and are therefore suitable only for scenarios that offer abundant training data and training time. When such conventional speech recognition is applied to a scenario with only a small amount of training sample data (or where the device configuration allows training on only a few samples), the accuracy of the recognition result is low. There is therefore a need in the industry for a method that can perform speech recognition accurately with little training sample data.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main object of the present invention is to provide a voice recognition method, apparatus, device and storage medium, aiming to solve the technical problem in the prior art that voice recognition cannot be performed accurately when training sample data are scarce.
To achieve the above object, the present invention provides a voice recognition method comprising the steps of:
collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
completing speech recognition of the original speech of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
Optionally, before the step of collecting the original voice of the user and preprocessing the original voice of the user to obtain the acoustic feature sequence, the method further includes:
screening training sample data from the historical prior data, and performing data cleaning on the training sample data to obtain training sample data after data cleaning;
dividing the training sample data after the data cleaning into a plurality of batches of samples based on batch size, wherein the batch size is the number size of the training sample data contained in the batch of samples;
training the initial large voice model based on the batch of samples to obtain a preset voice recognition model.
Optionally, the step of training the initial large voice model based on the batch of samples to obtain a preset voice recognition model includes:
inputting the batch of samples into an initial large voice model in sequence for training to obtain CTC loss and AED loss;
obtaining joint loss according to the CTC loss and the AED loss, and carrying out model average based on the joint loss to obtain a preset voice recognition model;
the calculation formula of the joint loss is as follows:
L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y);
wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch of samples, y represents the labels corresponding to the acoustic features, and λ represents the hyper-parameter that balances the CTC loss and the AED loss.
Optionally, the step of sequentially inputting the batch of samples into an initial large voice model for training to obtain CTC loss includes:
inputting the batch of samples into an initial large voice model for training, and obtaining the CTC loss by maximizing the probability of the correctly aligned labels over all possible alignments during training, wherein the calculation formula of the CTC loss is as follows:
Loss_CTC=-log(ΣP(Y’|X,A));
wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, and Σ represents a summation operation.
Optionally, the step of sequentially inputting the batch of samples into an initial large voice model for training to obtain AED losses includes:
inputting the batch of samples into an initial large voice model for training, and summing the label loss and the attention loss in the training process to obtain AED loss, wherein the calculation formula of the AED loss is as follows:
Loss_AED=Loss_Label+Loss_Attention;
Loss_Label=-Σ(log(P(y_i|Y)));
Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|;
wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively.
Optionally, the step of obtaining a preset speech recognition model by performing model averaging based on the joint loss includes:
model sampling is carried out on the initial large voice model at intervals of preset batch quantity, and the sampled current model is stored;
obtaining a plurality of joint losses based on a sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model;
and carrying out model average based on the epoch-a model and the epoch-b model to obtain a preset voice recognition model.
Optionally, the step of obtaining a preset speech recognition model based on model average of the epoch-a model and the epoch-b model includes:
respectively calculating the average value of the sample points within the p samples before and after the corresponding sampling point in the epoch-a model and the epoch-b model, so as to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)];
Model average is carried out based on the first average value and the second average value, and a preset voice recognition model is obtained;
wherein m represents that the epoch-a model is subjected to an mth sampling, n represents that the epoch-b model is subjected to an nth sampling, and p represents the number of batch samples corresponding to each sampling in the epoch-a model and the epoch-b model.
In addition, to achieve the above object, the present invention also proposes a voice recognition apparatus including:
the voice processing module is used for collecting original voice of a user and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
the model output module is used for inputting the acoustic feature sequence into a preset voice recognition model so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, and the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
the voice recognition module is used for completing voice recognition of the original voice of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
In addition, to achieve the above object, the present invention also proposes a voice recognition apparatus including: a memory, a processor, and a speech recognition program stored on the memory and executable on the processor, the speech recognition program configured to implement the steps of the speech recognition method as described above.
In addition, to achieve the above object, the present invention also proposes a computer-readable storage medium having stored thereon a speech recognition program which, when executed by a processor, implements the steps of the speech recognition method as described above.
The method comprises: collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic feature sequence; inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss; and completing voice recognition of the original voice of the user based on the text sequence. The CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights. Unlike the prior art, which performs voice recognition with a conventional voice recognition model, the acoustic feature sequence obtained by preprocessing the user's original voice is here input into the preset voice recognition model, which is built by training based on the CTC loss and the AED loss, so that a text sequence is obtained to complete voice recognition. Towards the end of training, the model parameter sampling points float near the optimal point; by averaging these sampling points that float around the optimal point, a model with lower noise (randomness), i.e., a model closer to the optimal point, can be obtained. Such a lower-noise model overcomes the problem that the averaging strategy of existing voice recognition models cannot effectively improve recognition accuracy in a small-sample fine-tuning scenario, so that voice recognition can be performed accurately even when the training sample data are few.
Drawings
FIG. 1 is a schematic diagram of a speech recognition device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of a speech recognition method according to the present invention;
FIG. 3 is a flowchart of a second embodiment of a speech recognition method according to the present invention;
FIG. 4 is a flowchart of a third embodiment of a speech recognition method according to the present invention;
fig. 5 is a block diagram of a first embodiment of a speech recognition device according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a speech recognition device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the voice recognition apparatus may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 is not limiting of the speech recognition device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a voice recognition program may be included in the memory 1005, which is a computer-readable storage medium.
In the speech recognition device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the voice recognition apparatus of the present invention may be provided in a voice recognition apparatus that invokes a voice recognition program stored in the memory 1005 through the processor 1001 and performs the voice recognition method provided by the embodiment of the present invention.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of a speech recognition method according to the present invention.
In this embodiment, the voice recognition method includes the following steps:
step S10: and collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic characteristic sequence.
It should be noted that, the execution body of the method of the present embodiment may be a computing service device with functions of data processing, network communication and program running, for example, a mobile phone, a tablet computer, a personal computer, etc., or may be other electronic devices capable of implementing the same or similar functions, which is not limited in this embodiment. Various embodiments of the speech recognition method of the present invention will be described herein by taking a speech recognition apparatus as an example.
It is understood that the user original voice may be voice data collected by a microphone or any other device capable of implementing a voice data collection function without any processing.
It should be appreciated that the acoustic feature sequence may be a series of digital representations extracted from a speech signal corresponding to the original speech of the user, for describing time-domain and frequency-domain features of the speech, such as mel-frequency cepstral coefficients, linear predictive coding, short-term energy, and zero-crossing rate, which is not limited in this embodiment.
In a specific implementation, the preprocessing may include a series of operations that can improve accuracy and robustness of speech recognition, such as denoising, audio gain adjustment, volume normalization, speech endpoint detection, speech enhancement, and the like, which is not limited in this embodiment.
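As a minimal illustrative sketch of such a preprocessing step (the library, the choice of 13-dimensional MFCCs, and the frame parameters are assumptions for illustration; this embodiment does not prescribe a specific feature type):

```python
import torch
import torchaudio

def extract_acoustic_features(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Preprocess raw user speech into an acoustic feature sequence (here: MFCC frames)."""
    # Volume normalisation, one of the preprocessing steps mentioned above.
    waveform = waveform / (waveform.abs().max() + 1e-8)
    # 13-dimensional MFCCs over 25 ms windows with a 10 ms hop (at 16 kHz).
    mfcc = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=13,
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
    )
    features = mfcc(waveform)                   # (channel, n_mfcc, frames)
    return features.squeeze(0).transpose(0, 1)  # (frames, n_mfcc) feature sequence
```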
Step S20: inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss.
Note that the CTC (Connectionist Temporal Classification, a neural-network-based temporal classification) loss is used for training tasks without aligned label sequences; the AED (Attention-based Encoder-Decoder) loss includes a label loss, used to measure the difference between the label sequence predicted by the initial large speech model during training and the ground-truth labels, and an attention loss, used to measure the difference between the attention weights generated by the initial large speech model during training and the expected attention weights.
It should be appreciated that the initial large speech model may be a hidden Markov model (Hidden Markov Model, HMM), a Gaussian mixture model (Gaussian Mixture Model, GMM), or other model capable of decoding acoustic feature sequences to achieve speech recognition, as this embodiment is not limited in this regard.
It should be appreciated that the text sequence described above may be the result of a local minimum model average based on training of small samples, which may be a sequence of characters, words, or other discrete units of text in a certain order.
Step S30: and completing the voice recognition of the original voice of the user based on the text sequence.
In a specific implementation, in a text sequence obtained after the acoustic feature sequence is decoded by the preset speech recognition model, the problems of repeated characters, wrong characters and the like may still exist, and at this time, a more accurate recognition result can be obtained by performing post-processing operation on the text sequence, so as to complete speech recognition of the original speech of the user.
Further, in this embodiment, in order to obtain a speech recognition model with lower noise (or randomness) and thus obtain an optimal speech recognition result, before step S10, the method may further include:
step S1: and screening training sample data from the historical prior data, and performing data cleaning on the training sample data to obtain the training sample data after data cleaning.
It should be noted that the historical prior data may be a public speech data set available on the Internet (such as LibriSpeech, Mozilla Common Voice, etc.), or may be other correct historical speech recognition cases, which are not described in detail herein.
In a specific implementation, the data cleaning can be implemented by removing duplicate data, processing missing values, solving data inconsistencies (such as misspellings, data format inconsistencies, etc.), and the like, so as to obtain training sample data after the data cleaning.
Step S2: the data-cleaned training sample data is divided into a plurality of batches of samples based on a batch size, the batch size being the number size of training sample data contained in the batch of samples.
It should be appreciated that, since the present embodiment is directed to a speech recognition scenario with few training samples, the batch size (batch_size) may also be set relatively small, e.g., batch_size = 4.
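A possible sketch of grouping the cleaned samples into such small batches (the dataset class, function name and collate behaviour are illustrative assumptions, not part of this embodiment; batch_size = 4 matches the example above):

```python
from torch.utils.data import DataLoader, Dataset

class CleanedSpeechDataset(Dataset):
    """Wraps the cleaned (acoustic_features, label_sequence) pairs."""
    def __init__(self, samples):
        self.samples = samples              # list of (features, labels) tuples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def make_batches(cleaned_samples, batch_size: int = 4):
    """Divide the cleaned training data into small batches of samples."""
    return DataLoader(
        CleanedSpeechDataset(cleaned_samples),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=lambda batch: batch,     # keep variable-length samples as a plain list
    )
```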
Step S3: training the initial large voice model based on the batch of samples to obtain a preset voice recognition model.
In a specific implementation, the training process of the initial large voice model may be divided into a plurality of rounds (epochs) based on the batches of samples, and the stepwise change of the model loss corresponding to each epoch is observed, so that the model optimization direction for the next round is determined based on the model loss of the current round. Late in training, the model parameter sampling points float near the optimal point. By averaging these sampling points floating near the loss optimum in parameter space, a model with lower noise (randomness), i.e., a model closer to the loss optimum, can be obtained. Therefore, each local loss minimum is selected as a sampling point for model averaging, and the result of the model averaging is the preset speech recognition model.
According to this embodiment, training sample data are screened from historical prior data, and data cleaning is carried out on the training sample data to obtain cleaned training sample data; the cleaned training sample data are divided into a plurality of batches of samples based on the batch size, the batch size being the number of training samples contained in a batch; an initial large voice model is trained based on the batches of samples to obtain a preset voice recognition model; original voice of a user is collected and preprocessed to obtain an acoustic feature sequence; the acoustic feature sequence is input into the preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, the preset voice recognition model being a model obtained by training the initial large voice model based on CTC loss and AED loss; and voice recognition of the user's original voice is completed based on the text sequence. The CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights. Unlike the prior art, which performs voice recognition with a conventional voice recognition model, this embodiment inputs the acoustic feature sequence obtained by preprocessing the user's original voice into the preset voice recognition model, which is built by training based on the CTC loss and the AED loss, so as to obtain the text sequence and complete voice recognition; this solves the technical problem that existing voice recognition methods must rely on a large amount of training sample data and training time, so that voice recognition can be performed accurately even when the training sample data are few.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of a speech recognition method according to the present invention.
Based on the above-mentioned first embodiment, in this embodiment, in order to continuously adjust the initial large speech model in the training process, so as to obtain a preset speech recognition model with higher speech recognition accuracy, the step S3 may include:
step S31: and sequentially inputting the batch of samples into an initial large voice model for training to obtain CTC loss and AED loss.
It will be appreciated that the CTC loss described above can be used for sequence-to-sequence tasks without alignment labels. In this embodiment it is trained by maximizing the probability of the correctly aligned label; the calculation of the CTC loss is based on the conditional independence assumption that each output position depends only on part of the input features, and it takes all possible alignments into account, so the problem of length mismatch between input and output can be addressed.
It is understood that the AED loss described above may comprise two parts: one is the loss on the output label sequence, and the other is the loss on the attention weights. Label loss (Label Loss): the label loss is used to measure the difference between the model's prediction of the output label sequence at the decoder stage and the ground-truth labels; a cross-entropy loss function can typically be used to calculate it. Attention loss (Attention Loss): the attention loss is used to measure the difference between the attention weights generated by the model at the decoder stage and the expected attention weights.
Step S32: and obtaining joint loss according to the CTC loss and the AED loss, and carrying out model average based on the joint loss to obtain a preset voice recognition model.
In a specific implementation, the calculation formula of the joint loss may be:
L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y);
wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch, y represents the label corresponding to the acoustic features, and λ represents the hyper-parameter balancing the CTC loss and the AED loss.
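A minimal sketch of how such a joint loss could be combined in code (the value lam = 0.3 is an arbitrary illustrative default; this embodiment does not fix the hyper-parameter):

```python
import torch

def joint_loss(loss_ctc: torch.Tensor, loss_aed: torch.Tensor, lam: float = 0.3) -> torch.Tensor:
    # L_combined(x,y) = lam * L_CTC(x,y) + (1 - lam) * L_AED(x,y)
    # lam is the hyper-parameter that balances the two loss terms.
    return lam * loss_ctc + (1.0 - lam) * loss_aed
```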
Further, in this embodiment, in order to obtain CTC loss and AED loss more accurately, the step S31 may include:
step S311: and inputting the batch of samples into an initial large voice model for training, and obtaining CTC loss by calculating the probability of maximizing alignment of correct labels in the training process.
In a specific implementation, the calculation formula of CTC loss may be:
Loss_CTC=-log(ΣP(Y’|X,A));
wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, Σ represents a summation operation, and P(Y'|X, A) represents the probability of the correctly aligned label for the input acoustic features under alignment A.
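As an illustrative sketch, this objective corresponds to what an off-the-shelf CTC criterion computes; the use of torch.nn.CTCLoss below is an assumption, since this embodiment only specifies the formula itself:

```python
import torch
import torch.nn as nn

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def compute_ctc_loss(log_probs, targets, input_lengths, target_lengths):
    """Loss_CTC = -log(Σ P(Y'|X, A)): the criterion sums the probability of the correct
    label sequence over all valid alignments A and returns its negative log.

    log_probs      : (T, N, C) log-softmax outputs of the encoder
    targets        : concatenated label ids of all utterances in the batch
    input_lengths  : (N,) lengths of the encoder output sequences
    target_lengths : (N,) lengths of the label sequences
    """
    return ctc_criterion(log_probs, targets, input_lengths, target_lengths)
```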
Step S312: the batch of samples is input into an initial large voice model for training, during which AED losses are obtained by summing the label losses and the attention losses.
In a specific implementation, the above-mentioned AED loss calculation formula is:
Loss_AED=Loss_Label+Loss_Attention;
Loss_Label=-Σ(log(P(y_i|Y)));
Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|;
wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively.
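A hedged sketch of the AED loss computation following the formulas above (how gradient_penalty and the prior attention weights are obtained is not specified here, so they are passed in as given tensors; the function and argument names are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def aed_loss(decoder_logits, target_ids, attention_weights, prior_weights,
             gradient_penalty, lam: float = 0.01, eps: float = 0.1) -> torch.Tensor:
    """Loss_AED = Loss_Label + Loss_Attention.

    decoder_logits    : (batch, seq_len, vocab) raw decoder outputs
    target_ids        : (batch, seq_len) ground-truth label ids
    attention_weights : attention weights produced by the decoder
    prior_weights     : expected (prior) attention weights of the same shape
    gradient_penalty  : precomputed gradient penalty term (scalar tensor)
    """
    # Label loss: -sum(log P(y_i | Y)), i.e. summed cross entropy over the decoder outputs.
    loss_label = F.cross_entropy(
        decoder_logits.reshape(-1, decoder_logits.size(-1)),
        target_ids.reshape(-1),
        reduction="sum",
    )
    # Attention loss: weighted gradient penalty plus the deviation from the prior weights.
    loss_attention = lam * gradient_penalty + eps * (attention_weights - prior_weights).abs().sum()
    return loss_label + loss_attention
```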
In this embodiment, the batch samples are input into an initial large voice model for training, and the CTC loss is obtained by maximizing the probability of the correctly aligned labels during training; the batch samples are likewise input into the initial large voice model for training, and the AED loss is obtained by summing the label loss and the attention loss during training; the joint loss is then obtained from the CTC loss and the AED loss, and model averaging is performed based on the joint loss to obtain the preset voice recognition model. Compared with existing speech recognition models, the method of this embodiment obtains the joint loss from the CTC loss and the AED loss and adjusts the model parameters of the initial large speech model based on the joint loss, so that a preset speech recognition model with higher recognition accuracy can be obtained, which further improves the reliability of the speech recognition result.
Referring to fig. 4, fig. 4 is a flowchart illustrating a third embodiment of a speech recognition method according to the present invention.
Based on the above embodiments, in this embodiment, in order to better save and compare models in different training phases, so as to select a model with smaller loss and better result from the models, the step S32 may include:
step S321: and carrying out model sampling on the initial large voice model at intervals of preset batch numbers, and storing the sampled current model.
It should be noted that the preset number of batches may be any non-zero natural number.
Step S322: and obtaining a plurality of joint losses based on the sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model.
In a specific implementation, the model parameter sampling points float near the optimal point at the later stage of model training in this embodiment. By averaging these sampling points floating around the loss optimum in parameter space, a model with lower noise (randomness), i.e. a model closer to the loss optimum, can be obtained. Thus, the present embodiment may select each local loss value minimum to be used as a sampling point for model averaging.
Step S323: and carrying out model average based on the epoch-a model and the epoch-b model to obtain a preset voice recognition model.
In a specific implementation, model averaging may be performed based on the above-described epoch-a model and the above-described epoch-b model.
Further, in this embodiment, in order to obtain a preset speech recognition model with lower noise (or randomness), so as to improve the recognition accuracy of the speech recognition in this embodiment, the step S323 may include:
step S3231: respectively calculating the average value of sample points between the front and back i p samples in the epoch-a model and the epoch-b model to obtain a first average model_avg [p×(m+1,m-1)] And a second average model_avg [p×(n+1,n-1)]
In the expression of the first average value and the second average value, m represents that the epoch-a model is sampled m times, n represents that the epoch-b model is sampled n times, and p represents the number of samples of the batch corresponding to each of the epoch-a model and the epoch-b model.
Step S3232: and carrying out model average based on the first average value and the second average value to obtain a preset voice recognition model.
In a specific implementation, the gradient may be calculated by a back propagation algorithm and the parameters of the initial large speech model may be updated to perform optimization adjustment, so as to obtain the preset speech recognition model.
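A compact sketch of the sampling-and-averaging strategy described in this embodiment (the checkpoint bookkeeping, variable names and the way joint losses are recorded are assumptions for illustration; only the averaging rule follows the text):

```python
import copy
import torch

def average_state_dicts(state_dicts):
    """Element-wise mean of several checkpoints' parameters."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def build_final_model(model, checkpoints, joint_losses):
    """Average the checkpoints around the two sampling points with the smallest joint loss.

    checkpoints[k]  : list of state dicts saved within the p batches before and after
                      the k-th sampling point (assumed bookkeeping, not from the text)
    joint_losses[k] : joint loss recorded at the k-th sampling point
    """
    # Pick the two sampling points (the epoch-a and epoch-b models) with the smallest joint loss.
    a, b = sorted(range(len(joint_losses)), key=lambda k: joint_losses[k])[:2]
    first_average = average_state_dicts(checkpoints[a])    # model_avg[p×(m+1,m-1)]
    second_average = average_state_dicts(checkpoints[b])   # model_avg[p×(n+1,n-1)]
    # Final preset speech recognition model: mean of the two local averages.
    model.load_state_dict(average_state_dicts([first_average, second_average]))
    return model
```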
In this embodiment, model sampling is carried out on the initial large voice model at intervals of a preset number of batches, and the sampled current model is stored; a plurality of joint losses are obtained based on the sampling results, and the two joint losses with the smallest loss values are screened out from the plurality of joint losses, corresponding to an epoch-a model and an epoch-b model respectively; the average value of the sample points within the p samples before and after the corresponding sampling point is calculated respectively for the epoch-a model and the epoch-b model, so as to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)]; model averaging is then carried out based on the first average and the second average to obtain the preset voice recognition model; wherein m represents that the epoch-a model corresponds to the m-th sampling, n represents that the epoch-b model corresponds to the n-th sampling, and p represents the number of batches of samples corresponding to each sampling in the epoch-a model and the epoch-b model. Compared with existing voice recognition methods, the method in this embodiment obtains a preset voice recognition model with lower noise (i.e., randomness) by averaging the models with the smallest joint losses (i.e., the sampling points floating near the optimal point), thereby improving the recognition accuracy of the voice recognition in this embodiment.
Furthermore, an embodiment of the present invention also proposes a computer-readable storage medium, on which a speech recognition program is stored, which, when executed by a processor, implements the steps of the speech recognition method as described above.
Referring to fig. 5, fig. 5 is a block diagram showing the structure of a first embodiment of the speech recognition apparatus according to the present invention.
As shown in fig. 5, a voice recognition apparatus according to an embodiment of the present invention includes:
the voice processing module 501 is used for collecting original voice of a user and preprocessing the original voice of the user to obtain an acoustic feature sequence;
the model output module 502 is configured to input the acoustic feature sequence into a preset speech recognition model, so that the preset speech recognition model decodes the acoustic feature sequence to obtain a text sequence, where the preset speech recognition model is a model obtained by training an initial large speech model based on CTC loss and AED loss;
a speech recognition module 503, configured to complete speech recognition of the original speech of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
In this embodiment, the original voice of a user is collected and preprocessed to obtain an acoustic feature sequence; the acoustic feature sequence is input into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, the preset voice recognition model being a model obtained by training an initial large voice model based on CTC loss and AED loss; and voice recognition of the user's original voice is completed based on the text sequence. The CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights. Unlike the prior art, which performs voice recognition with a conventional voice recognition model, this embodiment inputs the acoustic feature sequence obtained by preprocessing the user's original voice into the preset voice recognition model, which is built by training based on the CTC loss and the AED loss, so as to obtain the text sequence and complete voice recognition; this solves the technical problem that existing voice recognition methods must rely on a large amount of training sample data and training time, so that voice recognition can be performed accurately even when the training sample data are few.
Based on the above-described first embodiment of the speech recognition device of the present invention, a second embodiment of the speech recognition device of the present invention is presented.
In this embodiment, the speech processing module 502 is further configured to screen training sample data from historical prior data, and perform data cleaning on the training sample data to obtain training sample data after data cleaning; dividing the training sample data after the data cleaning into a plurality of batches of samples based on batch size, wherein the batch size is the number size of the training sample data contained in the batch of samples; training the initial large voice model based on the batch of samples to obtain a preset voice recognition model.
Further, the speech processing module 502 is further configured to sequentially input the batch of samples into an initial large speech model for training, so as to obtain the CTC loss and the AED loss; and to obtain a joint loss from the CTC loss and the AED loss and perform model averaging based on the joint loss to obtain a preset voice recognition model; the calculation formula of the joint loss is: L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y); wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch of samples, y represents the label corresponding to the acoustic features, and λ represents the hyper-parameter that balances the CTC loss and the AED loss.
Further, the speech processing module 502 is further configured to input the batch of samples into an initial large speech model for training, obtaining the CTC loss by maximizing the probability of the correctly aligned label over all possible alignments during training, wherein the calculation formula of the CTC loss is: Loss_CTC=-log(ΣP(Y'|X,A)); wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, and Σ represents a summation operation.
Further, the speech processing module 502 is further configured to input the batch of samples into an initial large speech model for training, obtaining the AED loss by summing the label loss and the attention loss during training, wherein the calculation formula of the AED loss is: Loss_AED=Loss_Label+Loss_Attention; Loss_Label=-Σ(log(P(y_i|Y))); Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|; wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively.
Further, the speech processing module 502 is further configured to sample the initial large speech model at intervals of a preset batch, and store the sampled current model; obtaining a plurality of joint losses based on a sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model; and carrying out model average based on the epoch-a model and the epoch-b model to obtain a preset voice recognition model.
Further, the speech processing module 502 is further configured to calculate, for the epoch-a model and the epoch-b model respectively, the average value of the sample points within the p samples before and after the corresponding sampling point, so as to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)]; and to perform model averaging based on the first average and the second average to obtain a preset voice recognition model; wherein m represents that the epoch-a model corresponds to the m-th sampling, n represents that the epoch-b model corresponds to the n-th sampling, and p represents the number of batches of samples corresponding to each sampling in the epoch-a model and the epoch-b model.
Other embodiments or specific implementations of the speech recognition device of the present invention may refer to the above method embodiments, and are not described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of embodiments, it will be clear to a person skilled in the art that the above embodiment method may be implemented by means of software plus a necessary general hardware platform, but may of course also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a computer readable storage medium (e.g. read only memory/random access memory, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (4)

1. A method of speech recognition, the method comprising the steps of:
collecting original voice of a user, and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
inputting the acoustic feature sequence into a preset voice recognition model, so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, wherein the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
completing speech recognition of the original speech of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights;
the step of collecting the original voice of the user and preprocessing the original voice of the user to obtain an acoustic characteristic sequence is preceded by the following steps:
screening training sample data from the historical prior data, and performing data cleaning on the training sample data to obtain training sample data after data cleaning;
dividing the training sample data after the data cleaning into a plurality of batches of samples based on batch size, wherein the batch size is the number size of the training sample data contained in the batch of samples;
inputting the batch of samples into an initial large voice model for training, and obtaining the CTC loss by maximizing the probability of the correctly aligned labels over all possible alignments during training, wherein the calculation formula of the CTC loss is as follows:
Loss_CTC=-log(ΣP(Y’|X,A));
wherein Loss_CTC represents the CTC loss, X represents the given input acoustic features, Y' represents the correctly aligned label, A represents all possible alignments, and Σ represents a summation operation;
inputting the batch of samples into an initial large voice model for training, and summing the label loss and the attention loss in the training process to obtain AED loss, wherein the calculation formula of the AED loss is as follows:
Loss_AED=Loss_Label+Loss_Attention;
Loss_Label=-Σ(log(P(y_i|Y)));
Loss_Attention=λ*gradient_penalty+ε*|attention_weight-prior_weight|;
wherein Loss_AED represents the AED loss, Loss_Label represents the label loss, Loss_Attention represents the attention loss, y_i represents the i-th element in the label sequence Y, and P(y_i|Y) represents the probability that the initial large speech model generates y_i; gradient_penalty represents a gradient penalty term, |attention_weight-prior_weight| represents the difference between the actual attention weights and the preset (prior) attention weights, and λ and ε represent the hyper-parameters controlling the gradient penalty term and the attention weight difference, respectively;
obtaining joint loss according to the CTC loss and the AED loss, and carrying out model average based on the joint loss to obtain a preset voice recognition model;
the calculation formula of the joint loss is as follows:
L_combined(x,y)=λL_CTC(x,y)+(1-λ)L_AED(x,y);
wherein L_combined(x, y) represents the joint loss, L_CTC(x, y) represents the CTC loss, L_AED(x, y) represents the AED loss, x represents the acoustic features in the batch of samples, y represents the labels corresponding to the acoustic features, and λ represents the hyper-parameter balancing the CTC loss and the AED loss;
the step of obtaining a preset speech recognition model by performing model average based on the joint loss comprises the following steps:
model sampling is carried out on the initial large voice model at intervals of preset batch quantity, and the sampled current model is stored;
obtaining a plurality of joint losses based on a sampling result, and screening two joint losses with the minimum loss value from the plurality of joint losses to respectively correspond to the epoch-a model and the epoch-b model;
respectively calculating the average value of the sample points within the p samples before and after the corresponding sampling point in the epoch-a model and the epoch-b model to obtain a first average model_avg[p×(m+1,m-1)] and a second average model_avg[p×(n+1,n-1)];
Model average is carried out based on the first average value and the second average value, and a preset voice recognition model is obtained;
wherein m represents that the epoch-a model is subjected to an mth sampling, n represents that the epoch-b model is subjected to an nth sampling, and p represents the number of batch samples corresponding to each sampling in the epoch-a model and the epoch-b model.
2. A speech recognition apparatus based on the speech recognition method of claim 1, characterized in that the speech recognition apparatus comprises:
the voice processing module is used for collecting original voice of a user and preprocessing the original voice of the user to obtain an acoustic characteristic sequence;
the model output module is used for inputting the acoustic feature sequence into a preset voice recognition model so that the preset voice recognition model decodes the acoustic feature sequence to obtain a text sequence, and the preset voice recognition model is a model obtained by training an initial large voice model based on CTC loss and AED loss;
the voice recognition module is used for completing voice recognition of the original voice of the user based on the text sequence;
wherein the CTC loss is used for training tasks without aligned label sequences; the AED loss comprises a label loss and an attention loss, the label loss being used to measure the difference between the label sequence predicted by the initial large voice model during training and the ground-truth labels, and the attention loss being used to measure the difference between the attention weights generated by the initial large voice model during training and the expected attention weights.
3. A speech recognition device, the device comprising: a memory, a processor, and a speech recognition program stored on the memory and executable on the processor, the speech recognition program configured to implement the steps of the speech recognition method of claim 1.
4. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a speech recognition program which, when executed by a processor, implements the steps of the speech recognition method according to claim 1.
CN202310889848.3A 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium Active CN116631379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310889848.3A CN116631379B (en) 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310889848.3A CN116631379B (en) 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116631379A CN116631379A (en) 2023-08-22
CN116631379B (en) 2023-09-26

Family

ID=87621580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310889848.3A Active CN116631379B (en) 2023-07-20 2023-07-20 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116631379B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112652295A (en) * 2020-12-22 2021-04-13 平安国际智慧城市科技股份有限公司 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN113362812A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113870846A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Speech recognition method, device and storage medium based on artificial intelligence
CN114023316A (en) * 2021-11-04 2022-02-08 匀熵科技(无锡)有限公司 TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN114255744A (en) * 2021-12-15 2022-03-29 山东新一代信息产业技术研究院有限公司 Online end-to-end automatic voice recognition method
CN114420107A (en) * 2022-01-12 2022-04-29 平安科技(深圳)有限公司 Speech recognition method based on non-autoregressive model and related equipment
CN114882874A (en) * 2022-05-30 2022-08-09 平安科技(深圳)有限公司 End-to-end model training method and device, computer equipment and storage medium
CN115249479A (en) * 2022-01-24 2022-10-28 长江大学 BRNN-based power grid dispatching complex speech recognition method, system and terminal
US11580957B1 (en) * 2021-12-17 2023-02-14 Institute Of Automation, Chinese Academy Of Sciences Method for training speech recognition model, method and system for speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11756551B2 (en) * 2020-10-07 2023-09-12 Mitsubishi Electric Research Laboratories, Inc. System and method for producing metadata of an audio signal

Also Published As

Publication number Publication date
CN116631379A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
WO2019204547A1 (en) Systems and methods for automatic speech recognition using domain adaptation techniques
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
Cheng et al. A call-independent and automatic acoustic system for the individual recognition of animals: A novel model using four passerines
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN112289299B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109658921B (en) Voice signal processing method, equipment and computer readable storage medium
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN111554270B (en) Training sample screening method and electronic equipment
CN110930975A (en) Method and apparatus for outputting information
CN106448660B (en) It is a kind of introduce big data analysis natural language smeared out boundary determine method
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN113327596B (en) Training method of voice recognition model, voice recognition method and device
CN113838462A (en) Voice wake-up method and device, electronic equipment and computer readable storage medium
CN116631379B (en) Speech recognition method, device, equipment and storage medium
CN113160823A (en) Voice awakening method and device based on pulse neural network and electronic equipment
CN113345410A (en) Training method of general speech and target speech synthesis model and related device
JP5091202B2 (en) Identification method that can identify any language without using samples
Vinay et al. Dysfluent Speech Classification Using Variational Mode Decomposition and Complete Ensemble Empirical Mode Decomposition Techniques with NGCU based RNN
CN113889085B (en) Speech recognition method, apparatus, device, storage medium, and program product
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
Räsänen et al. A noise robust method for pattern discovery in quantized time series: the concept matrix approach

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant