CN113129867B - Training method of voice recognition model, voice recognition method, device and equipment


Info

Publication number: CN113129867B (granted); CN113129867A (application publication)
Application number: CN201911384482.4A
Authority: CN (China)
Prior art keywords: recognition model, information, training, audio, voice recognition
Other languages: Chinese (zh)
Inventor: 汪海涛
Assignee: China Mobile Communications Group Co Ltd; China Mobile Shanghai ICT Co Ltd
Application filed by China Mobile Communications Group Co Ltd and China Mobile Shanghai ICT Co Ltd
Priority to CN201911384482.4A
Legal status: Active (granted)

Classifications

    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/0635: Training; updating or merging of old and new templates; mean values; weighting


Abstract

The embodiments of the invention disclose a training method for a voice recognition model, a voice recognition method, a device and equipment. The training method comprises the following steps: determining a voice training sample according to the audio data of a target object, wherein the voice training sample comprises semantic information and audio feature information; and inputting the semantic information and the audio feature information into the voice recognition model and performing iterative training on the voice recognition model until a preset training condition is met, so as to obtain a trained target voice recognition model. This solves the problem in the prior art that voiceprint recognition accuracy is not high.

Description

Training method of voice recognition model, voice recognition method, device and equipment
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a training method of a voice recognition model, a voice recognition method, a device, terminal equipment and a storage medium.
Background
Voiceprint recognition (speaker recognition) identifies individuals by computer using physiological or behavioral characteristics inherent to the human body. Voiceprint recognition is divided into speaker identification and speaker verification: the former determines, from a speaker's voice, which one of a plurality of reference speakers is talking; the latter verifies whether a speaker's claimed identity is consistent with his or her voiceprint.
At present, in speaker recognition the detection process is imperfect: a dialogue is divided into several voice segments, and each segment may contain several voices, which reduces the accuracy of distinguishing a specific voice. In addition, to confirm the speech content associated with a target speaker, a large number of audio clips must be collected to find what was said before and after the target speaker's turn, so when the order of sentences is disturbed it cannot be confirmed whether the speaker's identity is consistent with his or her voiceprint.
Disclosure of Invention
The embodiments of the invention provide a training method for a voice recognition model, a voice recognition method, a device, a terminal device and a storage medium, so as to solve the problem of low voiceprint recognition accuracy in the related art.
In order to solve the technical problems, the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, where the method includes:
determining a voice training sample according to the audio data of the target object, wherein the voice training sample comprises semantic information and audio feature information;
Inputting semantic information and audio characteristic information into a voice recognition model, and performing iterative training on the voice recognition model until a preset training condition is met, so as to obtain a trained target voice recognition model.
In the embodiment of the invention, semantic information and audio feature information corresponding to the audio data are obtained by analyzing the audio data, and the voice recognition model is then trained on this semantic information and audio feature information. As a result, even when dialogue audio is divided into several fragments, the target object can be determined from the audio feature information and the identity feature of the target object can be recognized from the semantic information, so the target object can be accurately tracked in the dialogue audio. This improves the accuracy of recognizing the target object in dialogue audio, and, once the target object is recognized, its identity information is determined, from which the application scenario of the dialogue audio can be derived.
In a possible embodiment, the step of inputting the semantic information and the audio feature information into the speech recognition model and performing iterative training on the speech recognition model until a preset training condition is met to obtain the trained target speech recognition model may specifically include:
the following steps are respectively executed for each voice training sample: inputting the semantic information and the audio feature information into a voice recognition model to obtain a similarity prediction result of the semantic information and the audio feature information;
adjusting the voice recognition model according to each similarity prediction result;
and carrying out iterative training on the adjusted voice recognition model according to the voice training sample until a preset training condition is met, so as to obtain a trained target voice recognition model.
In another possible embodiment, the "speech recognition model" in the embodiment of the present invention may include a transcription network model, and based on this, in the step of determining a speech training sample according to the audio data of the target object, the method may specifically include:
Inputting the audio feature vector of the audio data into a transcription network model to obtain semantic information;
The semantic information is used to determine text data corresponding to the audio data.
In yet another possible embodiment, the "speech recognition model" in the embodiment of the present invention may include a predictive network model, and based on this, in the step of determining a speech training sample according to the audio data of the target object, the method may specifically include:
under the condition of training the voice recognition model for the first time, inputting a preset similarity prediction result into a prediction network model to obtain audio characteristic information;
Under the condition that the voice recognition model is subjected to the Nth training, a similarity prediction result output from the N-1 th training is input into a prediction network model, and the Nth audio characteristic information is obtained;
wherein N is an integer greater than 1, and the audio feature information is used to determine identity information of the target object.
In still another possible embodiment, the "speech recognition model" in the embodiment of the present invention may further include a joint network model, based on which, in the step of inputting semantic information and audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information, the method specifically may include:
Inputting semantic information and audio characteristic information into a joint network model to obtain hidden data comprising text information of audio data and identity information of a target object;
and inputting the hidden data into the classification model to obtain a similarity prediction result of the text information and the identity information.
In still another possible embodiment, the training method of the speech recognition model related to the foregoing may further include:
An audio feature vector is determined from the audio data of the target object by mel-frequency cepstrum coefficient MFCC.
The step of determining the audio feature vector according to the audio data of the target object through the mel-frequency cepstrum coefficient MFCC may specifically include:
Acquiring audio data of a target object;
carrying out framing treatment on the waveform diagram of the audio data to obtain at least one frame segment;
Performing Discrete Fourier Transform (DFT) on each frame segment in at least one frame segment to determine a power spectrum of each frame segment;
and carrying out data conversion on the power spectrum to obtain an audio feature vector.
In still another possible embodiment, before the step of performing the discrete fourier transform DFT on each frame segment of the at least one frame segment, the method may further include:
Each frame segment is smoothed by a hamming window.
In a second aspect, an embodiment of the present invention provides a method for speech recognition using a target speech recognition model, where the method may include:
acquiring target audio data;
Inputting the target audio data into a target voice recognition model to obtain dialogue information; wherein,
The dialogue information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
In the embodiment of the invention, the received target audio data is input into the trained voice recognition model, so that the target object in the target audio data and the identity information of the target object can be recognized. Because the voice recognition model trained as in the first aspect can accurately track the target object in the audio data, the accuracy of recognizing the target object in the audio is improved, and once the target object is recognized its identity information is determined, from which the application scenario of the dialogue audio can be derived.
In a possible embodiment, the step related to "obtaining the target audio data" may specifically include:
preprocessing the received audio data to obtain target audio data;
Wherein the preprocessing includes data cleaning and/or noise reduction.
In a third aspect, an embodiment of the present invention provides a training apparatus for a speech recognition model, where the apparatus may include:
the processing module is used for determining a voice training sample according to the audio data of the target object, wherein the voice training sample comprises semantic information and audio characteristic information;
The generation module inputs the semantic information and the audio characteristic information into the voice recognition model, and carries out iterative training on the voice recognition model until the preset training condition is met, so as to obtain a trained target voice recognition model.
In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus using a target speech recognition model, where the speech recognition model is trained by the method shown in the first aspect or the apparatus shown in the third aspect, and the apparatus includes:
the acquisition module is used for acquiring target audio data;
The processing module is used for inputting the target audio data into the target voice recognition model to obtain dialogue information; wherein,
The dialogue information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
In a fifth aspect, an embodiment of the present invention provides a terminal device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program when executed by the processor implements a method for training a speech recognition model as shown in any one of the first aspects, or speech recognition using the speech recognition model as shown in any one of the second aspects.
In a sixth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, if executed in a computer, causes the computer to perform the method of training a speech recognition model as set forth in any one of the first aspects or speech recognition using a speech recognition model as set forth in any one of the second aspects.
Drawings
The invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings in which like or similar reference characters designate like or similar features.
FIG. 1 is a schematic diagram of a training method of a speech recognition model and an implementation flow of the speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an implementation flow of a voice recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a training method of a speech recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a speech recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a transcriptional network model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a predictive network model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a joint network model according to an embodiment of the present invention;
FIG. 8 is a flowchart of a method for speech recognition according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a training device for a speech recognition model according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention;
Fig. 11 is a schematic hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Voiceprint recognition, a biometric technique, is the process of automatically determining whether a speaker belongs to an established speaker set, and who the speaker is, by analyzing and extracting features from a received speech signal. Voiceprint recognition is divided into speaker identification (Speaker Identification) and speaker verification (Speaker Verification); the former is a selection problem of deciding which of a plurality of reference speakers a voice belongs to, and the latter is a decision problem of confirming whether the identity of the speaker is consistent with the claimed identity. Voiceprint recognition in which the speaker must utter predetermined content is called text-dependent voiceprint recognition, while voiceprint recognition in which the speaker may say anything is called text-independent voiceprint recognition.
The main task of speaker recognition is to identify who said what; that is, the speaker classification task is a key step in the automatic understanding of human dialogue audio. For example, in a dialogue between a doctor and a patient, the word "Yes" spoken by the patient in response to the doctor's question ("Do you take your heart-disease medication regularly?") has a very different meaning from the same word spoken by the doctor.
Conventional speaker separation and speech recognition is largely divided into two parts, automatic speech recognition (automatic speech recognition, ASR) and speaker classification (speaker diarization, SD), respectively. Wherein, ASR results are words corresponding to the voice, SD results are speakers corresponding to the voice fragments. Combining these two results we can get "who said what". The following is a brief description of the specific implementation of these two processes.
Traditional speaker classification (SD) systems work in two steps: the first step detects changes in the sound spectrum to determine when a speaker switches; the second step identifies each speaker in the conversation. Traditional speaker classification systems rely on acoustic differences in voice to distinguish between different speakers in a conversation. Male and female voices are easily distinguished because their pitches differ greatly, so a simple acoustic model can separate them in a single step; speakers with similar pitches are distinguished in the following way:
First, based on the detected speech features, a change detection algorithm segments the dialogue into segments, each of which is expected to contain only one speaker. Next, a deep learning model maps the sound clips from each speaker into embedding vectors. In the last step, a clustering process groups these embeddings together so that the same speaker can be tracked throughout a conversation. In practice, the speaker classification system runs in parallel with an automatic speech recognition (ASR) system, and the outputs of the two systems are combined to label the recognized words. The automatic speech recognition system is mainly a pattern-matching method: during the training phase, the user speaks each word in the vocabulary in turn and its feature vector is stored as a template in the template library; in the recognition stage, the feature vector of the input voice is compared for similarity with each template in the template library, and the template with the highest similarity is output as the recognition result. This process is currently implemented with the Connectionist Temporal Classification (CTC) algorithm.
Although the above approach has many advantages in voiceprint recognition, there are also many limitations, as will be described in detail below:
First, the dialogue needs to be divided into segments, each containing only one person's voice; otherwise, the embedding cannot accurately characterize the speaker. However, the related algorithms are imperfect, so the segmented clips may still contain multiple voices.
Second, the number of speakers needs to be determined during the clustering process, and this stage is very sensitive to the accuracy of that input. In addition, a difficult trade-off must be made between the segment size used to estimate the speech features and the required model accuracy: the longer the segment, the higher the quality of the speech features, because the model has more speaker-dependent information, but long segments mean the model may attribute short interjections to the wrong speaker. This can have very serious consequences, for example in clinical or financial contexts where positive and negative answers must be tracked accurately.
Third, conventional speaker classification systems have no simple mechanism to exploit linguistic cues, which are particularly prominent in many natural conversations. For example, "How long have you been taking this medication?" is most likely said by the healthcare worker in a clinical dialogue scenario. Similarly, "When do we need to hand in the homework?" is more likely said by the student than by the teacher. Thus, current speech recognition approaches do not accurately analyze the speech content, so the semantics and context associated with the speech cannot be accurately recognized.
In summary, aiming at the problems in the related art, the embodiment of the invention provides a training method, a voice recognition method, a device, a terminal device and a storage medium for a voice recognition model, so as to solve the problem of low voice print recognition precision in the related art.
The embodiment of the invention provides a training method and a voice recognition method of a voice recognition model for researching automatic voice recognition and speaker distinction, wherein the overall flow of the two methods is shown in a figure 1 and is mainly divided into two parts: the process of building and training a speech recognition model (left part of fig. 1) and the process of performing speech recognition based on the trained model (right part of fig. 1).
Further, the establishment of the speech recognition model may mainly include the following steps:
(1) Collecting data, including data collected on mobile phones, computers and other devices and data downloaded from online public data sets, in formats such as WAVE, MPEG, MP3 and WMA;
(2) Data cleaning: because some of the data collected from devices is unclear, in an indistinct language, or distorted, part of the data is cleaned out so that a high-definition Chinese or English data set is retained;
(3) And (3) adding a label, wherein the data acquired in the step (1) are in an audio format, and no corresponding text and speaker label exist, so that the label needs to be added to prepare for training.
(4) Training a speech recognition model: a speech training sample is determined according to the audio data of the target object (for example, the audio collected in step (1), which in some scenarios carries the labels added in step (3)), wherein the speech training sample comprises semantic information and audio feature information;
Inputting semantic information and audio characteristic information into the voice recognition model, and performing iterative training on the voice recognition model until preset training conditions are met, so as to obtain a trained target voice recognition model.
The second part is speech recognition using a target speech recognition model, which may include:
(1) Collecting dialogs to be analyzed, and storing the dialogs as audio files;
(2) Data cleaning, which can denoise the audio file due to possible noise or other non-voice sounds in the acquisition process;
(3) The denoised audio is input into the target speech recognition model (e.g. the joint ASR+SD model in FIG. 2) to obtain the corresponding text and speaker information (e.g. spk1: word1, spk2: word2 word3, spk1: word4, etc. in FIG. 2).
This approach uses acoustic and linguistic information at the same time, giving the speaker recognition process the modeling capability of a language model. The model can work quite well when speakers have explicit roles, such as in typical scenarios of doctor-patient conversations, shopping, and so on.
Based on the above application scenario, the following describes the training method of the speech recognition model in detail.
Fig. 3 is a flowchart of a training method of a speech recognition model according to an embodiment of the present invention.
As shown in fig. 3, the training method of the speech recognition model may specifically include steps 310 to 330, which are specifically shown as follows:
Step 310: and determining a voice training sample according to the audio data of the target object, wherein the voice training sample comprises semantic information and audio characteristic information.
Here, in one possible embodiment, before performing step 310, the audio data needs to be converted into a format that can be recognized by the transcription network model and/or the prediction network model, and thus, the method may further include:
an audio feature vector is determined from the audio data of the target object by Mel frequency cepstral coefficients (Mel Frequency Cepstral Coefficient, MFCC).
This step is further described below:
(1) Audio data of the target object is acquired.
(2) And carrying out framing treatment on the waveform diagram of the audio data to obtain at least one frame segment.
For example, it is typical to take a width of 20-40 milliseconds (ms) as one frame. In the embodiment of the present invention, a width of 40 ms is taken as one frame, so for a 44.1 kHz sampled signal one frame contains 0.040 x 44100 = 1764 samples; the frame shift is 20 ms, allowing a 20 ms overlap between every two frames. Thus, the first frame runs from sample 1 to sample 1764, the second frame from sample 883 to sample 2646, and so on up to the last sample; if the audio length is not exactly divisible into frames, zeros are appended at the end. For a 15 s piece of audio data, 44100 x 15 / 882 = 750 frames are obtained.
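The following is a minimal NumPy sketch of this framing step. The 40 ms frame length and 20 ms shift match the sample counts in the example above; the function name frame_signal and the zero-padding strategy are illustrative assumptions, not details taken from the embodiment.

```python
import numpy as np

def frame_signal(samples: np.ndarray, sample_rate: int = 44100,
                 frame_ms: float = 40.0, shift_ms: float = 20.0) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames, zero-padding the tail."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 1764 samples at 44.1 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 882 samples -> 20 ms overlap
    n_frames = int(np.ceil(max(len(samples) - frame_len, 0) / frame_shift)) + 1
    padded_len = frame_len + (n_frames - 1) * frame_shift
    padded = np.pad(samples, (0, padded_len - len(samples)))  # append zeros at the end
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return padded[idx]  # shape: (n_frames, frame_len)

# 15 s of audio at 44.1 kHz yields roughly 750 frames, as in the example above.
```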
(3) A discrete fourier transform (Discrete Fourier Transform, DFT) is performed on each of the at least one frame segment to determine a power spectrum for each frame segment.
For a frame x(n) of length N, the DFT and the power spectrum of each frame segment can be determined by the following formulas (1) and (2):

X(k) = SUM_{n=0..N-1} x(n) [cos(2*pi*k*n/N) - j sin(2*pi*k*n/N)]    (1)

P(k) = |X(k)|^2 / N    (2)

The DFT is in fact two "correlation" operations: one correlates the audio data with a cos sequence of frequency k, the other with a sin sequence of frequency k, and the two are superimposed as the result of correlating with a sine wave of frequency k. If the obtained value is large, it indicates that the audio data contains a large amount of energy at frequency k.
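A hedged sketch of this step, computing the periodogram power spectrum of each (windowed) frame with the real FFT, a fast implementation of the DFT above. The FFT size of 2048 is an assumption for illustration; frames is the array produced by the framing sketch earlier.

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 2048) -> np.ndarray:
    """Periodogram estimate P(k) = |X(k)|^2 / N for each frame."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)   # X(k), complex-valued
    return (np.abs(spectrum) ** 2) / n_fft             # P(k), shape (n_frames, n_fft//2 + 1)
```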
(4) And carrying out data conversion on the power spectrum to obtain an audio feature vector.
For example, the conversion formula (3) between the linear frequency f used by the Mel-spaced filter bank and the Mel frequency m is:

M(f) = 1125 ln(1 + f/700)

M^-1(m) = 700 (exp(m/1125) - 1)    (3)

The Mel-spaced filter bank is a set of nonlinear filters, densely distributed in the low-frequency part and sparsely distributed in the high-frequency part; this distribution better matches the auditory characteristics of the human ear. Next, the logarithm of the 128-dimensional Mel power spectrum determined with equation (3) is taken, resulting in 128-dimensional log-Mel filter bank energies (i.e. the log of the energy in each Mel band computed from the power spectrum P(k) of step (3)). The logarithm is used because the human ear's perception of sound is not linear, so the nonlinear relationship represented by the logarithm is more accurate.
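A sketch of the Mel-spaced filter bank of formula (3) and the subsequent log step. The 128 filters match the 128-dimensional log-Mel energies mentioned above; the triangular filter shape, FFT size and frequency range are common practice and assumptions here, not details stated in the embodiment.

```python
import numpy as np

def hz_to_mel(f):  return 1125.0 * np.log(1.0 + f / 700.0)      # M(f)
def mel_to_hz(m):  return 700.0 * (np.exp(m / 1125.0) - 1.0)    # M^-1(m)

def log_mel_energies(power_spec: np.ndarray, sample_rate: int = 44100,
                     n_filters: int = 128, n_fft: int = 2048) -> np.ndarray:
    # Filter centres are equally spaced on the Mel scale: dense at low, sparse at high frequencies.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):                 # triangular filters
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)

    energies = power_spec @ fbank.T                   # (n_frames, n_filters)
    return np.log(energies + 1e-10)                   # 128-dim log-Mel filter bank energies
```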
Based on the above steps (1) - (4), sometimes in order to make the obtained audio feature vector more accurate, in a possible example, before the above step (3), it may further include:
Each frame segment is smoothed by a hamming window.
Here, the purpose of windowing is to smooth the signal, and smoothing with a hamming window reduces post-FFT side lobe size and spectral leakage compared to a rectangular window function.
In the embodiment of the present invention, the Hamming-window function used to window each frame of N samples is given by formula (4):

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)),  0 <= n <= N - 1    (4)
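A one-line sketch of formula (4) applied per frame before the DFT; the frame length of 1764 samples follows the earlier example, and np.hamming implements the same 0.54 - 0.46 cos coefficients.

```python
import numpy as np

frame_len = 1764                      # samples per 40 ms frame at 44.1 kHz
window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
# equivalently: window = np.hamming(frame_len)
# windowed_frames = frames * window   # broadcast over the (n_frames, frame_len) array
```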
Thus, a piece of audio data is converted into a set of audio feature vectors having a time sequence.
Based on this, the speech recognition model may, in one possible embodiment, here comprise at least one of the following sub-models: a transcription network model, a prediction network model, a joint network model.
Where the speech recognition model comprises a transcribing network model, this step 310 may specifically comprise:
Inputting the audio feature vector of the audio data into a transcription network model to obtain semantic information;
The semantic information is used to determine text data corresponding to the audio data.
And/or, where the speech recognition model comprises a predictive network model, this step 310 may specifically comprise:
under the condition of training the voice recognition model for the first time, inputting a preset similarity prediction result into a prediction network model to obtain audio characteristic information;
under the condition that the voice recognition model is subjected to the Nth training, inputting a similarity prediction result output from the N-1 training into a prediction network model to obtain the Nth audio characteristic information;
wherein N is an integer greater than 1, and the audio feature information is used to determine identity information of the target object.
It should be noted that the above two cases may hold at the same time; that is, when the speech recognition model includes both a transcription network model and a prediction network model, the above steps are used to determine the semantic information and the audio feature information respectively.
To further explain this step, the following may be exemplified:
The speech recognition model referred to in the embodiments of the present invention is built on the Recurrent Neural Network Transducer (RNN-T) model. Its main characteristic is that it seamlessly combines acoustic and linguistic cues, integrating speaker classification and speech recognition into the same system. Compared with single-task recognition systems, the integrated model does not noticeably reduce speech recognition performance, while greatly improving the speaker-distinguishing effect.
This integrated speech recognition model can be trained like a speech recognition system. The training reference data includes a transcription of the speaker's voice and a tag that distinguishes the speaker, for example: "When do we hand in the homework?" <student>, "I want you to hand it in before tomorrow's lesson" <teacher>. Once the model has been trained on audio and the corresponding reference transcript examples, the user can input more dialogue recordings and obtain output in a similar form.
Step 320: inputting semantic information and audio characteristic information into the voice recognition model, and performing iterative training on the voice recognition model until preset training conditions are met, so as to obtain a trained target voice recognition model.
Wherein the following steps are performed for each speech training sample: inputting the semantic information and the audio feature information into a voice recognition model to obtain a similarity prediction result of the semantic information and the audio feature information; adjusting the voice recognition model according to each similarity prediction result; and carrying out iterative training on the adjusted voice recognition model according to the voice training sample until a preset training condition is met, so as to obtain a trained target voice recognition model.
Here, based on the possibilities in step 310, when the speech recognition model further includes a joint network model, step 320 may specifically include:
Inputting semantic information and audio characteristic information into a joint network model to obtain hidden data comprising text information of audio data and identity information of a target object;
and inputting the hidden data into the classification model to obtain a similarity prediction result of the text information and the identity information.
Thus, to further explain how to obtain a similarity prediction of semantic information and audio feature information, the following illustrates the steps in connection with the two sub-models involved in step 310:
As shown in fig. 4, when the speech recognition model (i.e. the concrete implementation of the joint ASR+SD model in fig. 2) includes a transcription network model (Transcription Network), a prediction network model (Prediction Network) and a joint network model (Joint Network), each sub-model is introduced separately below.
(1) Transcription network model
The transcription network model, also called the encoder, receives the audio feature vectors processed in step 310 as input and, after passing through the neural network, outputs an intermediate variable. This variable contains the semantic information of the audio data, which can later be used to train the text information corresponding to the voice, i.e. to determine the text data corresponding to the audio data.
(2) Predictive network model
The main function of the prediction network model is to obtain the characteristics of the speaker. It receives the output of the joint network model at the previous step as input and, after passing through the neural network layers, outputs an intermediate variable. This variable contains the audio feature information corresponding to each section of voice, i.e. the speaker information, and can be used to train the speaker information corresponding to the voice.
Here, it should be noted that when there is no output from a previous joint network step, i.e. when training is performed for the first time, a preset similarity prediction result is input into the prediction network model to obtain the audio feature information; when the voice recognition model undergoes the N-th training, the similarity prediction result output by the (N-1)-th training is input into the prediction network model to obtain the N-th audio feature information, where N is an integer greater than 1 and the audio feature information is used to determine the identity information of the target object.
(3) Combined network model
The joint network model receives the output results of the transcription network model and the prediction network model as input, combines them, passes them through the neural network layers to obtain the similarity prediction results corresponding to the labels, and feeds the similarity prediction results back into the prediction network model. This is the feedback loop in the model: previously recognized words are fed back as input, which allows the RNN-T model to integrate linguistic cues, such as the end of a question, and is the core reason why the model is able to distinguish speakers. To obtain the final corresponding text and speaker, in the embodiment of the invention the label with the highest probability can be selected directly, or the label group with the highest global probability over all time periods can be selected.
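To make the feedback loop concrete, here is a minimal greedy-decoding sketch of how the three sub-models interact at inference time. The callables transcribe, predict and joint are hypothetical stand-ins for the three networks, and the blank-skipping behaviour follows the standard RNN-T decoding scheme rather than any detail stated in the embodiment.

```python
import torch

def greedy_rnnt_decode(features, transcribe, predict, joint, blank_id=0, max_symbols=10):
    """Greedy RNN-T decoding: the last emitted label is fed back into the prediction network."""
    enc = transcribe(features)                # (T, enc_dim) encoder outputs
    outputs, last_label, pred_state = [], None, None
    for t in range(enc.size(0)):
        for _ in range(max_symbols):          # allow several labels per frame
            pred, pred_state = predict(last_label, pred_state)   # feedback input
            logits = joint(enc[t], pred)      # combine acoustic and label context
            label = int(torch.argmax(logits))
            if label == blank_id:             # blank: advance to the next frame
                break
            outputs.append(label)             # a word or speaker token, e.g. "<spk:dr>"
            last_label = label                # fed back at the next step
    return outputs
```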
Further, to better illustrate how the speech recognition model is trained in the embodiments of the present invention, a specific example is given as follows:
As shown in fig. 4, the input of the transcription network model is represented by the symbol sequence X = [x1, x2, ..., xT], where T is the number of symbols in the sequence (corresponding to the number of audio frames) and xt is the d-dimensional feature obtained by the Mel filter, with d equal to 80. The corresponding prediction network model is represented by the symbol sequence Y = [y1, y2, ..., yU], which includes the speech recognition result and the speaker labels, where yu belongs to Omega, the full output space of the RNN-T network. The trained kernel function takes the usual RNN-T form, summing over all alignments A(X, Y) of the label sequence with the audio frames, as shown by equation (5):

P(Y | X) = SUM_{Y' in A(X, Y)} PROD_i P(y'_i | x1, ..., x_ti, y1, ..., y_(ui-1))    (5)
based on this, three main transcription network models, prediction network models, and joint network models in the speech recognition model are described in detail below, respectively.
(1) Transcription Network
The transcription network takes the audio feature vectors, of dimension 80, as input. For ease of training, long audio is divided into segments of at most 15 seconds, each of which may contain speech from multiple persons. Since longer units are more suitable for speech recognition, the time resolution of the output sequence can be reduced, improving the efficiency of training and inference. For this reason, in the embodiment of the present invention a hierarchy of Time Delay Neural Network (TDNN) layers is used to reduce the time resolution from 10 ms to 80 ms. This architecture is very similar to the encoder used for CTC word models, and the reduction increases the speed of inference and lowers the recognition error rate.
Specifically, the Transcription Network model consists of three identical blocks, each made up of four layers, as shown in FIG. 5:
(1) a one-dimensional temporal convolution layer with 512 filters and a kernel size of 5, followed by a max-pooling operator of size 2; (2) three bidirectional long short-term memory (LSTM) layers with 512 cells each. The Transcription Network model is trained using the stochastic-gradient-based ADAM optimizer.
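A hedged PyTorch sketch of one such block (1-D convolution with 512 filters and kernel size 5, max pooling of size 2, then three bidirectional LSTM layers with 512 units). Stacking three copies and the exact input/output plumbing are assumptions for illustration, not taken from FIG. 5.

```python
import torch
import torch.nn as nn

class TranscriptionBlock(nn.Module):
    def __init__(self, in_dim: int = 80, conv_filters: int = 512, lstm_units: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, conv_filters, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=2)          # halves the time resolution
        self.lstm = nn.LSTM(conv_filters, lstm_units, num_layers=3,
                            bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, in_dim)
        x = self.pool(self.conv(x.transpose(1, 2)))      # (batch, conv_filters, time/2)
        out, _ = self.lstm(x.transpose(1, 2))            # (batch, time/2, 2*lstm_units)
        return out

blocks = nn.ModuleList([
    TranscriptionBlock(in_dim=80),       # first block takes the 80-dim log-Mel features
    TranscriptionBlock(in_dim=1024),     # later blocks take the 2*512-dim BiLSTM output
    TranscriptionBlock(in_dim=1024),
])
# Three pooling steps reduce the frame rate by 2**3 = 8, consistent with 10 ms -> 80 ms.
```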
(2) Prediction Network
The Prediction Network model receives the previous result y_{u-1} as input. It first passes through a word embedding layer, which maps a morpheme vocabulary of 4096 units into a 512-dimensional vector space; the output of that space is then fed into an LSTM layer with 1024 units; and finally into a fully connected layer with 512 units. The process can be expressed as formula (6):

p_u = FullyConnected(LSTM(Embedding(y_{u-1})))    (6)
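A sketch of the Prediction Network as described (4096-token embedding into a 512-dimensional space, one LSTM layer with 1024 units, a fully connected layer with 512 units). Treating y_{u-1} as an integer token id is an assumption for illustration.

```python
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    def __init__(self, vocab_size: int = 4096, embed_dim: int = 512,
                 lstm_units: int = 1024, out_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # morpheme vocabulary -> 512-dim space
        self.lstm = nn.LSTM(embed_dim, lstm_units, batch_first=True)
        self.fc = nn.Linear(lstm_units, out_dim)

    def forward(self, y_prev: torch.Tensor, state=None):
        # y_prev: (batch, 1) id of the previously emitted label y_{u-1}
        emb = self.embed(y_prev)
        out, state = self.lstm(emb, state)
        return self.fc(out), state                        # (batch, 1, 512) and the new LSTM state
```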
A single layer LSTM network can be represented by fig. 6, mainly comprising the following parts:
Forget gate of LSTM
The forget gate controls whether to forget; that is, in the LSTM it decides, with a certain probability, whether to forget the cell state passed from the previous time step.
LSTM input gate
The next step is to decide how much new information to let into the cell state. This involves two processes: first, a sigmoid layer called the input gate layer decides which information needs to be updated; then a tanh layer generates a vector of candidate content for the update.
Cell status update for LSTM
Before looking at the LSTM output gate, we first look at the LSTM cell state. The results of both the forget gate and the input gate act on the cell state C(t); here we see how C(t) is obtained from the previous cell state C(t-1).
LSTM output gate
With the new cell state C(t), the update of the hidden state h(t) consists of two parts: the first part is o(t), obtained from the hidden state h(t-1) of the previous step, the current input x(t) and the sigmoid activation function; the second part comes from the cell state C(t) passed through the tanh activation function.
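For reference, the four parts above correspond to the standard single-layer LSTM update; this is the textbook formulation, not a formula reproduced from the patent:

f(t) = sigmoid(W_f [h(t-1), x(t)] + b_f)        (forget gate)
i(t) = sigmoid(W_i [h(t-1), x(t)] + b_i)        (input gate)
C~(t) = tanh(W_C [h(t-1), x(t)] + b_C)          (candidate cell state)
C(t) = f(t) * C(t-1) + i(t) * C~(t)             (cell state update)
o(t) = sigmoid(W_o [h(t-1), x(t)] + b_o)        (output gate)
h(t) = o(t) * tanh(C(t))                        (hidden state)

where * denotes element-wise multiplication and [h(t-1), x(t)] denotes concatenation of the previous hidden state with the current input.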
(3) Joint Network
As shown in fig. 7, the Joint Network model takes the combination of the Transcription Network and Prediction Network outputs as input, feeds it into a fully connected neural network layer with 512 hidden units, and then passes the result into a softmax layer with 4096 units, yielding the final results y1, y2 and y3. The values of the output layer, i.e. the labels to be trained, are set as combinations of words and speakers, which can be implemented in the following manner (a hedged code sketch of this structure is given after the label examples below):
hello dr jekyll<spk:pt>
hello mr hyde what brings you here today<spk:dr>
I am struggling again with my bipolar disorder<spk:pt>
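A minimal PyTorch sketch of the Joint Network as described (combining the two sub-model outputs, a fully connected layer with 512 hidden units, then a softmax over the 4096-unit output layer of word and speaker labels). Combining by concatenation, the tanh nonlinearity and the default dimensions are illustrative assumptions, since the text only states that the two outputs are combined.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, enc_dim: int = 512, pred_dim: int = 512,
                 hidden: int = 512, vocab_size: int = 4096):
        super().__init__()
        self.fc = nn.Linear(enc_dim + pred_dim, hidden)   # fully connected layer, 512 hidden units
        self.out = nn.Linear(hidden, vocab_size)          # 4096-unit output layer

    def forward(self, enc_t: torch.Tensor, pred_u: torch.Tensor) -> torch.Tensor:
        h = torch.tanh(self.fc(torch.cat([enc_t, pred_u], dim=-1)))
        return torch.log_softmax(self.out(h), dim=-1)     # distribution over word+speaker labels
```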
Here, it should be noted that in the embodiment of the present invention the preset training condition may be considered satisfied when the number of iterations reaches a preset threshold (i.e. the maximum allowed number of iterations), or when, during the iterations, the accuracy between the similarity prediction result and the ground-truth value reaches a preset threshold.
Therefore, in the embodiment of the invention, the semantic information and the audio feature information corresponding to the audio data are obtained by analyzing the audio data, and the voice recognition model is then trained on them. As a result, even when dialogue audio is divided into several fragments, the target object can be determined from the audio feature information and the identity feature of the target object can be recognized from the semantic information, so the target object can be accurately tracked in the dialogue audio. This improves the accuracy of recognizing the target object in dialogue audio, and, once the target object is recognized, its identity information is determined, from which the application scenario of the dialogue audio can be derived.
In summary, the embodiment of the invention studies the speaker-distinguishing process in combination with linguistic information, making full use of the known information and improving recognition accuracy. In addition, since the above method does not require forced alignment, the text sequence itself can be used for learning and training. Because the method is based on the RNN-T model, decoding is accelerated: the output contains a large number of blanks, so the model can skip frames during decoding, which greatly speeds up the decoding process. The method is also monotonic, so real-time online decoding is possible, broadening the range of application scenarios.
In addition, the embodiment of the invention also provides a voice recognition method based on the trained voice recognition model.
Fig. 8 is a flowchart of a voice recognition method according to an embodiment of the present invention.
As shown in fig. 8, the method specifically may include:
Step 810, obtaining target audio data.
Here, in one possible embodiment, the received audio data is preprocessed to obtain the target audio data;
Wherein the preprocessing includes data cleaning and/or noise reduction.
Step 820, inputting the target audio data into the target speech recognition model determined in the above step 320 to obtain dialogue information; wherein,
The dialogue information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
Based on the two processes, the embodiment of the invention also provides two devices, namely a training device of a voice recognition model and a voice recognition device, which are specifically shown as follows.
Fig. 9 is a schematic structural diagram of a training device for a speech recognition model according to an embodiment of the present invention.
As shown in fig. 9, the training device 90 for a speech recognition model may specifically include:
the processing module 901 is configured to determine a speech training sample according to audio data of a target object, where the speech training sample includes semantic information and audio feature information;
The generating module 902 inputs the semantic information and the audio feature information into the speech recognition model, and performs iterative training on the speech recognition model until a preset training condition is met, so as to obtain a trained target speech recognition model.
The generating module 902 may specifically be configured to perform the following steps for each voice training sample: inputting the semantic information and the audio feature information into a voice recognition model to obtain a similarity prediction result of the semantic information and the audio feature information; adjusting the voice recognition model according to each similarity prediction result; and carrying out iterative training on the adjusted voice recognition model according to the voice training sample until a preset training condition is met, so as to obtain a trained target voice recognition model.
In one possible embodiment, the speech recognition model comprises a transcription network model. Based on this, the processing module 901 in the embodiment of the present invention may specifically include:
Inputting the audio feature vector of the audio data into a transcription network model to obtain semantic information;
The semantic information is used to determine text data corresponding to the audio data.
In another possible embodiment, the speech recognition model comprises a predictive network model; based on this, the generating module 902 in the embodiment of the present invention inputs the preset similarity prediction result into the prediction network model to obtain the audio feature information under the condition of performing the first training on the speech recognition model;
Under the condition that the voice recognition model is subjected to the Nth training, inputting a similarity prediction result output from the N-1 th training into a prediction network model to obtain the N-th audio characteristic information;
wherein N is an integer greater than 1, and the audio feature information is used to determine identity information of the target object.
In yet another possible embodiment, the speech recognition model further comprises a federated network model; the generating module 902 in the embodiment of the present invention may be specifically configured to input semantic information and audio feature information into a joint network model to obtain hidden data including text information of audio data and identity information of a target object;
and inputting the hidden data into the classification model to obtain a similarity prediction result of the text information and the identity information.
In addition, the training device 90 of the speech recognition model may further include a determining module 904 for determining an audio feature vector according to the audio data of the target object by mel-frequency cepstrum coefficient MFCC.
In one possible embodiment, the determining module 904 may be specifically configured to obtain audio data of the target object; carrying out framing treatment on the waveform diagram of the audio data to obtain at least one frame segment; performing Discrete Fourier Transform (DFT) on each frame segment in at least one frame segment to determine a power spectrum of each frame segment; and carrying out data conversion on the power spectrum to obtain an audio feature vector.
Based on this, the training apparatus 90 of the speech recognition model may further comprise a transformation module 905 for smoothing each frame segment through a hamming window.
Therefore, in the embodiment of the invention, the semantic information and the audio feature information corresponding to the audio data are obtained by analyzing the audio data, and the voice recognition model is then trained on them. As a result, even when dialogue audio is divided into several fragments, the target object can be determined from the audio feature information and the identity feature of the target object can be recognized from the semantic information, so the target object can be accurately tracked in the dialogue audio. This improves the accuracy of recognizing the target object in dialogue audio, and, once the target object is recognized, its identity information is determined, from which the application scenario of the dialogue audio can be derived.
In addition, fig. 10 is a schematic diagram of a voice recognition device using a target voice recognition model according to an embodiment of the present invention.
As shown in fig. 10, the voice recognition apparatus 100 may specifically include:
an acquisition module 1001, configured to acquire target audio data;
the processing module 1002 is configured to input target audio data into a target speech recognition model, to obtain dialogue information; wherein,
The dialogue information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
The obtaining module 1001 may be specifically configured to perform preprocessing on the received audio data to obtain target audio data; wherein the preprocessing includes data cleaning and/or noise reduction.
Fig. 11 is a schematic hardware structure of a terminal device according to an embodiment of the present invention.
The terminal device 1100 includes, but is not limited to: radio frequency unit 1101, network module 1102, audio output unit 1103, input unit 1104, sensor 1105, display unit 1106, user input unit 1107, interface unit 1108, memory 1109, processor 1110, and power supply 1111. It will be appreciated by those skilled in the art that the terminal device structure shown in fig. 11 does not constitute a limitation of the terminal device, and the terminal device may comprise more or less components than shown, or may combine certain components, or may have a different arrangement of components. In the embodiment of the invention, the terminal equipment comprises, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 1101 may be used for receiving and transmitting signals during the process of receiving and transmitting information or communication, specifically, receiving downlink resources from a base station and then processing the downlink resources by the processor 1110; in addition, uplink resources are transmitted to the base station. Typically, the radio frequency unit 1101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 1101 may also communicate with networks and other devices through a wireless communication system.
The terminal device provides wireless broadband internet access to the user through the network module 1102, such as helping the user to send and receive e-mail, browse web pages, access streaming media, etc.
The audio output unit 1103 may convert audio resources received by the radio frequency unit 1101 or the network module 1102 or stored in the memory 1109 into audio signals and output as sound. Also, the audio output unit 1103 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the terminal device 1100. The audio output unit 1103 includes a speaker, a buzzer, a receiver, and the like.
The input unit 1104 is used for receiving an audio or video signal. The input unit 1104 may include a graphics processor (Graphics Processing Unit, GPU) 11041 and a microphone 11042. The graphics processor 11041 processes image resources of still pictures or video obtained by an image capturing device (e.g. a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 1106. The image frames processed by the graphics processor 11041 may be stored in the memory 1109 (or other storage medium) or transmitted via the radio frequency unit 1101 or the network module 1102. The microphone 11042 may receive sound and process it into audio resources. In a phone-call mode, the processed audio resources may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 1101 and output.
Terminal device 1100 also includes at least one sensor 1105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 11061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 11061 and/or the backlight when the terminal device 1100 moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when the accelerometer sensor is stationary, and can be used for recognizing the gesture (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking) and the like of the terminal equipment; the sensor 1105 may further include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described herein.
The display unit 1106 is used to display information input by the user or information provided to the user. The display unit 1106 may include a display panel 11061, and the display panel 11061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 1107 is operable to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal device. Specifically, the user input unit 1107 includes a touch panel 11071 and other input devices 11072. The touch panel 11071, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch panel 11071 or thereabout using any suitable object or accessory such as a finger, stylus, etc.). The touch panel 11071 may include two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and sends the touch point coordinates to the processor 1110, and receives and executes commands sent from the processor 1110. In addition, the touch panel 11071 may be implemented in various types of resistive, capacitive, infrared, surface acoustic wave, and the like. The user input unit 1107 may include other input devices 11072 in addition to the touch panel 11071. In particular, other input devices 11072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 11071 may be overlaid on the display panel 11061. When the touch panel 11071 detects a touch operation on or near it, the touch operation is transmitted to the processor 1110 to determine the type of touch event, and the processor 1110 then provides a corresponding visual output on the display panel 11061 according to the type of touch event. Although in fig. 11 the touch panel 11071 and the display panel 11061 are two independent components implementing the input and output functions of the terminal device, in some embodiments the touch panel 11071 may be integrated with the display panel 11061 to implement the input and output functions of the terminal device, which is not limited herein.
The interface unit 1108 is an interface for connecting an external device to the terminal apparatus 1100. For example, the external devices may include wired or wireless headset ports, external power (or battery charger) ports, wired or wireless data ports, memory card ports, ports for connecting devices having identification modules, audio input/output (I/O) ports, video I/O ports, earphone ports, and the like. The interface unit 1108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the terminal apparatus 1100, or may be used to transmit data between the terminal apparatus 1100 and an external device.
The memory 1109 may be used to store software programs and various data. The memory 1109 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. In addition, the memory 1109 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 1110 is the control center of the terminal device. It connects the various parts of the entire terminal device using various interfaces and lines, and performs the various functions of the terminal device and processes data by running or executing the software programs and/or modules stored in the memory 1109 and invoking the data stored in the memory 1109, thereby monitoring the terminal device as a whole. The processor 1110 may include one or more processing units; preferably, the processor 1110 may integrate an application processor, which primarily handles the operating system, user interface, applications, and the like, and a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1110.
The terminal device 1100 may further include a power supply 1111 (e.g., a battery) for supplying power to the respective components. The power supply 1111 may be logically connected to the processor 1110 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.
In addition, the terminal device 1100 includes some functional modules, which are not shown, and will not be described herein.
The embodiment of the present invention also provides a computer readable storage medium on which a computer program is stored, and the computer program, when executed by a computer, causes the computer to perform the steps of the training method for a speech recognition model or the speech recognition method of the embodiments of the present invention.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises that element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general hardware platform, and of course may also be implemented by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many forms may be made by those of ordinary skill in the art without departing from the spirit of the present invention and the scope of protection of the claims, and all such forms fall within the protection of the present invention.

Claims (11)

1. A method of training a speech recognition model, the method comprising:
determining a speech training sample according to audio data of a target object, wherein the speech training sample comprises semantic information and audio feature information;
inputting the semantic information and the audio feature information into a speech recognition model, and performing iterative training on the speech recognition model until a preset training condition is met, to obtain a trained target speech recognition model;
wherein the inputting the semantic information and the audio feature information into the speech recognition model, and performing iterative training on the speech recognition model until the preset training condition is met, to obtain the trained target speech recognition model, comprises:
performing the following for each speech training sample: inputting the semantic information and the audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information;
adjusting the speech recognition model according to each similarity prediction result;
performing iterative training on the adjusted speech recognition model according to the speech training samples until the preset training condition is met, to obtain the trained target speech recognition model;
wherein the speech recognition model comprises a prediction network model, and the determining a speech training sample according to the audio data of the target object comprises:
in the case of training the speech recognition model for the first time, inputting a preset similarity prediction result into the prediction network model to obtain the audio feature information;
in the case of an Nth training of the speech recognition model, inputting the similarity prediction result output from the (N-1)th training into the prediction network model to obtain Nth audio feature information;
wherein N is an integer greater than 1, and the audio feature information is used to determine identity information of the target object.
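For illustration only, the feedback loop recited above, in which the similarity prediction result of one training round is fed into the prediction network model to produce the audio feature information for the next round, may be sketched in Python as follows. The module shapes, the binary cross-entropy loss, the fixed number of rounds standing in for the preset training condition, and names such as PredictionNet and RecognitionModel are assumptions made for the sketch, not details taken from the claims.

import torch
import torch.nn as nn

class PredictionNet(nn.Module):
    # Maps the previous similarity prediction result to audio feature information.
    def __init__(self, pred_dim=1, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pred_dim, feat_dim), nn.Tanh())

    def forward(self, prev_similarity):
        return self.net(prev_similarity)

class RecognitionModel(nn.Module):
    # Scores the similarity of semantic information and audio feature information.
    def __init__(self, sem_dim=128, feat_dim=64):
        super().__init__()
        self.prediction_net = PredictionNet(feat_dim=feat_dim)
        self.scorer = nn.Sequential(
            nn.Linear(sem_dim + feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, semantic_info, prev_similarity):
        audio_feat = self.prediction_net(prev_similarity)  # feedback from the previous round
        return self.scorer(torch.cat([semantic_info, audio_feat], dim=-1))

def train(model, semantic_batch, labels, rounds=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    prev = torch.zeros(semantic_batch.size(0), 1)    # preset result for the first training
    for _ in range(rounds):                          # stands in for the preset training condition
        pred = model(semantic_batch, prev)           # similarity prediction result
        loss = loss_fn(pred, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()                                   # adjust the speech recognition model
        prev = pred.detach()                         # (N-1)th result feeds the Nth training
    return model

model = train(RecognitionModel(), torch.randn(8, 128), torch.randint(0, 2, (8, 1)).float())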
2. The method of claim 1, wherein the speech recognition model comprises a transcription network model; the determining a speech training sample according to the audio data of the target object comprises:
inputting an audio feature vector of the audio data into the transcription network model to obtain the semantic information;
wherein the semantic information is used for determining text data corresponding to the audio data.
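As an illustrative sketch of the transcription-network step recited in claim 2, the following Python code maps audio feature vectors to semantic information; the single-layer LSTM encoder and the dimensions are assumptions, since the claim does not fix a particular architecture.

import torch
import torch.nn as nn

class TranscriptionNetwork(nn.Module):
    # Maps a sequence of audio feature vectors (batch, time, feat_dim)
    # to semantic information (batch, time, sem_dim).
    def __init__(self, feat_dim=13, sem_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, sem_dim, batch_first=True)

    def forward(self, audio_feature_vectors):
        semantic_info, _ = self.encoder(audio_feature_vectors)
        return semantic_info

semantic = TranscriptionNetwork()(torch.randn(2, 50, 13))  # toy batch of MFCC-like frames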
3. The method of claim 1, wherein the speech recognition model further comprises a joint network model;
wherein the inputting the semantic information and the audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information comprises:
inputting the semantic information and the audio feature information into the joint network model to obtain hidden data comprising text information of the audio data and identity information of the target object;
and inputting the hidden data into a classification model to obtain a similarity prediction result of the text information and the identity information.
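The joint-network step of claim 3 can be illustrated, under assumed dimensions and module names (JointNetwork, SimilarityClassifier), by the following Python sketch, which fuses semantic information with audio feature information into hidden data and then classifies the hidden data into a similarity prediction; it sketches the general idea rather than the claimed implementation.

import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    # Fuses semantic information and audio feature information into hidden data
    # that carries both text information and speaker identity information.
    def __init__(self, sem_dim=128, feat_dim=64, hidden_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(sem_dim + feat_dim, hidden_dim), nn.Tanh())

    def forward(self, semantic_info, audio_feature_info):
        return self.fuse(torch.cat([semantic_info, audio_feature_info], dim=-1))

class SimilarityClassifier(nn.Module):
    # Maps the hidden data to a similarity prediction for the text/identity pair.
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, hidden_data):
        return self.head(hidden_data)

hidden = JointNetwork()(torch.randn(4, 128), torch.randn(4, 64))
similarity = SimilarityClassifier()(hidden)  # one similarity score per sample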
4. The method according to claim 2, wherein the method further comprises:
determining the audio feature vector according to the audio data of the target object by using mel-frequency cepstral coefficients (MFCC).
5. The method of claim 4, wherein the determining the audio feature vector according to the audio data of the target object by using mel-frequency cepstral coefficients (MFCC) comprises:
acquiring the audio data of the target object;
performing framing processing on the waveform of the audio data to obtain at least one frame segment;
performing a discrete Fourier transform (DFT) on each frame segment of the at least one frame segment to determine a power spectrum of each frame segment;
and performing data conversion on the power spectrum to obtain the audio feature vector.
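The framing, DFT/power-spectrum, and data-conversion steps recited in claim 5 may be illustrated by the following Python sketch. The frame length, hop size, window, and the log-plus-DCT conversion are illustrative assumptions; a full MFCC pipeline would additionally apply a mel filterbank before the logarithm, which is omitted here for brevity.

import numpy as np
from scipy.fft import dct

def frame_signal(audio, frame_len=400, hop=160):
    # Framing: split the waveform into overlapping frame segments.
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    return np.stack([audio[i * hop:i * hop + frame_len] for i in range(n_frames)])

def power_spectrum(frames, n_fft=512):
    # DFT of each windowed frame segment, then its power spectrum.
    spectrum = np.fft.rfft(frames * np.hamming(frames.shape[1]), n=n_fft, axis=1)
    return (np.abs(spectrum) ** 2) / n_fft

def audio_feature_vector(audio, n_coeffs=13):
    # "Data conversion" step: log compression plus a DCT per frame, then averaging
    # over frames to obtain one feature vector per utterance (an assumption).
    power = power_spectrum(frame_signal(audio))
    coeffs = dct(np.log(power + 1e-10), type=2, axis=1, norm='ortho')[:, :n_coeffs]
    return coeffs.mean(axis=0)

features = audio_feature_vector(np.random.randn(16000))  # about 1 s of fake 16 kHz audio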
6. A speech recognition method using a target speech recognition model, the target speech recognition model being trained by the method of any one of claims 1-5, the method comprising:
acquiring target audio data;
inputting the target audio data into the target speech recognition model to obtain dialogue information; wherein
the dialogue information comprises: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
7. The method of claim 6, wherein the obtaining the target audio data comprises:
preprocessing the received audio data to obtain the target audio data;
wherein the preprocessing includes data cleaning and/or noise reduction.
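The inference flow of claims 6 and 7, preprocessing the received audio and then querying the target speech recognition model for dialogue information, can be illustrated by the following Python sketch; the simple noise gate used as a stand-in for noise reduction and the target_model interface returning a (text, speaker) pair are assumptions, not the claimed design.

import numpy as np

def preprocess(audio, gate=0.01):
    # Data cleaning plus a crude noise gate standing in for noise reduction.
    audio = np.nan_to_num(np.asarray(audio, dtype=np.float32))
    audio[np.abs(audio) < gate] = 0.0
    return audio

def recognize(target_model, raw_audio):
    target_audio = preprocess(raw_audio)
    text, speaker_id = target_model(target_audio)      # assumed model interface
    return {"text": text, "speaker": speaker_id}       # dialogue information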
8. A training device for a speech recognition model, the device comprising:
a processing module, used for determining a speech training sample according to audio data of a target object, wherein the speech training sample comprises semantic information and audio feature information;
a generation module, used for inputting the semantic information and the audio feature information into a speech recognition model, and performing iterative training on the speech recognition model until a preset training condition is met, to obtain a trained target speech recognition model;
wherein the generation module is specifically used for:
performing the following for each speech training sample: inputting the semantic information and the audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information;
adjusting the speech recognition model according to each similarity prediction result;
performing iterative training on the adjusted speech recognition model according to the speech training samples until the preset training condition is met, to obtain the trained target speech recognition model;
and the generation module is further specifically used for:
in the case of training the speech recognition model for the first time, inputting a preset similarity prediction result into a prediction network model to obtain the audio feature information, wherein the speech recognition model comprises the prediction network model;
in the case of an Nth training of the speech recognition model, inputting the similarity prediction result output from the (N-1)th training into the prediction network model to obtain Nth audio feature information;
wherein N is an integer greater than 1, and the audio feature information is used to determine identity information of the target object.
9. A speech recognition apparatus utilizing a target speech recognition model, the target speech recognition model being trained by the method of claim 1, the apparatus comprising:
an acquisition module, used for acquiring target audio data;
a processing module, used for inputting the target audio data into the target speech recognition model to obtain dialogue information; wherein
the dialogue information comprises: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
10. A terminal device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method for training a speech recognition model according to any one of claims 1-5 or the speech recognition method using a target speech recognition model according to any one of claims 6-7.
11. A computer-readable storage medium, having stored thereon a computer program which, when executed by a computer, causes the computer to perform the method of training a speech recognition model according to any one of claims 1-5 or the method of speech recognition using a target speech recognition model according to any one of claims 6-7.
CN201911384482.4A 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment Active CN113129867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911384482.4A CN113129867B (en) 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911384482.4A CN113129867B (en) 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN113129867A CN113129867A (en) 2021-07-16
CN113129867B true CN113129867B (en) 2024-05-24

Family

ID=76767254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911384482.4A Active CN113129867B (en) 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN113129867B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436634B (en) * 2021-07-30 2023-06-20 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
US11908454B2 (en) 2021-12-01 2024-02-20 International Business Machines Corporation Integrating text inputs for training and adapting neural network transducer ASR models
CN114429766A (en) * 2022-01-29 2022-05-03 北京百度网讯科技有限公司 Method, device and equipment for adjusting playing volume and storage medium
CN117784632B (en) * 2024-02-28 2024-05-14 深圳市轻生活科技有限公司 Intelligent household control system based on offline voice recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102543076A (en) * 2011-01-04 2012-07-04 中国移动通信集团公司 Speech training method and corresponding system for phonetic entry method
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A streaming voice transcription system based on a self-attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11657799B2 (en) * 2020-04-03 2023-05-23 Microsoft Technology Licensing, Llc Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102543076A (en) * 2011-01-04 2012-07-04 中国移动通信集团公司 Speech training method and corresponding system for phonetic entry method
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A streaming voice transcription system based on a self-attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kanishka Rao, et al. Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2018, pp. 193-199. *
Research on end-to-end speech recognition; Fan Ruchao; China Master's Theses Full-text Database (Information Science and Technology), No. 9; full text *

Also Published As

Publication number Publication date
CN113129867A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN110838286B (en) Model training method, language identification method, device and equipment
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021135577A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN110136727A (en) Speaker's personal identification method, device and storage medium based on speech content
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN112562691A (en) Voiceprint recognition method and device, computer equipment and storage medium
CN109801634A (en) A kind of fusion method and device of vocal print feature
CN108986798B (en) Processing method, device and the equipment of voice data
Vajpai et al. Industrial applications of automatic speech recognition systems
US20200020327A1 (en) Method and apparatus for recognizing a voice
CN110047481A (en) Method for voice recognition and device
US11398219B2 (en) Speech synthesizer using artificial intelligence and method of operating the same
US20210110815A1 (en) Method and apparatus for determining semantic meaning of pronoun
CN109063624A (en) Information processing method, system, electronic equipment and computer readable storage medium
CN113096647B (en) Voice model training method and device and electronic equipment
CN110728993A (en) Voice change identification method and electronic equipment
CN110853669A (en) Audio identification method, device and equipment
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment
Das et al. Multi-style speaker recognition database in practical conditions
CN109887490A (en) The method and apparatus of voice for identification
CN112992155B (en) Far-field voice speaker recognition method and device based on residual error neural network
CN112037772B (en) Response obligation detection method, system and device based on multiple modes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant