CN113129867A - Training method of voice recognition model, voice recognition method, device and equipment - Google Patents

Training method of voice recognition model, voice recognition method, device and equipment

Info

Publication number
CN113129867A
CN113129867A (application CN201911384482.4A)
Authority
CN
China
Prior art keywords
recognition model
information
audio
speech recognition
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911384482.4A
Other languages
Chinese (zh)
Other versions
CN113129867B (en)
Inventor
汪海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Shanghai ICT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Shanghai ICT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Shanghai ICT Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201911384482.4A priority Critical patent/CN113129867B/en
Publication of CN113129867A publication Critical patent/CN113129867A/en
Application granted granted Critical
Publication of CN113129867B publication Critical patent/CN113129867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0635 - Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a training method for a voice recognition model, a voice recognition method, a device and equipment. The method comprises the following steps: determining a voice training sample according to audio data of a target object, wherein the voice training sample comprises semantic information and audio feature information; and inputting the semantic information and the audio feature information into the voice recognition model and performing iterative training on the voice recognition model until a preset training condition is met, so as to obtain a trained target voice recognition model. The method and the device solve the problem of low voiceprint recognition accuracy in the related art.

Description

Training method of voice recognition model, voice recognition method, device and equipment
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a training method of a voice recognition model, a voice recognition method, a device, terminal equipment and a storage medium.
Background
Voiceprint recognition (speaker recognition) is the identification of a person by a computer using physiological or behavioral characteristics inherent to the human body. Voiceprint recognition is divided into speaker identification and speaker verification: the former determines, from the speaker's voice, which of a number of reference speakers is speaking; the latter verifies that the claimed identity of the speaker is consistent with his or her voiceprint.
At present, in the process of speaker identification, because the detection process is imperfect, a dialog is divided into a plurality of speech segments, and each segment may contain the voices of several speakers, which reduces the accuracy of distinguishing a specific voice. In addition, if the speech content related to a target speaker needs to be confirmed, a large number of audio segments must be collected to find the speech content before and after the target speaker, so that when the order of sentences is disordered, it cannot be confirmed that the identity of the speaker is consistent with the speaker's voiceprint.
Disclosure of Invention
The embodiment of the invention provides a training method of a voice recognition model, a voice recognition method, a device, terminal equipment and a storage medium, and aims to solve the problem of low voiceprint recognition accuracy in the related art.
In order to solve the technical problem, the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, where the method includes:
determining a voice training sample according to the audio data of the target object, wherein the voice training sample comprises semantic information and audio characteristic information;
and inputting semantic information and audio characteristic information into a voice recognition model, and performing iterative training on the voice recognition model until preset training conditions are met to obtain a trained target voice recognition model.
In the embodiment of the invention, semantic information and audio characteristic information corresponding to the audio data are obtained by analyzing the audio data, and the voice recognition model is then trained according to the semantic information and the audio characteristic information. As a result, even when the dialogue audio is divided into a plurality of segments, the target object can be determined according to the audio characteristic information and the identity characteristics of the target object can be recognized according to the semantic information, so that the target object can be accurately tracked in the dialogue audio. This improves the accuracy of recognizing the target object in the audio, and once the target object is recognized, its identity information is determined, from which the application scene of the dialogue audio can be derived.
In a possible embodiment, the aforementioned step of inputting the semantic information and the audio feature information into the speech recognition model, and performing iterative training on the speech recognition model until a preset training condition is met to obtain a trained target speech recognition model may specifically include:
respectively executing the following steps for each voice training sample: inputting the semantic information and the audio characteristic information into a voice recognition model to obtain a similarity prediction result of the semantic information and the audio characteristic information;
adjusting the voice recognition model according to each similarity prediction result;
and performing iterative training on the adjusted voice recognition model according to the voice training sample until a preset training condition is met to obtain a trained target voice recognition model.
In another possible embodiment, the "speech recognition model" in the embodiment of the present invention may include a transcription network model, and based on this, in the step of "determining a speech training sample according to audio data of a target object", specifically, the step may include:
inputting the audio characteristic vector of the audio data into a transcription network model to obtain semantic information;
the semantic information is used for determining text data corresponding to the audio data.
In another possible embodiment, the "speech recognition model" in the embodiment of the present invention may include a prediction network model, and based on this, in the step of "determining speech training samples according to the audio data of the target object", the method specifically includes:
under the condition of carrying out first training on the voice recognition model, inputting a preset similarity prediction result into the prediction network model to obtain audio characteristic information;
under the condition that the speech recognition model is trained for the Nth time, inputting the similarity prediction result output by the (N-1)th training into the prediction network model to obtain the Nth audio feature information;
and N is an integer greater than 1, and the audio characteristic information is used for determining the identity information of the target object.
In another possible embodiment, the "speech recognition model" in the embodiment of the present invention may further include a joint network model, and based on this, in the step of "inputting the semantic information and the audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information", the method specifically includes:
inputting semantic information and audio characteristic information into a joint network model to obtain hidden data comprising text information of the audio data and identity information of the target object;
and inputting the hidden data into the classification model to obtain a similarity prediction result of the text information and the identity information.
In yet another possible embodiment, the above-mentioned training method for a speech recognition model may further include:
an audio feature vector is determined from the audio data of the target object by means of mel-frequency cepstrum coefficients MFCC.
The step of determining the audio feature vector according to the audio data of the target object by using the mel-frequency cepstrum coefficient MFCC may specifically include:
acquiring audio data of a target object;
performing framing processing on the waveform of the audio data to obtain at least one frame segment;
performing Discrete Fourier Transform (DFT) on each frame segment in the at least one frame segment to determine a power spectrum of each frame segment;
and performing data conversion on the power spectrum to obtain an audio characteristic vector.
In still another possible embodiment, before the step of performing discrete fourier transform DFT on each frame segment of the at least one frame segment, the method further includes:
each frame segment is smoothed by a hamming window.
In a second aspect, an embodiment of the present invention provides a speech recognition method using a target speech recognition model, where the method may include:
acquiring target audio data;
inputting target audio data into a target voice recognition model to obtain dialogue information; wherein,
the dialog information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
In the embodiment of the invention, the received target audio data is input into the trained voice recognition model, so that the target object in the target audio data and the identity information of the target object can be recognized. Through the voice recognition model trained as in the first aspect, the target object can be accurately tracked in the audio data and the precision of recognizing the target object in the audio is improved; once the target object is recognized, its identity information is determined, from which the application scene of the conversation audio is obtained.
In a possible embodiment, the step of "acquiring the target audio data" may specifically include:
preprocessing the received audio data to obtain target audio data;
wherein the pre-processing comprises data cleansing and/or noise reduction.
In a third aspect, an embodiment of the present invention provides a training apparatus for a speech recognition model, where the apparatus may include:
the processing module is used for determining a voice training sample according to the audio data of the target object, wherein the voice training sample comprises semantic information and audio characteristic information;
and the generation module is used for inputting the semantic information and the audio characteristic information into the voice recognition model, and performing iterative training on the voice recognition model until preset training conditions are met to obtain a trained target voice recognition model.
In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus using a target speech recognition model, where the speech recognition model is trained by the method shown in the first aspect or the apparatus shown in the third aspect, and the apparatus includes:
the acquisition module is used for acquiring target audio data;
the processing module is used for inputting the target audio data into the target voice recognition model to obtain dialogue information; wherein,
the dialog information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
In a fifth aspect, an embodiment of the present invention provides a terminal device, including a processor, a memory, and a computer program stored on the memory and operable on the processor, where the computer program, when executed by the processor, implements a method for training a speech recognition model according to any one of the first aspect, or implements speech recognition using the speech recognition model according to any one of the second aspect.
In a sixth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, if executed in a computer, causes the computer to execute the method for training a speech recognition model according to any one of the first aspect or the speech recognition using the speech recognition model according to any one of the second aspect.
Drawings
The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.
Fig. 1 is a schematic flow chart illustrating a training method of a speech recognition model and an implementation flow of a speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating an implementation of a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a transcription network model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a prediction network model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a joint network model provided in the embodiment of the present invention;
FIG. 8 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 11 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Voiceprint recognition is a biometric identification technology, also called speaker recognition. It is the process of automatically determining whether a speaker belongs to an established set of speakers, and who the speaker is, by analyzing the received speech signal and extracting features from it. Voiceprint recognition is divided into Speaker Identification and Speaker Verification: the former determines, according to the speaker's voice, which of a number of reference speakers is speaking, which is a selection problem; the latter confirms whether the identity of the speaker is consistent with the speaker's claim, which is a yes/no decision problem. Voiceprint recognition in which the content spoken is predetermined is called text-dependent voiceprint recognition, while voiceprint recognition in which the content is not determined in advance and anything may be spoken is called text-independent voiceprint recognition.
The main task of speaker recognition is to identify who said what; that is, the speaker classification task is a key step in automatically understanding human conversational audio. For example, in a doctor-patient conversation, "Yes" spoken by the patient in answer to the doctor's question ("Do you often take your heart-disease medication?") carries a quite different meaning than the same word spoken by the doctor.
Conventional speaker distinction and speech recognition is mainly divided into two parts: automatic speech recognition (ASR) and speaker classification (SD). The ASR result is the text corresponding to the speech, and the SD result is the speaker corresponding to each speech segment. Combining these two results, we can get "who said what". The specific implementation of these two processes is briefly described below.
A conventional speaker classification (SD) system works in two steps: the first step is to detect changes in the sound spectrum and thereby determine when the speaker has switched; the second step is to identify each speaker in the conversation. Conventional speaker classification systems rely on acoustic differences between human voices to distinguish the different speakers in a conversation. The voices of men and women are easy to distinguish because their pitches differ greatly, so they can be separated in one step with a simple acoustic model; speakers with similar pitches are distinguished in the following way:
First, based on the detected speech features, a change-detection algorithm segments the conversation uniformly into segments, each of which is expected to contain only one speaker. Next, the speech segments of each speaker are mapped into an embedding vector using a deep learning model. In the final clustering step, these embeddings are clustered together so as to track the same speaker throughout the conversation. In practice, the speaker classification system runs in parallel with an automatic speech recognition (ASR) system, and the outputs of the two systems are combined to label the recognized words. The automatic speech recognition system mainly uses a pattern-matching method: in the training phase, the user speaks each word in the vocabulary in turn, and the feature vectors are stored as templates in a template library; in the recognition phase, the feature vector of the input speech is compared in turn with each template in the template library for similarity, and the template with the highest similarity is output as the recognition result. This process is currently implemented using the connectionist temporal classification (CTC) algorithm.
Although the above approach has many advantages in voiceprint recognition, there are also limitations, as will be described in detail below:
First, the dialog needs to be segmented into segments each of which contains only one person's voice; otherwise the embedding does not accurately characterize the speaker. However, the relevant algorithms are currently imperfect, so a segmented segment may contain several speakers' voices.
Second, the number of speakers needs to be determined during the clustering process, and this stage is very sensitive to the accuracy of that input. In addition, a difficult trade-off is required in clustering between the segment size used to estimate the speech features and the required model accuracy: the longer the segment, the higher the quality of the speech features, because the model has more speaker-related information. This can cause the model to attribute a brief interjection to the wrong speaker, which can have very serious consequences, for example where positive and negative answers must be tracked accurately in clinical or financial contexts.
Third, conventional speaker classification systems have no simple mechanism to exploit linguistic cues that are particularly prominent in many natural dialogues. For example, "How long have you been taking this medication?" is most likely said by the healthcare worker in a clinical conversation scenario; similarly, "When do we need to hand in the assignment?" is likely spoken by a student rather than a teacher. Therefore, current speech recognition approaches cannot accurately analyze the speech content, so the semantics and scenes related to the speech cannot be accurately recognized.
In summary, embodiments of the present invention provide a training method for a speech recognition model, a speech recognition method, an apparatus, a terminal device and a storage medium to solve the problem in the related art that the voiceprint recognition accuracy is not high.
The embodiment of the invention provides a training method and a voice recognition method of a voice recognition model, which are used for researching automatic voice recognition and speaker distinguishing, wherein the overall flow of the two methods is shown in figure 1 and mainly comprises two parts: the process of establishing and training a speech recognition model (left part of fig. 1) and the process of performing speech recognition based on the trained model (right part of fig. 1).
Further, the establishment of the speech recognition model may mainly include the following steps:
(1) collecting data, including data collected on devices such as mobile phones and computers and data downloaded from a public data set on the internet, wherein the formats of the data are WAVE, MPEG, MP3, WMA and the like;
(2) data cleaning, wherein some of the data collected from the devices may be unclear, contain unintelligible speech, be distorted, and the like, so this part of the data needs to be removed and a high-definition Chinese or English data set is retained;
(3) adding labels, wherein the data acquired in step (1) are in audio format and have no corresponding text or speaker labels, so labels need to be added in preparation for training.
(4) Training a speech recognition model, namely determining a speech training sample according to the audio data (such as the audio collected in the step (1) and the label added to the speech data in the step (3) in some scenes) of the target object, wherein the speech training sample comprises semantic information and audio characteristic information;
and inputting the semantic information and the audio characteristic information into the voice recognition model, and performing iterative training on the voice recognition model until preset training conditions are met to obtain a trained target voice recognition model.
The second part is speech recognition using a target speech recognition model, which may include:
(1) collecting dialogs to be analyzed and storing the dialogs as audio files;
(2) data cleaning, wherein noise or other non-voice sounds may occur in the acquisition process, so that the audio file can be denoised;
(3) the denoised audio is input into the target speech recognition model (such as the Joint ASR + SD model in FIG. 2) to obtain the corresponding text and speaker information (such as "speaker 1: word1; speaker 2: word2 word3; speaker 1: word4", etc. in FIG. 2).
The method of these two parts utilizes acoustic and linguistic information simultaneously and has language-model modeling capability in the speaker recognition process. The model works quite well when each speaker has a definite role, such as in typical scenes of doctor-patient conversation, shopping and the like.
Based on the above application scenarios, the following describes in detail a training method of a speech recognition model according to an embodiment of the present invention.
Fig. 3 is a flowchart of a method for training a speech recognition model according to an embodiment of the present invention.
As shown in fig. 3, the method for training a speech recognition model may specifically include steps 310 to 330, which are specifically as follows:
step 310: and determining a voice training sample according to the audio data of the target object, wherein the voice training sample comprises semantic information and audio characteristic information.
Here, in a possible embodiment, before performing step 310, it is necessary to convert the audio data into a format that can be recognized by the transcription network model and/or the prediction network model, and thus, the method may further include:
an audio feature vector is determined from the audio data of the target object by means of Mel Frequency Cepstral Coefficient (MFCC).
This step is further explained below:
(1) audio data of a target object is acquired.
(2) Framing processing is performed on the waveform of the audio data to obtain at least one frame segment.
For example, 20-40 milliseconds (ms) is typically taken as the width of a frame; in the embodiment of the present invention a frame width of 40 ms may be taken, so for a 44.1 kHz sampled signal a frame contains 0.040 × 44100 = 1764 samples. The frame shift is taken as 20 ms, allowing a 20 ms overlap between every two adjacent frames. Thus the first frame runs from the 1st to the 1764th sample and the second frame from the 883rd to the 2646th sample, and so on up to the last sample; finally, zeros are appended if the audio length cannot be divided evenly into frames. For 15 s of audio data, 44100 × 15 / 882 = 750 frames are obtained.
(3) Performing a Discrete Fourier Transform (DFT) on each of the at least one frame segment to determine a power spectrum for each frame segment.
Wherein the power spectrum of each frame segment can be determined by the following equations (1) and (2), in which x_i(n) denotes the n-th sample of the i-th (windowed) frame and N is the number of samples per frame:

X_i(k) = Σ_{n=0}^{N-1} x_i(n) · e^(-j2πkn/N),  k = 0, 1, ..., N-1    (1)

P_i(k) = |X_i(k)|² / N    (2)
In fact, the DFT amounts to two "correlation" operations: one correlates the audio data with a cosine sequence of frequency k and the other with a sine sequence of frequency k, and their superposition is the result of correlating with a sinusoid of frequency k. If the obtained value is large, it indicates that the audio data contains a large amount of energy at frequency k.
(4) And performing data conversion on the power spectrum to obtain an audio characteristic vector.
For example, the conversion between linear frequency f and Mel frequency m used to construct the Mel-spaced filter bank is given by equation (3):

M(f) = 1125 · ln(1 + f/700),  M^(-1)(m) = 700 · (exp(m/1125) - 1)    (3)
the Mel-spaced filter bank is a group of filter banks with nonlinear distribution, which are densely distributed in the low frequency part and sparsely distributed in the high frequency part, and the distribution is to better satisfy the auditory characteristics of human ears. Then, log is taken from the 128-dimensional Mel power spectrum determined in the above formula (3), and 128-dimensional filter bank energies log-Mel filter bank energies (i.e. the k-energy in step (3)) are obtained. The reason for this is that the non-linear relationship log represents more accurately because the human ear's perception of sound is not linear.
Based on the above steps (1) - (4), sometimes in order to make the obtained audio feature vector more accurate, in a possible example, before the step (3), the method may further include:
each frame segment is smoothed by a hamming window.
Here, the purpose of windowing is to smooth the signal, and if the signal is smoothed using a hamming window, the side lobe size after FFT and the spectral leakage are reduced compared to the rectangular window function.
In the embodiment of the present invention, a formula (4) for windowing a signal using a hamming window (windowing) is as follows:
w(n) = 0.54 - 0.46 · cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1    (4)
thereby, a piece of audio data is converted into a set of time-sequenced audio feature vectors.
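To make the above pipeline concrete, the following is a minimal NumPy sketch of steps (2)-(4) together with the Hamming window of equation (4); it assumes the 40 ms frame width, 20 ms frame shift and 128-dimensional Mel filter bank mentioned above, and all function and variable names are illustrative rather than taken from the patent:

import numpy as np

def log_mel_features(signal, sr=44100, frame_len=0.040, frame_shift=0.020, n_mels=128):
    """Convert one piece of audio into a time-ordered set of log-Mel feature vectors."""
    flen, fshift = int(frame_len * sr), int(frame_shift * sr)

    # (2) framing: split the waveform into overlapping frames, zero-padding the tail
    if len(signal) % fshift:
        signal = np.pad(signal, (0, fshift - len(signal) % fshift))
    frames = np.stack([signal[i:i + flen]
                       for i in range(0, len(signal) - flen + 1, fshift)])

    # smoothing each frame with a Hamming window, equation (4)
    frames = frames * np.hamming(flen)

    # (3) DFT and power spectrum of each frame, equations (1) and (2)
    power = (np.abs(np.fft.rfft(frames, axis=1)) ** 2) / flen

    # (4) Mel filter bank built from equation (3), then log compression
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)
    hz_points = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((flen + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_mels, power.shape[1]))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)   # shape: (num_frames, n_mels)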
Based on this, here, in one possible embodiment, the speech recognition model may include at least one sub-model of: a transcription network model, a prediction network model, and a joint network model.
When the speech recognition model includes a transcription network model, this step 310 may specifically include:
inputting the audio characteristic vector of the audio data into a transcription network model to obtain semantic information;
the semantic information is used for determining text data corresponding to the audio data.
And/or, when the speech recognition model includes a predictive network model, this step 310 may specifically include:
under the condition of carrying out first training on the voice recognition model, inputting a preset similarity prediction result into the prediction network model to obtain audio characteristic information;
under the condition that the voice recognition model is trained for the Nth time, inputting the similarity prediction result output by the (N-1)th training into the prediction network model to obtain the Nth audio feature information;
and N is an integer greater than 1, and the audio characteristic information is used for determining the identity information of the target object.
It should be noted that the above two cases can be combined; that is, when the speech recognition model includes both a transcription network model and a prediction network model, the above steps can be used to determine the semantic information and the audio feature information respectively.
To further explain this step, it can be exemplified as follows:
the speech recognition model related in the embodiment of the invention is obtained on the basis of a Recurrent Neural Network Transducer (RNN-T) model. The model of the speech recognition model is mainly characterized in that seamless combination of sound and language clues is realized, and speaker classification and speech recognition are integrated into the same system. Compared with a single recognition system of the same type, the integrated model can not greatly reduce the performance of voice recognition, but can greatly improve the effect of speaker distinguishing.
This integrated speech recognition model can be trained like a speech recognition system. The training reference data include the transcription of the speech and labels that distinguish the speakers, for example: "When should the homework be handed in?" <student>; "I hope you can submit it before class tomorrow." <teacher>. After the model has been trained with audio and the corresponding reference transcripts, the user can feed in more dialog recordings and obtain output of a similar form.
Step 320: and inputting the semantic information and the audio characteristic information into the voice recognition model, and performing iterative training on the voice recognition model until preset training conditions are met to obtain a trained target voice recognition model.
Wherein the following steps are respectively executed for each voice training sample: inputting the semantic information and the audio characteristic information into a voice recognition model to obtain a similarity prediction result of the semantic information and the audio characteristic information; adjusting the voice recognition model according to each similarity prediction result; and performing iterative training on the adjusted voice recognition model according to the voice training sample until a preset training condition is met to obtain a trained target voice recognition model.
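As a hedged illustration of this loop (a per-sample forward pass producing a similarity prediction result, an adjustment of the model after each result, and a stopping condition), a possible shape of the training procedure is sketched below in PyTorch; the optimizer, loss function, accuracy threshold and all names are assumptions made for illustration and are not specified by the patent:

import torch
import torch.nn.functional as F

def train_speech_recognition_model(model, samples, max_epochs=50, target_acc=0.95):
    """samples: iterable of (semantic_info, audio_feature_info, label) tensors;
    `model` is assumed to return a (batch, num_labels) similarity prediction."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(max_epochs):                   # iterative training
        hits, total = 0, 0
        for semantic_info, audio_feature_info, label in samples:
            similarity = model(semantic_info, audio_feature_info)  # similarity prediction result
            loss = F.cross_entropy(similarity, label)
            optimizer.zero_grad()
            loss.backward()                           # adjust the model according to this result
            optimizer.step()
            hits += (similarity.argmax(dim=-1) == label).sum().item()
            total += label.numel()
        if total and hits / total >= target_acc:      # preset training condition is met
            break
    return model                                      # trained target speech recognition model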
Here, based on the above possibility in step 310, when the speech recognition model further includes a joint network model, step 320 may specifically include:
inputting semantic information and audio characteristic information into the joint network model to obtain hidden data comprising text information of the audio data and identity information of the target object;
and inputting the hidden data into the classification model to obtain a similarity prediction result of the text information and the identity information.
Thus, to further explain how to obtain the similarity prediction result of the semantic information and the audio feature information, the following describes the following steps with reference to two sub-models involved in step 310:
as shown in fig. 4, when a speech recognition model (e.g., a specific implementation of the Joint ASR + SD model in fig. 2) may include a Transcription Network model (Transcription Network), a Prediction Network model (Prediction Network), and a Joint Network model (Joint Network), each sub-model is introduced separately.
(1) Transcription network model
The transcription network model, also called the encoder, receives the audio feature vectors processed in step 310 as input and, after passing through its neural network layers, outputs an intermediate variable (denoted here as h_t^enc). This variable contains the semantic information of the audio data and is used in the subsequent process to train the text information corresponding to the speech, i.e., to determine the text data corresponding to the audio data.
(2) Predictive network model
The main function of the prediction network model is to obtain the speaker characteristics. It receives the output of the previous joint network step as input and, after passing through its neural network layers, outputs an intermediate variable (denoted here as h_u^pred). This variable contains the audio feature information corresponding to each speech segment, i.e., the speaker information, and can be used to train the speaker information corresponding to the speech.
Here, it is to be noted that when there is no output from a previous joint network step, i.e., during the first training, the preset similarity prediction result is input into the prediction network model to obtain the audio feature information; under the condition that the speech recognition model is trained for the Nth time, the similarity prediction result output by the (N-1)th training is input into the prediction network model to obtain the Nth audio feature information, where N is an integer greater than 1 and the audio feature information is used for determining the identity information of the target object.
(3) Joint network model
The joint network model receives the output results of the transcription network model and the prediction network model, i.e., it takes the combined h_t^enc and h_u^pred as input, and after training through a neural network layer obtains a similarity prediction result corresponding to each label, which is then input into the prediction network model again. This is the feedback loop in the model: previously recognized words are fed back as input, which enables the RNN-T model to integrate linguistic cues such as the end of a question, and is also a core reason it is able to distinguish speakers. To obtain the final corresponding text and speakers, in the embodiment of the invention the label group with the maximum global probability can be selected directly, or all time periods can be integrated.
Further, to better illustrate how to train the speech recognition model in the embodiment of the present invention, a specific example is given as follows:
As shown in fig. 4, the input to the transcription network model is the symbol sequence X = [x_1, x_2, ..., x_T], where T is the number of symbols in the sequence (corresponding to the number of audio frames) and x_t ∈ R^d is the feature obtained by the Mel filter bank, with d = 80. The prediction network model is correspondingly represented by the symbol sequence Y = [y_1, y_2, ..., y_U], which contains the speech recognition results and the speaker labels, where y_u ∈ Ω and Ω is the full output space of the RNN-T network. The core function of the training is the RNN-T objective shown in formula (5), which sums over all alignments A(X, Y) between the input and output sequences:

P(Y | X) = Σ_{Y' ∈ A(X, Y)} Π_i P(y'_i | x_1, ..., x_{t_i}, y_1, ..., y_{u_(i-1)})    (5)
Based on this, the three main parts of the speech recognition model, namely the transcription network model, the prediction network model and the joint network model, are described in detail below.
(1) Transcription Network
The audio feature vector, of dimension 80, is taken as input. For ease of training, long audio is divided into segments of up to 15 seconds, and each segment may contain several speakers. Since longer units are more suitable for speech recognition, the time resolution of the output sequence can be reduced, thereby improving the efficiency of training and inference. For this purpose, the embodiment of the present invention adopts a hierarchical structure of Time Delay Neural Network (TDNN) layers, which reduces the time resolution from 10 ms to 80 ms. The architecture is very similar to the encoder of a CTC word model, and this decimation increases the inference speed and reduces the recognition error rate.
Specifically, the Transcription Network model consists of three identical blocks, each consisting of four layers as shown in fig. 5:
(1) a one-dimensional temporal convolution layer with 512 filters and a kernel size of 5, followed by a max-pooling operator of 2; (2) three bidirectional long short-term memory (LSTM) layers with 512 cells. The Transcription Network model is trained using the stochastic-gradient-based Adam optimizer.
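A hedged PyTorch sketch of such an encoder is given below; it follows the block structure just described (1-D convolution with 512 filters and kernel size 5, max pooling of 2, three bidirectional LSTM layers with 512 cells, three blocks giving an overall time reduction of 2^3 = 8, i.e., 10 ms to 80 ms), while the padding, exact layer ordering and class names are assumptions made for illustration:

import torch
import torch.nn as nn

class TranscriptionBlock(nn.Module):
    """One of three identical encoder blocks: Conv1d (512 filters, kernel 5) + max-pool 2,
    followed by three bidirectional LSTM layers with 512 cells per direction."""
    def __init__(self, in_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, 512, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(2)            # halves the time resolution
        self.lstm = nn.LSTM(512, 512, num_layers=3,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, features)
        x = self.pool(self.conv(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.lstm(x)
        return x                               # (batch, time / 2, 1024)

class TranscriptionNetwork(nn.Module):
    """Stack of three blocks; the output feeds the joint network as h_t^enc."""
    def __init__(self, feat_dim=80):
        super().__init__()
        self.blocks = nn.Sequential(TranscriptionBlock(feat_dim),
                                    TranscriptionBlock(1024),
                                    TranscriptionBlock(1024))

    def forward(self, features):               # features: (batch, time, 80)
        return self.blocks(features)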
(2) Prediction Network
The Prediction Network model receives the previous result y_{u-1} as input. It first passes through a word embedding layer, which maps a morpheme vocabulary of 4096 units into a 512-dimensional vector space; the output of this layer is then used as input to an LSTM layer with 1024 cells, which is finally followed by a fully connected layer with 512 cells. This process can be expressed as in equation (6):
h_u^pred = FullyConnected( LSTM( Embedding(y_{u-1}) ) )    (6)
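A minimal PyTorch sketch of this prediction network, assuming the 4096-entry morpheme vocabulary, 512-dimensional embedding, 1024-cell LSTM and 512-unit fully connected layer described above (class and variable names are illustrative only):

import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    def __init__(self, vocab_size=4096, embed_dim=512, hidden=1024, out_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # 4096 morphemes -> 512-d vectors
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, out_dim)                  # 512-unit fully connected layer

    def forward(self, y_prev, state=None):
        # y_prev: (batch, 1) index of the previous output label y_{u-1}
        emb = self.embedding(y_prev)
        out, state = self.lstm(emb, state)
        return self.fc(out), state                            # h_u^pred and the recurrent state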
a single layer LSTM network may be represented by fig. 6, and mainly comprises the following parts:
forgetting door of LSTM
The forget gate controls what is forgotten; in an LSTM, it controls, with a certain probability, how much of the cell state from the previous time step is discarded.
LSTM input gate
The next step is to decide how much new information to add to the cell state. This is done in two steps: first, a sigmoid layer called the input gate layer decides which information needs to be updated; then a tanh layer generates a vector of candidate values, i.e., the content that may be added to the state.
Cellular state renewal of LSTM
Before studying the output gate of the LSTM, consider the cell state itself. The results of the forget gate and the input gate described above both act on the cell state C(t); that is, C(t) is obtained by updating the previous cell state C(t-1) with them.
Output gate of LSTM
With the new cell state C(t) available, the output gate produces the output. The update of the hidden state h(t) consists of two parts: the first part is o(t), obtained from the previous hidden state h(t-1), the current input x(t) and a sigmoid activation; the second part is the tanh activation applied to the cell state C(t). Their product gives h(t).
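Putting the four parts above together, a single LSTM time step can be sketched as follows in NumPy; the weight names W, U and b are generic placeholders rather than symbols from the patent:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b are dicts holding the parameters of the
    forget (f), input (i), candidate (g) and output (o) parts."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate content (tanh layer)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_t = f * c_prev + i * g                               # cell state update C(t)
    h_t = o * np.tanh(c_t)                                 # hidden state h(t)
    return h_t, c_t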
(3) Joint Network
As shown in fig. 7, the outputs of the Transcription Network and the Prediction Network are merged and input into a fully connected neural network layer with 512 hidden cells; the result is then passed to a softmax layer with 4096 cells, producing the final results y1, y2 and y3. The values of the output layer, i.e., the labels to be trained, are set as combinations of text and speaker tags, which may be implemented as follows (see also the sketch after these examples):
hello dr jekyll<spk:pt>
hello mr hyde what brings you here today<spk:dr>
I am struggling again with my bipolar disorder<spk:pt>
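A hedged PyTorch sketch of such a joint network, assuming the 512-unit hidden layer and the 4096-way output over combined text/speaker labels described above; the concatenation used to merge the two inputs and the dimensions of h_enc and h_pred are assumptions for illustration, not details stated in the patent:

import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, enc_dim=1024, pred_dim=512, hidden=512, vocab_size=4096):
        super().__init__()
        self.fc = nn.Linear(enc_dim + pred_dim, hidden)   # fully connected layer, 512 hidden cells
        self.out = nn.Linear(hidden, vocab_size)          # feeds the 4096-way softmax

    def forward(self, h_enc, h_pred):
        # h_enc: transcription-network output, h_pred: prediction-network output
        joint = torch.cat([h_enc, h_pred], dim=-1)        # merge the two outputs
        logits = self.out(torch.tanh(self.fc(joint)))
        return torch.log_softmax(logits, dim=-1)          # distribution over text + speaker labels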
Here, it should be noted that the preset training condition in the embodiment of the present invention may be considered met when the number of iterations reaches a preset threshold (i.e., a maximum number of iterations), or when, during iteration, the accuracy of the similarity prediction results with respect to the actual values reaches a preset threshold.
Therefore, in the embodiment of the invention, the semantic information and the audio characteristic information corresponding to the audio data are obtained by analyzing the audio data, and the voice recognition model is then trained according to them. Even when the dialogue audio is divided into a plurality of segments, the target object can be determined according to the audio characteristic information and its identity characteristics recognized according to the semantic information, so that the target object can be accurately tracked in the dialogue audio and the accuracy of recognizing the target object in the audio is improved; once the target object is recognized, its identity information is determined, from which the application scene of the dialogue audio is obtained.
In conclusion, the embodiment of the invention studies the speaker distinguishing process in combination with linguistic information, makes full use of the known information, and improves recognition precision. In addition, since the above method does not require forced alignment, the text sequence itself can be used for training. Because the RNN-T output contains a large number of blanks, the model can skip frames during decoding, which greatly accelerates the decoding process. Due to its monotonicity, the method can perform real-time online decoding, which broadens the range of application scenarios.
In addition, the embodiment of the invention also provides a speech recognition method based on the trained speech recognition model.
Fig. 8 is a flowchart of a speech recognition method according to an embodiment of the present invention.
As shown in fig. 8, the method may specifically include:
step 810, obtaining target audio data.
Here, in one possible embodiment, the received audio data is preprocessed to obtain target audio data;
wherein the pre-processing comprises data cleansing and/or noise reduction.
Step 820, inputting the target audio data into the target voice recognition model determined in the step 320 to obtain dialogue information; wherein,
the dialog information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
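Purely as an illustration, the recognition flow of steps 810-820 might look as follows; the `denoise` preprocessing helper and the `greedy_decode` method are hypothetical placeholders, since the patent does not prescribe a specific noise-reduction algorithm or decoding routine, and `log_mel_features` refers to the feature-extraction sketch given earlier:

def recognize_dialog(raw_audio, sample_rate, target_model, denoise, id2label):
    """Return dialog information: a label sequence mixing text and speaker identity tags."""
    # Step 810: preprocessing (data cleaning and/or noise reduction) -> target audio data
    target_audio = denoise(raw_audio, sample_rate)              # hypothetical helper
    features = log_mel_features(target_audio, sr=sample_rate)   # same features as in training

    # Step 820: feed the target audio data into the trained target speech recognition model
    label_ids = target_model.greedy_decode(features)            # hypothetical decoding method
    return [id2label[i] for i in label_ids]                     # e.g. ["hello", "<spk:pt>", ...]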
Based on the above two processes, the embodiment of the present invention further provides two devices, namely, a speech recognition model training device and a speech recognition device, which are specifically shown as follows.
Fig. 9 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present invention.
As shown in fig. 9, the training apparatus 90 for a speech recognition model may specifically include:
a processing module 901, configured to determine a voice training sample according to audio data of a target object, where the voice training sample includes semantic information and audio feature information;
the generating module 902 inputs semantic information and audio characteristic information into a speech recognition model, and performs iterative training on the speech recognition model until a preset training condition is met to obtain a trained target speech recognition model.
The generating module 902 may specifically be configured to, for each speech training sample, respectively perform the following steps: inputting the semantic information and the audio characteristic information into a voice recognition model to obtain a similarity prediction result of the semantic information and the audio characteristic information; adjusting the voice recognition model according to each similarity prediction result; and performing iterative training on the adjusted voice recognition model according to the voice training sample until a preset training condition is met to obtain a trained target voice recognition model.
In one possible embodiment, the speech recognition model includes a transcription network model. Based on this, the processing module 901 in the embodiment of the present invention may specifically include:
inputting the audio characteristic vector of the audio data into a transcription network model to obtain semantic information;
the semantic information is used for determining text data corresponding to the audio data.
In another possible embodiment, the speech recognition model includes a predictive network model; based on this, in the embodiment of the present invention, the generating module 902 inputs the preset similarity prediction result into the prediction network model to obtain the audio feature information under the condition of performing the first training on the speech recognition model;
under the condition that the speech recognition model is trained for the Nth time, inputting a similarity prediction result output from the training for the (N-1) th time into a prediction network model to obtain the audio characteristic information for the Nth time;
and N is an integer greater than 1, and the audio characteristic information is used for determining the identity information of the target object.
In yet another possible embodiment, the speech recognition model further comprises a federated network model; the generating module 902 in the embodiment of the present invention may specifically be configured to input semantic information and audio feature information into a joint network model, so as to obtain hidden data including text information of audio data and identity information of a target object;
and inputting the hidden data into the classification model to obtain a similarity prediction result of the text information and the identity information.
In addition, the training apparatus 90 for speech recognition model may further include a determining module 904 for determining an audio feature vector according to the audio data of the target object by means of mel-frequency cepstrum coefficients MFCC.
In a possible embodiment, the determining module 904 may be specifically configured to obtain audio data of the target object; performing framing processing on the oscillogram of the audio data to obtain at least one frame segment; performing Discrete Fourier Transform (DFT) on each frame segment in the at least one frame segment to determine a power spectrum of each frame segment; and performing data conversion on the power spectrum to obtain an audio characteristic vector.
Based on this, the training device 90 for the speech recognition model may further include a transformation module 905 for performing a smoothing process on each frame segment through a hamming window.
Therefore, in the embodiment of the invention, the semantic information and the audio characteristic information corresponding to the audio data are obtained by analyzing the audio data, and the voice recognition model is then trained according to them. Even when the dialogue audio is divided into a plurality of segments, the target object can be determined according to the audio characteristic information and its identity characteristics recognized according to the semantic information, so that the target object can be accurately tracked in the dialogue audio and the accuracy of recognizing the target object in the audio is improved; once the target object is recognized, its identity information is determined, from which the application scene of the dialogue audio is obtained.
Fig. 10 is a schematic structural diagram of a speech recognition apparatus using a target speech recognition model according to an embodiment of the present invention.
As shown in fig. 10, the speech recognition apparatus 100 may specifically include:
an obtaining module 1001 configured to obtain target audio data;
the processing module 1002 is configured to input target audio data into a target speech recognition model to obtain dialog information; wherein,
the dialog information includes: text data corresponding to the target audio data, wherein the text data carries the identity of the target object.
The obtaining module 1001 may be specifically configured to perform preprocessing on received audio data to obtain target audio data; wherein the pre-processing comprises data cleansing and/or noise reduction.
Fig. 11 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
The terminal device 1100 includes, but is not limited to: radio frequency unit 1101, network module 1102, audio output unit 1103, input unit 1104, sensor 1105, display unit 1106, user input unit 1107, interface unit 1108, memory 1109, processor 1110, and power supply 1111. Those skilled in the art will appreciate that the terminal device configuration shown in fig. 11 does not constitute a limitation of the terminal device, and that the terminal device may include more or fewer components than shown, or combine certain components, or a different arrangement of components. In the embodiment of the present invention, the terminal device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 1101 may be configured to receive and transmit signals during a message transmission or a call, and specifically, receive downlink resources from a base station and then process the received downlink resources to the processor 1110; in addition, the uplink resource is transmitted to the base station. In general, radio frequency unit 1101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 1101 may also communicate with a network and other devices through a wireless communication system.
The terminal device provides wireless broadband internet access to the user through the network module 1102, such as helping the user send and receive e-mails, browse web pages, and access streaming media.
The audio output unit 1103 may convert an audio resource received by the radio frequency unit 1101 or the network module 1102 or stored in the memory 1109 into an audio signal and output as sound. Also, the audio output unit 1103 can also provide audio output related to a specific function performed by the terminal device 1100 (e.g., a call signal reception sound, a message reception sound, and the like). The audio output unit 1103 includes a speaker, a buzzer, a receiver, and the like.
The input unit 1104 is used to receive audio or video signals. The input unit 1104 may include a Graphics Processing Unit (GPU) 11041 and a microphone 11042, and the graphics processor 11041 processes image resources of still pictures or videos obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 1106. The image frames processed by the graphics processor 11041 may be stored in the memory 1109 (or other storage medium) or transmitted via the radio frequency unit 1101 or the network module 1102. The microphone 11042 may receive sound and process it into an audio resource; in phone call mode, the processed audio resource may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 1101 and output.
Terminal device 1100 also includes at least one sensor 1105, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that adjusts the brightness of the display panel 11061 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 11061 and/or the backlight when the terminal device 1100 moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the terminal device attitude (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration), vibration identification related functions (such as pedometer, tapping), and the like; the sensors 1105 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., and are not described in detail herein.
The display unit 1106 is used to display information input by a user or information provided to the user. The Display unit 1106 may include a Display panel 11061, and the Display panel 11061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 1107 may be used to receive input numeric or character information and generate key signal inputs associated with user settings and function control of the terminal apparatus. Specifically, the user input unit 1107 includes a touch panel 11071 and other input devices 11072. The touch panel 11071, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 11071 (e.g., operations by a user on or near the touch panel 11071 using a finger, a stylus, or any other suitable object or attachment). The touch panel 11071 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1110, and receives and executes commands from the processor 1110. In addition, the touch panel 11071 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The user input unit 1107 may include other input devices 11072 in addition to the touch panel 11071. In particular, the other input devices 11072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 11071 may be overlaid on the display panel 11061. When the touch panel 11071 detects a touch operation on or near it, the touch operation is transmitted to the processor 1110 to determine the type of the touch event, and the processor 1110 then provides a corresponding visual output on the display panel 11061 according to the type of the touch event. Although in fig. 11 the touch panel 11071 and the display panel 11061 are implemented as two independent components to realize the input and output functions of the terminal device, in some embodiments the touch panel 11071 and the display panel 11061 may be integrated to realize the input and output functions of the terminal device, which is not limited herein.
The interface unit 1108 is an interface for connecting an external device to the terminal device 1100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 1108 may be used to receive input (e.g., data, power, etc.) from an external device and transmit the received input to one or more elements within the terminal device 1100, or may be used to transmit data between the terminal device 1100 and an external device.
The memory 1109 may be used to store software programs and various data. The memory 1109 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. In addition, the memory 1109 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state memory devices.
The processor 1110 is the control center of the terminal device; it connects the various parts of the entire terminal device by using various interfaces and lines, and performs the various functions of the terminal device and processes data by running or executing the software programs and/or modules stored in the memory 1109 and calling the data stored in the memory 1109, thereby monitoring the terminal device as a whole. The processor 1110 may include one or more processing units; preferably, the processor 1110 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 1110.
Terminal device 1100 may also include a power supply 1111 (such as a battery) for supplying power to the various components. Preferably, the power supply 1111 may be logically connected to the processor 1110 via a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented via the power management system.
In addition, the terminal device 1100 includes some functional modules that are not shown, which are not described in detail herein.
Embodiments of the present invention also provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a computer, the computer is caused to execute the steps of the training method of the speech recognition model or of the speech recognition method according to the embodiments of the present invention.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solutions of the present invention may essentially be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A method for training a speech recognition model, the method comprising:
determining a speech training sample according to audio data of a target object, wherein the speech training sample comprises semantic information and audio feature information;
and inputting the semantic information and the audio feature information into a speech recognition model, and performing iterative training on the speech recognition model until a preset training condition is satisfied, to obtain a trained target speech recognition model.
2. The method according to claim 1, wherein the inputting the semantic information and the audio feature information into a speech recognition model, and performing iterative training on the speech recognition model until a preset training condition is satisfied to obtain a trained target speech recognition model comprises:
for each speech training sample, respectively performing the following step: inputting the semantic information and the audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information;
adjusting the speech recognition model according to each similarity prediction result;
and performing iterative training on the adjusted speech recognition model according to the speech training samples until the preset training condition is satisfied, to obtain the trained target speech recognition model.
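As a purely illustrative reading of the iterative training recited in claims 1 and 2, the following Python sketch adjusts a model from per-sample similarity prediction results until a preset condition is met. The PyTorch framework, the Adam optimizer, the mean-squared-error similarity loss, and the loss threshold used as the stopping condition are all assumptions introduced for the example; the claims do not prescribe them.

    # Minimal training-loop sketch for claims 1-2; every concrete choice below
    # (optimizer, loss function, stopping threshold) is an assumption.
    from torch import nn, optim

    def train(model: nn.Module, samples, target_similarity, epochs=10, lr=1e-3, threshold=1e-3):
        """samples: list of (semantic_info, audio_feature_info) tensor pairs."""
        optimizer = optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):                          # iterative training
            epoch_loss = 0.0
            for semantic_info, audio_feature_info in samples:
                # similarity prediction result for this speech training sample
                similarity_pred = model(semantic_info, audio_feature_info)
                loss = loss_fn(similarity_pred, target_similarity)
                optimizer.zero_grad()
                loss.backward()                          # adjust the model per prediction result
                optimizer.step()
                epoch_loss += loss.item()
            if epoch_loss / max(len(samples), 1) < threshold:   # preset training condition
                break
        return model                                     # trained target speech recognition model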
3. The method of claim 2, wherein the speech recognition model comprises a transcription network model, and the determining a speech training sample according to the audio data of the target object comprises:
inputting the audio feature vector of the audio data into the transcription network model to obtain the semantic information;
wherein the semantic information is used for determining text data corresponding to the audio data.
4. The method of claim 2, wherein the speech recognition model comprises a prediction network model, and the determining a speech training sample according to the audio data of the target object comprises:
in the case of training the speech recognition model for the first time, inputting a preset similarity prediction result into the prediction network model to obtain the audio feature information;
in the case of training the speech recognition model for the Nth time, inputting the similarity prediction result output by the (N-1)th training into the prediction network model to obtain the audio feature information for the Nth time;
wherein N is an integer greater than 1, and the audio feature information is used for determining identity information of the target object.
5. The method of claim 4, wherein the speech recognition model further comprises a joint network model;
the inputting the semantic information and the audio feature information into the speech recognition model to obtain a similarity prediction result of the semantic information and the audio feature information comprises:
inputting the semantic information and the audio feature information into the joint network model to obtain hidden data comprising text information of the audio data and identity information of the target object;
and inputting the hidden data into a classification model to obtain a similarity prediction result of the text information and the identity information.
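Read together, claims 3-5 describe a transducer-style arrangement of a transcription network, a prediction network, a joint network, and a classification model. The non-authoritative Python/PyTorch sketch below takes the audio feature vectors and the previous similarity prediction as inputs, because the semantic information and the audio feature information are produced inside the model according to claims 3 and 4. Layer types, dimensions, the crude time alignment, and the sigmoid/mean readout are assumptions added for illustration only.

    # Transducer-style sketch of claims 3-5; all layer choices and sizes are assumed.
    import torch
    from torch import nn

    class SpeechRecognitionSketch(nn.Module):
        def __init__(self, feat_dim=40, sim_dim=1, hidden_dim=256):
            super().__init__()
            # transcription network (claim 3): audio feature vectors -> semantic information
            self.transcription_net = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            # prediction network (claim 4): previous similarity prediction -> audio feature information
            self.prediction_net = nn.LSTM(sim_dim, hidden_dim, batch_first=True)
            # joint network (claim 5): fuse both streams into hidden data
            self.joint_net = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh())
            # classification model: hidden data -> similarity prediction result
            self.classifier = nn.Linear(hidden_dim, sim_dim)

        def forward(self, audio_feature_vectors, prev_similarity):
            semantic_info, _ = self.transcription_net(audio_feature_vectors)
            audio_feature_info, _ = self.prediction_net(prev_similarity)
            t = min(semantic_info.size(1), audio_feature_info.size(1))   # crude time alignment
            hidden_data = self.joint_net(
                torch.cat([semantic_info[:, :t], audio_feature_info[:, :t]], dim=-1))
            return torch.sigmoid(self.classifier(hidden_data)).mean(dim=1)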
6. The method of claim 3, further comprising:
determining the audio feature vector from the audio data of the target object by means of Mel-frequency cepstral coefficients (MFCC).
7. The method of claim 6, wherein the determining the audio feature vector from the audio data of the target object by means of Mel-frequency cepstral coefficients (MFCC) comprises:
acquiring the audio data of the target object;
performing framing processing on the waveform of the audio data to obtain at least one frame segment;
performing Discrete Fourier Transform (DFT) on each frame segment in the at least one frame segment to determine a power spectrum of each frame segment;
and performing data conversion on the power spectrum to obtain the audio feature vector.
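The framing, DFT, power-spectrum, and data-conversion steps of claims 6-7 roughly correspond to a standard MFCC pipeline. The NumPy/SciPy sketch below follows those steps; the frame length, hop size, Hamming window, number of mel filters, and number of retained coefficients are conventional values assumed for illustration and are not taken from the patent.

    # Step-by-step MFCC sketch following claim 7; all numeric settings are assumptions.
    import numpy as np
    from scipy.fftpack import dct

    def mfcc_features(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
                      nfft=512, n_filters=26, n_ceps=13):
        # 1) framing: cut the waveform into overlapping frame segments
        #    (assumes the signal is at least one frame long)
        flen, fstep = int(frame_len * sample_rate), int(frame_step * sample_rate)
        n_frames = 1 + max(0, (len(signal) - flen) // fstep)
        frames = np.stack([signal[i * fstep:i * fstep + flen] for i in range(n_frames)])
        frames = frames * np.hamming(flen)

        # 2) DFT of each frame segment and its power spectrum
        power_spec = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

        # 3) data conversion: mel filter bank, log compression, DCT -> audio feature vector
        mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
        hz_points = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
        bins = np.floor((nfft + 1) * hz_points / sample_rate).astype(int)
        fbank = np.zeros((n_filters, nfft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        log_energy = np.log(power_spec @ fbank.T + 1e-10)
        return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_ceps]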
8. A speech recognition method using a target speech recognition model trained by the method of any one of claims 1 to 6, the method comprising:
acquiring target audio data;
inputting the target audio data into the target speech recognition model to obtain dialog information;
wherein the dialog information comprises text data corresponding to the target audio data, and the text data carries identity information of the target object.
9. The method of claim 8, wherein the acquiring target audio data comprises:
preprocessing the received audio data to obtain the target audio data;
wherein the preprocessing comprises data cleansing and/or noise reduction.
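For claims 8-9, a minimal sketch of the recognition path might clean and denoise the received audio before handing it to the trained target model. The amplitude normalisation, the simple gate used as "noise reduction", and the model interface that returns text together with a speaker identity are all hypothetical; the claims only require that the dialog information contain text data carrying the identity of the target object.

    # Inference sketch for claims 8-9; preprocessing details and the model's
    # return interface are assumptions made for illustration only.
    import numpy as np

    def preprocess(received_audio: np.ndarray) -> np.ndarray:
        """Data cleansing and a crude form of noise reduction."""
        audio = received_audio[np.isfinite(received_audio)]   # data cleansing: drop invalid samples
        audio = audio / (np.max(np.abs(audio)) + 1e-9)        # normalise amplitude
        audio = np.where(np.abs(audio) < 0.02, 0.0, audio)    # simple noise gate relative to peak
        return audio

    def recognize(target_model, received_audio: np.ndarray) -> dict:
        """Return dialog information: text data carrying the target object's identity."""
        target_audio = preprocess(received_audio)
        text, speaker_identity = target_model(target_audio)   # hypothetical model interface
        return {"text": text, "speaker_identity": speaker_identity}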
10. An apparatus for training a speech recognition model, the apparatus comprising:
a processing module, used for determining a speech training sample according to audio data of a target object, wherein the speech training sample comprises semantic information and audio feature information;
and a generation module, used for inputting the semantic information and the audio feature information into a speech recognition model, and performing iterative training on the speech recognition model until a preset training condition is satisfied, to obtain a trained target speech recognition model.
11. A speech recognition apparatus using a target speech recognition model trained by the method of claim 1, the apparatus comprising:
an acquisition module, used for acquiring target audio data;
and a processing module, used for inputting the target audio data into the target speech recognition model to obtain dialog information;
wherein the dialog information comprises text data corresponding to the target audio data, and the text data carries identity information of the target object.
12. A terminal device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method for training a speech recognition model according to any one of claims 1 to 7, or the speech recognition method using a target speech recognition model according to claim 8 or 9.
13. A computer-readable storage medium having stored thereon a computer program which, when executed by a computer, causes the computer to execute the method for training a speech recognition model according to any one of claims 1 to 7, or the speech recognition method using a target speech recognition model according to claim 8 or 9.
CN201911384482.4A 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment Active CN113129867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911384482.4A CN113129867B (en) 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment


Publications (2)

Publication Number Publication Date
CN113129867A (en) 2021-07-16
CN113129867B CN113129867B (en) 2024-05-24

Family

ID=76767254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911384482.4A Active CN113129867B (en) 2019-12-28 2019-12-28 Training method of voice recognition model, voice recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN113129867B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102543076A (en) * 2011-01-04 2012-07-04 中国移动通信集团公司 Speech training method and corresponding system for phonetic entry method
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism
US20210312905A1 (en) * 2020-04-03 2021-10-07 Microsoft Technology Licensing, Llc Pre-Training With Alignments For Recurrent Neural Network Transducer Based End-To-End Speech Recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kanishka Rao, et al.: "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer", 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 193-199 *
范汝超: "Research on End-to-End Speech Recognition" (端到端的语音识别研究), China Excellent Master's Theses Full-text Database (Information Science and Technology), no. 9

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436634A (en) * 2021-07-30 2021-09-24 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
CN113436634B (en) * 2021-07-30 2023-06-20 中国平安人寿保险股份有限公司 Voice classification method and device based on voiceprint recognition and related equipment
US11908454B2 (en) 2021-12-01 2024-02-20 International Business Machines Corporation Integrating text inputs for training and adapting neural network transducer ASR models
CN114429766A (en) * 2022-01-29 2022-05-03 北京百度网讯科技有限公司 Method, device and equipment for adjusting playing volume and storage medium
CN117784632A (en) * 2024-02-28 2024-03-29 深圳市轻生活科技有限公司 Intelligent household control system based on offline voice recognition
CN117784632B (en) * 2024-02-28 2024-05-14 深圳市轻生活科技有限公司 Intelligent household control system based on offline voice recognition

Also Published As

Publication number Publication date
CN113129867B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
WO2021135577A1 (en) Audio signal processing method and apparatus, electronic device, and storage medium
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
US11705105B2 (en) Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same
US20150325240A1 (en) Method and system for speech input
CN112259106A (en) Voiceprint recognition method and device, storage medium and computer equipment
CN112464661B (en) Model training method, voice conversation detection method and related equipment
CN109686383B (en) Voice analysis method, device and storage medium
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
US11398219B2 (en) Speech synthesizer using artificial intelligence and method of operating the same
CN110047481A (en) Method for voice recognition and device
US11417313B2 (en) Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
US20210110815A1 (en) Method and apparatus for determining semantic meaning of pronoun
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN109688271A (en) The method, apparatus and terminal device of contact information input
CN110728993A (en) Voice change identification method and electronic equipment
CN110853669A (en) Audio identification method, device and equipment
US10522135B2 (en) System and method for segmenting audio files for transcription
CN109887490A (en) The method and apparatus of voice for identification
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant