CN112086087A - Speech recognition model training method, speech recognition method and device - Google Patents

Speech recognition model training method, speech recognition method and device

Info

Publication number
CN112086087A
Authority
CN
China
Prior art keywords
coding
network
voice
unit
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010961964.8A
Other languages
Chinese (zh)
Other versions
CN112086087B (en)
Inventor
唐浩雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202010961964.8A
Publication of CN112086087A
Application granted
Publication of CN112086087B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses a speech recognition model training method, a speech recognition method and a speech recognition device, which include the following steps: acquiring a first voice sequence without labeled text and a second voice sequence with labeled text; inputting the first voice sequence into a coding network to obtain a first coding feature of each voice unit in the first voice sequence and a content feature of a specified voice unit; predicting a second coding feature of the voice unit subsequent to the specified voice unit based on the content feature; calculating a contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network; and, after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the decoding network together. Because the coding network is trained by computing the contrast coding loss from the first and second coding features, it can be trained with unlabeled data, which reduces the amount of labeled data required and the cost of acquiring training data.

Description

Speech recognition model training method, speech recognition method and device
Technical Field
Embodiments of the present invention relate to the field of speech recognition technologies, and in particular, to a speech recognition model training method, a speech recognition method, a speech recognition model training apparatus, a speech recognition apparatus, an electronic device, and a storage medium.
Background
On a live broadcast platform, the content of a large number of live broadcast rooms often needs to be supervised. The supervised objects include images and voice, and the voice in a live broadcast mainly comes from the host's speech. For the supervision of speech content, the speech is usually recognized as text, and the text is then screened.
In the prior art, speech is input into a trained speech recognition model to obtain the corresponding text. The speech recognition model includes a coding network and a decoding network: the coding network encodes the input speech to obtain speech features, and the decoding network decodes the encoded speech features to obtain text. When a speech recognition model is trained, both the encoding network and the decoding network need to be trained, and a loss function must be calculated for each. Specifically, labels are obtained by annotating speech data, and the decoding network and encoding network are trained with the labeled speech data: the loss must be calculated from the labels of the training data while training the encoding network, and again while training the decoding network and encoding network together. That is, the whole training process relies on a large amount of labeled speech data, so a large amount of unlabeled speech data cannot be utilized, which increases the cost of acquiring training data.
Disclosure of Invention
The embodiment of the invention provides a speech recognition model training method, a speech recognition method, a speech recognition device, an electronic device and a storage medium, and aims to solve the problem that training data is costly because the existing speech recognition model training process depends on labeled data throughout, so that a large amount of unlabeled data cannot be used.
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, including:
acquiring a training data set, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text;
inputting the first voice sequence into an initialized coding network to obtain a first coding characteristic of a voice unit in the first voice sequence and a content characteristic of a specified voice unit;
predicting a second encoding characteristic of a speech unit subsequent to the specified speech unit based on the content characteristic;
calculating a contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network;
after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the initialized decoding network, wherein the trained coding network and decoding network are used as voice recognition models.
In a second aspect, an embodiment of the present invention provides a speech recognition method, including:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a pre-trained voice recognition model to obtain a recognition text;
the speech recognition model is trained by the speech recognition model training method according to any embodiment of the invention.
In a third aspect, an embodiment of the present invention provides a speech recognition model training apparatus, including:
the training data set acquisition module is used for acquiring a training data set, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text;
the coding network coding module is used for inputting the first voice sequence into the initialized coding network to obtain a first coding characteristic of a voice unit in the first voice sequence and a content characteristic of a specified voice unit;
the coding feature prediction module is used for predicting a second coding feature of the voice unit after the specified voice unit according to the content feature;
the coding network training module is used for calculating contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network;
and the coding network and decoding network training module is used for inputting the second voice sequence into the coding network after the coding network is trained so as to train the coding network and the initialized decoding network, and the trained coding network and decoding network are used as voice recognition models.
In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
the voice data to be recognized acquisition module is used for acquiring voice data to be recognized;
the voice recognition module is used for inputting the voice data to be recognized into a pre-trained voice recognition model to obtain a recognition text;
the speech recognition model is trained by the speech recognition model training method according to any embodiment of the invention.
In a fifth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the speech recognition model training method and/or the speech recognition method of any of the embodiments of the present invention.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a speech recognition model training method and/or a speech recognition method according to any embodiment of the present invention.
The speech recognition model includes a coding network and a decoding network. When the coding network is trained, a first speech sequence without labeled text is input into the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence and the content feature of a specified speech unit; after the second coding feature of a speech unit following the specified speech unit is predicted from the content feature, a contrast coding loss is calculated from the first coding feature and the second coding feature of that speech unit to train the coding network; finally, the coding network and the decoding network are trained with a second speech sequence with labeled text to obtain the final speech recognition model. Because the second coding feature is predicted from the content feature during coding-network training, and the network parameters of the coding network are adjusted by computing the contrast coding loss from the first and second coding features, the first speech sequence does not need labeled text. A large amount of speech data without labeled text can therefore be used as first speech sequences to train the coding network, reducing the amount of text-labeled training data required to train a speech recognition model and reducing the cost of training data.
Drawings
FIG. 1 is a schematic diagram of a speech recognition model in the prior art;
FIG. 2 is a flowchart illustrating steps of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 3A is a flowchart illustrating steps of a method for training a speech recognition model according to a second embodiment of the present invention;
FIG. 3B is a schematic diagram of an encoding network of an embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a speech recognition method according to a third embodiment of the present invention;
FIG. 5 is a block diagram of a speech recognition model training apparatus according to a fourth embodiment of the present invention;
FIG. 6 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures. The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
FIG. 1 is a diagram of a speech recognition model in the prior art.
As shown in FIG. 1, the speech recognition model is an end-to-end neural network that generally comprises an encoding network (Encoder), an alignment network (CTC), and an attention decoder (Attention-Decoder). In FIG. 1, O_n is the input speech signal, and the blocks in the Encoder are its neural network layers. After passing through the Encoder, the speech signal O_n yields the hidden feature h_n. The hidden feature h_n serves as the input of the CTC branch to compute the CTC loss function together with the recognized character y_n; computing the CTC loss requires comparing the recognized character y_n with the labeled text of the speech signal O_n. The hidden feature h_n also serves as the input of the attention decoder, which decodes it into the character y_n, and the attention (ATT) loss function is likewise computed by comparing against the labeled text of O_n. Therefore, when a speech recognition model is trained in the prior art, both the encoding network and the decoding network compute their losses from the labeled text of the speech signal, which undoubtedly requires a large amount of speech data with labeled text and increases the cost of training data. To solve this problem, the first and second embodiments of the invention provide the following speech recognition model training method.
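For orientation, the following is a minimal sketch of how such a joint CTC/attention objective is typically computed; the weighting factor, tensor shapes, and module names are illustrative assumptions and are not taken from the patent.

```python
import torch
import torch.nn as nn

# Hedged sketch of the prior-art objective described above: both the CTC
# branch and the attention decoder compute their losses against the labeled
# text y_n of the speech signal O_n. The 0.3 weight is an assumption.
class HybridCTCAttentionLoss(nn.Module):
    def __init__(self, blank_id: int = 0, ctc_weight: float = 0.3):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att = nn.CrossEntropyLoss()
        self.ctc_weight = ctc_weight

    def forward(self, ctc_log_probs, att_logits, labels, input_lens, label_lens):
        # ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
        # att_logits:    (B, L, V) logits from the attention decoder
        # labels:        (B, L)    the annotated text, required by BOTH losses
        loss_ctc = self.ctc(ctc_log_probs, labels, input_lens, label_lens)
        loss_att = self.att(att_logits.transpose(1, 2), labels)
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```

The point of the sketch is that every term depends on `labels`, which is exactly the dependence on annotated text that the following embodiments remove for the encoder.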
Example one
Fig. 2 is a flowchart of the steps of a method for training a speech recognition model according to an embodiment of the present invention. The method is applicable to training a speech recognition model and can be executed by the speech recognition model training apparatus according to an embodiment of the present invention, which can be implemented in hardware or software and integrated in the electronic device according to an embodiment of the present invention. Specifically, as shown in fig. 2, the method may include the following steps:
s201, a training data set is obtained, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text.
Specifically, a speech sequence may be an ordered sequence of speech units connected in time order. In practical application, any speech data can be acquired and divided into a plurality of speech units according to a preset duration to obtain a speech sequence. For each speech sequence, if a corresponding text is labeled, it is a second speech sequence with labeled text; if no text is labeled, it is a first speech sequence without labeled text, where the text is text that expresses the semantics of the speech sequence.
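As an illustration, a minimal sketch of dividing raw audio into fixed-duration speech units follows; the 16 kHz sample rate and 25 ms unit duration are illustrative assumptions, not values from the patent.

```python
import numpy as np

def split_into_units(waveform: np.ndarray, sample_rate: int = 16000,
                     unit_ms: int = 25) -> list[np.ndarray]:
    """Cut a 1-D waveform into consecutive fixed-duration speech units,
    kept in time order so the sequence structure is preserved."""
    unit_len = int(sample_rate * unit_ms / 1000)  # samples per speech unit
    n_units = len(waveform) // unit_len           # trailing remainder dropped
    return [waveform[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]
```

A sequence with a corresponding transcript would then serve as a second speech sequence; one without would serve as a first speech sequence.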
S202, inputting the first voice sequence into the initialized coding network to obtain the first coding feature of the voice unit in the first voice sequence and the content feature of the appointed voice unit.
In the embodiment of the invention, the voice recognition model comprises a coding network and a decoding network, wherein the coding network codes an input voice sequence, extracts the coding characteristics of the voice sequence, and the decoding network decodes the coding characteristics to obtain a corresponding text.
In particular, in the embodiment of the present invention, the coding network includes a primary coding network and a secondary coding network. The first speech sequence is input into the primary coding network to obtain the first coding feature of each speech unit in the first speech sequence. For each speech unit, the first coding feature of that unit and the state quantity of the previous speech unit can be input into the secondary coding network to obtain the content feature corresponding to the unit. The specified speech unit may be a speech unit other than the first and last speech units in the first speech sequence; the state quantity may be the state of the secondary coding network after it encodes a given speech unit; and the content feature of the specified speech unit may be the result of the secondary coding network encoding the first coding feature of the specified speech unit together with the first coding features of the speech units before it.
And S203, predicting the second coding characteristics of the voice unit behind the appointed voice unit according to the content characteristics.
In the embodiment of the invention, the primary coding network encodes the first speech sequence with a preset step size, so several speech units are encoded at a time. After the first speech sequence has been encoded for one step, the last speech unit within that step can be taken as the specified speech unit, and the second coding features of the speech units in the next step can be predicted from the content feature of the specified speech unit. In one example, a linear matrix may be set, and the content feature of the specified speech unit is multiplied by this matrix to obtain the second coding feature of a speech unit in the next step; that is, the second coding feature of a speech unit in the next step is the second coding feature of a speech unit after the specified speech unit.
S204, calculating contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network.
Specifically, for each first speech unit after the specified speech unit, its first coding feature is obtained by encoding it with the primary coding network, and its second coding feature can be predicted from the content feature of the specified speech unit. A positive sample pair can be constructed from the first and second coding features of the first speech unit, and a plurality of negative sample pairs can be constructed from the first coding features of a plurality of second speech units together with the second coding feature of the first speech unit. The similarity of the two coding features in each positive and negative sample pair is then calculated, and the contrast coding loss rate of the first speech unit is calculated from these similarities. Finally, the average of the contrast coding loss rates of the several first speech units after the specified speech unit is calculated to obtain the loss rate of the current round of training, and whether the loss rate is smaller than a preset threshold is judged. If so, training of the coding network stops; if not, the network parameters of the coding network are adjusted according to the loss rate, and the process returns to the step of inputting the first speech sequence into the primary coding network to obtain the first coding feature of each speech unit. The trained coding network is obtained once the loss rate is smaller than the preset threshold.
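A minimal sketch of this contrastive loss computation follows, assuming a feature dimension, a number of prediction steps, and dot-product similarity; these specifics are illustrative assumptions rather than the patent's prescription.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCodingLoss(nn.Module):
    """Hedged sketch: one linear matrix W_k per prediction offset k."""
    def __init__(self, feat_dim: int = 256, max_k: int = 3):
        super().__init__()
        self.predictors = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim, bias=False) for _ in range(max_k))

    def forward(self, h: torch.Tensor, c_t: torch.Tensor, t: int) -> torch.Tensor:
        # h:   (N, D) first coding features h_1..h_N from the primary network
        # c_t: (D,)   content feature of the specified speech unit x_t
        losses = []
        for k, w_k in enumerate(self.predictors, start=1):
            if t + k >= h.size(0):
                break
            pred = w_k(c_t)                 # second coding feature W_k c_t
            scores = h @ pred               # similarity of W_k c_t to every h_j
            # (h_{t+k}, W_k c_t) is the positive pair; every other unit forms a
            # negative pair, so this reduces to a softmax cross-entropy at t+k.
            target = torch.tensor([t + k])
            losses.append(F.cross_entropy(scores.unsqueeze(0), target))
        return torch.stack(losses).mean()   # average over the first speech units
```

Nothing in this loss touches a transcript, which is why the first speech sequence may remain unlabeled.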
S205, after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the initialized decoding network, wherein the trained coding network and decoding network are used as voice recognition models.
After the coding network is trained, the coding network and the decoding network are connected to form a speech recognition model. In one example, where the coding network includes a primary coding network and a secondary coding network, the output layer of the primary coding network can be connected to the input layer of the decoding network to form the speech recognition model. When the speech recognition model is trained, a second speech sequence is input to the input layer of the primary coding network to obtain the predicted text after the decoding network decodes, the loss rate is calculated from the predicted text and the labeled text of the second speech sequence, and the network parameters of the primary coding network and the decoding network are adjusted according to the loss rate until the loss rate is smaller than a preset value.
The speech recognition model includes a coding network and a decoding network. When the coding network is trained, a first speech sequence without labeled text is input into the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence and the content feature of a specified speech unit; after the second coding feature of a speech unit following the specified speech unit is predicted from the content feature, a contrast coding loss is calculated from the first and second coding features of that speech unit to train the coding network; finally, the coding network and the decoding network are trained with a second speech sequence with labeled text to obtain the final speech recognition model. Because the second coding feature is predicted from the content feature during coding-network training, and the coding network is trained by computing the contrast coding loss from the first and second coding features, the first speech sequence does not need labeled text; a large amount of speech data without labeled text can be used as first speech sequences, reducing the amount of text-labeled training data required to train a speech recognition model and reducing the cost of training data.
Example two
Fig. 3A is a flowchart of steps of a speech recognition model training method according to a second embodiment of the present invention, which is optimized based on the first embodiment of the present invention, and specifically, as shown in fig. 3A, the speech recognition model training method according to the second embodiment of the present invention may include the following steps:
s301, a training data set is obtained, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text.
Specifically, the first speech sequence is not labeled with text expressing its semantics, while the second speech sequence is labeled with text expressing its semantics; the training data set may contain multiple first speech sequences and multiple second speech sequences.
S302, inputting the first voice sequence into a primary coding network of the initialized coding network to obtain a first coding feature of each voice unit in the first voice sequence.
In the embodiment of the invention, the coding network comprises a primary coding network and a secondary coding network, wherein the primary coding network codes the voice units in the first voice sequence to obtain the first coding characteristics, and the secondary coding network codes the voice units by taking the first coding characteristics and the state quantity as input to obtain the content characteristics.
Fig. 3B is a schematic diagram of the coding network. In fig. 3B, X = {x_1, x_2, ..., x_N} is the first speech sequence, f_enc is the primary coding network, f_ar is the secondary coding network, h_t is the first coding feature, s_t is the state quantity, and c_t is the content feature.

When the first speech sequence X = {x_1, x_2, ..., x_N} is input into the primary coding network f_enc, the primary coding network f_enc encodes the speech units in the first speech sequence X with a preset step size to obtain the first coding feature of each speech unit, namely:

h_t = f_enc(x_t)
In one example, the primary coding network f_enc may be a VGGNet (Visual Geometry Group Network); of course, the primary coding network may also be another neural network, which is not limited in the embodiment of the present invention.
S303, aiming at each voice unit, inputting the first coding feature of the voice unit and the state quantity of the previous voice unit of the voice unit into a secondary coding network of the coding network to obtain the content feature of the specified voice unit.
Specifically, as shown in fig. 3B, for each speech unit x_t, the first coding feature h_t of x_t and the state quantity s_{t-1} are input into the secondary coding network f_ar to obtain the content feature c_t corresponding to the speech unit x_t, namely:

c_t = f_ar(h_t, s_{t-1})
As shown in fig. 3B, each speech unit x_t corresponds to a content feature c_t. The specified speech unit may be the last speech unit among those covered by the current step when the speech units in the first speech sequence X are encoded with the preset step size. As shown in fig. 3B, the step size is 4 and the primary coding network f_enc currently encodes the speech units x_{t-3}, x_{t-2}, x_{t-1}, x_t, so the specified speech unit is x_t; of course, the specified speech unit may be any speech unit in the first speech sequence.
In one example, the secondary coding network f_ar may be an RNN (Recurrent Neural Network), such as an LSTM (Long Short-Term Memory network); of course, it may also be another neural network, which is not limited in the embodiment of the present invention.
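A minimal sketch of the two-stage encoder follows; the feature sizes are assumptions, and a small fully connected stack stands in for the VGGNet mentioned above.

```python
import torch
import torch.nn as nn

class PrimaryEncoder(nn.Module):
    """f_enc: maps each speech unit x_t to its first coding feature h_t.
    A small MLP stands in here for the VGGNet named in the text."""
    def __init__(self, in_dim: int = 80, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, x):                  # x: (B, N, in_dim) speech units
        return self.net(x)                 # h: (B, N, feat_dim)

class SecondaryEncoder(nn.Module):
    """f_ar: consumes h_t together with the previous state s_{t-1} and
    emits the content feature c_t, realized here as an LSTM."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, h, state=None):
        c, state = self.rnn(h, state)      # c: (B, N, feat_dim) content features
        return c, state
```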
S304, multiplying the content feature by a preset linear matrix to obtain the second coding feature of a speech unit after the specified speech unit.

Specifically, a linear matrix W_k is initialized; the linear matrix W_k is a network parameter to be adjusted during training. The second coding feature of a speech unit after the specified speech unit is W_k c_t: with the specified speech unit x_t and the speech unit x_{t+k} after it, the predicted second coding feature of x_{t+k} is

ĥ_{t+k} = W_k c_t
S305, calculating the contrast coding loss rate of each first voice unit after the specified voice unit by using the first coding characteristics and the second coding characteristics of the first voice unit and the first coding characteristics of a plurality of second voice units except the first voice unit.
In an optional embodiment of the present invention, for each first speech unit after the specified speech unit, the first and second coding features of the first speech unit form a positive sample pair, and the second coding feature of the first speech unit together with the first coding features of a plurality of second speech units other than the first speech unit form a plurality of negative sample pairs. The similarity between the two coding features in the positive sample pair is calculated to obtain a first similarity, the similarities between the two coding features in the negative sample pairs are calculated to obtain second similarities, and the contrast coding loss rate of the first speech unit is calculated from the first similarity and the plurality of second similarities.
Specifically, for the specified speech unit x_t, each first speech unit after it is x_{t+k}; its first coding feature after encoding by the primary coding network f_enc is h_{t+k}, and its second coding feature predicted from the content feature c_t is ĥ_{t+k} = W_k c_t.

Then, for each first speech unit x_{t+k}, the positive sample pair is (h_{t+k}, W_k c_t) and the negative sample pairs are (h_j, W_k c_t), where h_j is the first coding feature of a second speech unit x_j other than the first speech unit x_{t+k}. The contrast coding loss rate of the first speech unit x_{t+k} can then be calculated by the following formula:

L_N = -E_X [ log( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]    (1)

where L_N is the contrast coding loss rate, the first speech sequence is X = {x_1, x_2, ..., x_N}, t is the sequence number of the specified speech unit, and t + k is the sequence number of the first speech unit after the specified speech unit.

In formula (1), f_k(x_{t+k}, c_t) = exp(h_{t+k}^T W_k c_t) represents the similarity between the first coding feature h_{t+k}, obtained after the primary coding network f_enc encodes the first speech unit x_{t+k}, and the second coding feature ĥ_{t+k} = W_k c_t predicted from the content feature c_t of the specified speech unit x_t, with W_k being the linear matrix; f_k(x_j, c_t) = exp(h_j^T W_k c_t) represents the similarity between the first coding feature h_j, obtained after the primary coding network encodes the second speech unit x_j, and the predicted second coding feature, where x_j is a second speech unit in the first speech sequence X other than the first speech unit x_{t+k}.
S306, calculating the average value of the contrast coding loss rates of the first voice units to obtain the loss rate.
In practical application, there may be a plurality of first speech units after the specified speech unit. After the contrast coding loss rate of each first speech unit is calculated, the average of these contrast coding loss rates may be taken as the loss rate of the current training iteration of the coding network.
In order that those skilled in the art may more clearly understand the process of calculating the loss rate according to the embodiment of the present invention, an example is given below in conjunction with fig. 3B:
As shown in fig. 3B, the step size of the primary coding network f_enc is 4, i.e. 4 speech units can be encoded at a time. The current step encodes the speech units x_{t-3}, x_{t-2}, x_{t-1}, x_t, so the specified speech unit is determined to be x_t. The first coding features corresponding to x_{t-3}, x_{t-2}, x_{t-1}, x_t are h_{t-3}, h_{t-2}, h_{t-1}, h_t, and the corresponding content features are c_{t-3}, c_{t-2}, c_{t-1}, c_t. The first speech units after the specified speech unit x_t are x_{t+1}, x_{t+2}, x_{t+3}, with first coding features h_{t+1}, h_{t+2}, h_{t+3}; from the content feature c_t, the second coding features ĥ_{t+1} = W_1 c_t, ĥ_{t+2} = W_2 c_t, ĥ_{t+3} = W_3 c_t of x_{t+1}, x_{t+2}, x_{t+3} are predicted respectively. Then, for each first speech unit (x_{t+1}, x_{t+2}, x_{t+3}), the positive and negative sample pairs are as follows:

For the first speech unit x_{t+1}: the positive sample pair is (h_{t+1}, W_1 c_t), and the negative sample pairs are (h_{t+2}, W_1 c_t), (h_{t+3}, W_1 c_t), etc.;

For the first speech unit x_{t+2}: the positive sample pair is (h_{t+2}, W_2 c_t), and the negative sample pairs are (h_{t+1}, W_2 c_t), (h_{t+3}, W_2 c_t), etc.;

For the first speech unit x_{t+3}: the positive sample pair is (h_{t+3}, W_3 c_t), and the negative sample pairs are (h_{t+1}, W_3 c_t), (h_{t+2}, W_3 c_t), etc.

For the first speech unit x_{t+1}, the loss rate is calculated as follows: the similarity of the two coding features in the positive sample pair (h_{t+1}, W_1 c_t) is calculated to obtain the first similarity, and the similarities of the two coding features in each negative sample pair are calculated to obtain a plurality of second similarities; the sum of the second similarities gives the second-similarity sum, the first similarity is used as the numerator of formula (1) and the second-similarity sum as its denominator, yielding the contrast coding loss rate of the first speech unit x_{t+1}. Averaging the contrast coding loss rates of x_{t+1}, x_{t+2}, x_{t+3} then gives the loss rate of the current round of training.

As can be seen from formula (1), the closer the two coding features in a positive sample pair are, the larger the value of f_k(x_{t+k}, c_t); and the smaller the value of the negative sample pair term f_k(x_j, c_t), the smaller the contrast coding loss rate L_N. The purpose of training the coding network is to optimize the network parameters of the primary coding network and the secondary coding network and the linear matrix used for prediction, so that the predicted second coding feature of each speech unit is close to its first coding feature, and the contrast coding loss rate is finally minimized.
And S307, judging whether the loss rate is smaller than a preset threshold value.
Specifically, after each round of training is completed, whether the precision of the coding network is sufficient is determined from the loss rate. If the loss rate is smaller than the preset threshold, the precision of the coding network is sufficient, S308 may be executed, and training of the coding network stops; otherwise, the coding network needs further training, and S309 is executed.
And S308, stopping training the coding network.
That is, when the loss rate is smaller than the preset threshold, the network parameters of the primary coding network in the coding network are saved, and then S310-S311 are executed.
S309, adjusting the network parameters of the coding network according to the loss rate.
Specifically, when the loss rate is not smaller than the preset threshold, a gradient is calculated from the loss rate, gradient descent is performed on the network parameters of the primary and secondary coding networks and on the preset linear matrix, and the process returns to S302 to continue training the coding network until the loss rate is smaller than the preset threshold.
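A hedged sketch of this pre-training loop follows, reusing the hypothetical modules sketched above; the optimizer, learning rate, stopping threshold, and batching are all illustrative assumptions.

```python
import torch

def pretrain_encoder(primary_enc, secondary_enc, loss_fn, unlabeled_batches,
                     threshold: float = 0.1, lr: float = 1e-4, step: int = 4):
    # loss_fn is the ContrastiveCodingLoss sketched earlier; its W_k matrices
    # are trained jointly with both coding networks, matching S309.
    params = (list(primary_enc.parameters()) + list(secondary_enc.parameters())
              + list(loss_fn.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    while True:
        total, n = 0.0, 0
        for x in unlabeled_batches():          # first speech sequences, no labels
            h = primary_enc(x)                 # (B, N, D) first coding features
            c, _ = secondary_enc(h)            # (B, N, D) content features
            t = h.size(1) - 1 - step           # specified unit: last of a step
            loss = loss_fn(h[0], c[0, t], t)   # contrast coding loss rate
            opt.zero_grad(); loss.backward(); opt.step()   # gradient descent
            total, n = total + loss.item(), n + 1
        if total / n < threshold:              # stop once the loss rate is small
            return
```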
According to the embodiment of the invention, pre-training the coding network in this way lets it learn temporal information, and the loss rate is calculated without labels, so the training data need not be annotated. A large amount of unlabeled data can thus be used to train the coding network, reducing the amount of labeled data needed to train the speech recognition model and reducing the cost of training data.
S310, adopting a primary coding network of the coding network and an initialized decoding network to construct a voice recognition model.
After the coding network is trained, the primary coding network in the coding network can be connected with the initialized decoding network to form a speech recognition model. Specifically, the output layer of the primary coding network can be connected to the input layer of the initialized decoding network to obtain the speech recognition model, so that the first coding features output by the primary coding network are input into the decoding network and decoded to obtain the predicted text.
S311, inputting the second voice sequence into the voice recognition model, so as to train the primary coding network and the decoding network to obtain a trained voice recognition model.
After the primary coding network and the decoding network are connected to form a speech recognition model, the speech recognition model is trained globally. Specifically, a second speech sequence can be input into the primary coding network so that the decoding network outputs a predicted text; the loss rate is calculated from the predicted text and the labeled text of the second speech sequence, and whether the loss rate is smaller than a preset threshold is judged. If so, training of the primary coding network and the decoding network stops; if not, the network parameters of the primary coding network and the decoding network are adjusted according to the loss rate, and the process returns to the step of inputting the second speech sequence into the primary coding network to output the predicted text at the decoding network. Once the loss rate is smaller than the preset threshold, the finally trained speech recognition model, formed by the trained primary coding network and decoding network, is obtained.
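A hedged sketch of this global training stage follows; the decoder interface, cross-entropy loss, and threshold are illustrative assumptions, since the patent only requires comparing the predicted text against the labeled text and adjusting both networks.

```python
import torch
import torch.nn.functional as F

def train_recognizer(primary_enc, decoder, labeled_batches,
                     threshold: float = 0.1, lr: float = 1e-4):
    # primary_enc carries over its pre-trained parameters; decoder is the
    # initialized decoding network connected to the encoder's output layer.
    params = list(primary_enc.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    while True:
        total, n = 0.0, 0
        for x, labels in labeled_batches():    # second speech sequences + text
            feats = primary_enc(x)             # first coding features
            logits = decoder(feats)            # (B, L, V) predicted-text logits
            loss = F.cross_entropy(logits.transpose(1, 2), labels)
            opt.zero_grad(); loss.backward(); opt.step()
            total, n = total + loss.item(), n + 1
        if total / n < threshold:              # loss rate below preset threshold
            return
```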
In the embodiment of the invention, a first speech sequence without labeled text is input into the primary coding network of the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence; for each speech unit, the first coding feature of that unit and the state quantity of the previous speech unit are input into the secondary coding network of the coding network to obtain the content feature of the specified speech unit; the content feature is multiplied by a preset linear matrix to obtain the second coding feature of a speech unit after the specified speech unit; for each first speech unit after the specified speech unit, the contrast coding loss rate of the first speech unit is calculated using the first and second coding features of the first speech unit and the first coding features of a plurality of second speech units other than the first speech unit; the average of the contrast coding loss rates is calculated to obtain the loss rate, and the network parameters of the coding network are adjusted according to the loss rate. After the primary coding network is trained, the primary coding network and the decoding network form a speech recognition model, which is trained with a second speech sequence with labeled text. Because the second coding feature is predicted from the content feature during coding-network training, and the coding network is trained by computing the contrast coding loss from the first and second coding features, the first speech sequence needs no labeled text; a large amount of speech data without labeled text can be used as first speech sequences to train the coding network, reducing the amount of text-labeled training data required to train a speech recognition model and reducing the cost of training data.
EXAMPLE III
Fig. 4 is a flowchart of the steps of a speech recognition method according to a third embodiment of the present invention. The speech recognition method is applicable to recognizing speech as text and may be executed by the speech recognition apparatus according to an embodiment of the present invention, which may be implemented in hardware or software and integrated in the electronic device according to an embodiment of the present invention. Specifically, as shown in fig. 4, the method may include the following steps:
s401, voice data to be recognized are obtained.
In the embodiment of the present invention, the voice data to be recognized may be any voice data that needs to be recognized as text, such as voice data from a short video or live broadcast platform, or from a movie or television show. The language of the voice data to be recognized may be Chinese, English or another language, or even a local dialect.
S402, inputting the voice data to be recognized into a pre-trained voice recognition model to obtain a recognition text.
The embodiment of the present invention may train the speech recognition model by using the speech recognition model training method provided in the first embodiment or the second embodiment, where the speech recognition model may obtain the recognition text corresponding to the speech data after inputting the speech data, and the speech recognition model training method may refer to the first embodiment and the second embodiment, and will not be described in detail herein.
After the voice data to be recognized is obtained, preprocessing such as denoising and enhancement can be performed on it; the preprocessed voice data is divided into a plurality of voice segments to obtain a voice sequence, and the voice sequence is input into the trained voice recognition model. The voice sequence is encoded by the coding network of the voice recognition model to obtain coding features, which are then decoded by the decoding network to obtain the recognized text. The recognized text can then be screened to determine whether violating content exists, thereby achieving voice supervision.
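A minimal sketch of this inference flow follows; the placeholder preprocessing, direct use of raw units as features, and greedy decoding are illustrative assumptions reusing the hypothetical helpers sketched earlier.

```python
import numpy as np
import torch

def denoise(waveform: np.ndarray) -> np.ndarray:
    return waveform  # placeholder for real denoising/enhancement preprocessing

def recognize(waveform: np.ndarray, primary_enc, decoder, vocab: list) -> str:
    # split_into_units, primary_enc and decoder are the hypothetical helpers
    # sketched earlier in this document, not names from the patent.
    units = split_into_units(denoise(waveform))          # speech sequence
    # Feature extraction (e.g. filterbanks) is omitted here for brevity.
    x = torch.tensor(np.stack(units), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        feats = primary_enc(x)                           # encoding network
        logits = decoder(feats)                          # decoding network
    ids = logits.argmax(dim=-1).squeeze(0).tolist()      # greedy decoding
    return "".join(vocab[i] for i in ids)                # recognized text
```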
In the embodiment of the invention, when the speech recognition model required for speech recognition is trained, a first speech sequence without labeled text is input into the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence and the content feature of a specified speech unit; after the second coding feature of a speech unit following the specified speech unit is predicted from the content feature, a contrast coding loss is calculated from the first and second coding features of that speech unit to train the coding network; finally, the coding network and decoding network are trained with a second speech sequence with labeled text to obtain the final speech recognition model. Because the second coding feature is predicted from the content feature during coding-network training, and the coding network is trained by computing the contrast coding loss from the first and second coding features, the first speech sequence needs no labeled text; a large amount of speech data without labeled text can be used as first speech sequences to train the coding network, reducing the amount of text-labeled training data required and the cost of training data.
Example four
Fig. 5 is a block diagram of a structure of a speech recognition model training apparatus according to a fourth embodiment of the present invention, and as shown in fig. 5, the speech recognition model training apparatus according to the fourth embodiment of the present invention may specifically include the following modules:
a training data set obtaining module 501, configured to obtain a training data set, where the training data set includes a first speech sequence without a labeled text and a second speech sequence with a labeled text;
a coding network coding module 502, configured to input the first voice sequence into an initialized coding network, so as to obtain a first coding feature of a voice unit in the first voice sequence and a content feature of an assigned voice unit;
an encoding feature prediction module 503, configured to predict, according to the content feature, a second encoding feature of the speech unit after the specified speech unit;
a coding network training module 504, configured to calculate a contrast coding loss according to the first coding feature and the second coding feature of the speech unit after the specified speech unit, so as to train the coding network;
and an encoding network and decoding network training module 505, configured to, after the encoding network is trained, input the second speech sequence into the encoding network to train the encoding network and the initialized decoding network, where the trained encoding network and decoding network serve as speech recognition models.
The speech recognition model training device provided by the embodiment of the invention can execute the speech recognition model training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE five
Fig. 6 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention, and as shown in fig. 6, the speech recognition apparatus according to the fifth embodiment of the present invention may specifically include the following modules:
a to-be-recognized voice data acquisition module 601, configured to acquire to-be-recognized voice data;
a speech recognition module 602, configured to input the speech data to be recognized into a pre-trained speech recognition model to obtain a recognition text;
the speech recognition model is trained by the speech recognition model training method according to any embodiment of the invention.
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by the third embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE six
Referring to fig. 7, a schematic structural diagram of an electronic device in one example of the invention is shown. As shown in fig. 7, the electronic device may specifically include: a processor 701, a storage device 702, a display screen 703 with touch functionality, an input device 704, an output device 705, and a communication device 706. The number of the processors 701 in the electronic device may be one or more, and one processor 701 is taken as an example in fig. 7. The processor 701, the storage device 702, the display 703, the input device 704, the output device 705, and the communication device 706 of the electronic apparatus may be connected by a bus or other means, and fig. 7 illustrates an example of connection by a bus. The electronic device is used for executing the speech recognition model training method provided by any embodiment of the invention and/or the speech recognition method.
Embodiments of the present invention also provide a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the device to perform a speech recognition model training method and/or a speech recognition method as described in the above method embodiments.
It should be noted that, as for the embodiments of the apparatus, the electronic device, and the storage medium, since they are basically similar to the embodiments of the method, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the embodiments of the method.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious modifications, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (15)

1. A method for training a speech recognition model, comprising:
acquiring a training data set, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text;
inputting the first voice sequence into an initialized coding network to obtain a first coding characteristic of a voice unit in the first voice sequence and a content characteristic of a specified voice unit;
predicting a second encoding characteristic of a speech unit subsequent to the specified speech unit based on the content characteristic;
calculating a contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network;
after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the initialized decoding network, wherein the trained coding network and decoding network are used as voice recognition models.
2. The method of claim 1, wherein the coding network comprises a primary coding network and a secondary coding network, and the inputting the first speech sequence into the initialized coding network to obtain the first coding characteristics of the speech units in the first speech sequence and the content characteristics of the specified speech units comprises:
inputting the first voice sequence into a primary coding network of the initialized coding network to obtain a first coding characteristic of each voice unit in the first voice sequence;
and for each voice unit, inputting the first coding characteristic of the voice unit and the state quantity of the previous voice unit of the voice unit into a secondary coding network of the coding network to obtain the content characteristic of the specified voice unit.
3. The method of claim 1, wherein predicting the second coding feature of the speech unit subsequent to the specified speech unit based on the content feature comprises:
and multiplying the content characteristics by a preset linear matrix to obtain a second coding characteristic of the voice unit after the appointed voice unit.
4. The method according to any of claims 1-3, wherein the coding network comprises a primary coding network and a secondary coding network, and wherein the calculating a contrast coding loss from the first coding features and the second coding features of the phonetic units after the specified phonetic unit to train the coding network comprises:
for each first voice unit after the specified voice unit, calculating a contrast coding loss rate of the first voice unit by using the first coding feature of the first voice unit, the second coding feature and first coding features of a plurality of second voice units except the first voice unit;
calculating the average value of the contrast coding loss rates of the first voice units to obtain a loss rate;
judging whether the loss rate is smaller than a preset threshold value or not;
if so, stopping training the coding network;
if not, adjusting the network parameters of the coding network according to the loss rate, and returning to the step of inputting the first voice sequence into the primary coding network of the coding network to obtain the first coding feature of each voice unit in the first voice sequence.
5. The method of claim 4, wherein the calculating, for each first speech unit subsequent to the specified speech unit, the contrast coding loss ratio for the first speech unit using the first coding feature of the first speech unit, the second coding feature, and the first coding features of a plurality of second speech units other than the first speech unit comprises:
for each first voice unit, adopting the first coding characteristics and the second coding characteristics of the first voice unit to form a positive sample opposite example;
adopting the second coding characteristics of the first voice unit and the first coding characteristics of a plurality of second voice units except the first voice unit to form a plurality of negative sample pairs;
calculating the similarity of the first coding feature and the second coding feature in the positive sample opposite example to obtain a first similarity;
calculating the similarity of the first coding feature and the second coding feature in the negative sample pair examples to obtain a second similarity;
and calculating the contrast coding loss rate of the first speech unit according to the first similarity and the plurality of second similarities.
6. The method of claim 5, wherein said calculating a contrast coding loss rate for the first speech unit based on the first similarity and a plurality of second similarities comprises:
calculating the contrast coding loss rate of the first speech unit by the following formula:

L_N = -E_X [ log( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]

wherein L_N is the contrast coding loss rate, and the first speech sequence is X = {x_1, x_2, ..., x_N}; t is the sequence number of the specified speech unit, and t + k is the sequence number of the speech unit following the specified speech unit; f_k(x_{t+k}, c_t) represents the similarity between the first coding feature h_{t+k}, obtained after the primary coding network encodes the first speech unit x_{t+k}, and the second coding feature of the first speech unit x_{t+k} obtained from the content feature c_t of the specified speech unit, W_k being the linear matrix; x_j is a second speech unit in the first speech sequence X other than the first speech unit x_{t+k}, and f_k(x_j, c_t) represents the similarity between the first coding feature obtained after the primary coding network encodes the second speech unit x_j and the second coding feature of the first speech unit x_{t+k} obtained from the content feature c_t of the specified speech unit.
7. The method of claim 4, wherein the adjusting the network parameters of the coding network according to the loss rate comprises:
adjusting the network parameters of the primary coding network and the secondary coding network, and the preset linear matrix, according to the loss rate.
8. The method according to any one of claims 1-3, wherein the coding network comprises a primary coding network, and wherein the inputting, after the coding network is trained, the second speech sequence into the coding network to train the coding network and the initialized decoding network, the trained coding network and decoding network serving as a speech recognition model, comprises:
constructing a speech recognition model from the primary coding network of the coding network and the initialized decoding network;
and inputting the second speech sequence into the speech recognition model to train the primary coding network and the decoding network, obtaining a trained speech recognition model.
9. The method of claim 8, wherein the constructing a speech recognition model from the primary coding network of the coding network and the initialized decoding network comprises:
connecting the output layer of the primary coding network to the input layer of the initialized decoding network to obtain the speech recognition model.
10. The method of claim 9, wherein inputting the second speech sequence into the speech recognition model to train the primary coding network and the decoding network to obtain a trained speech recognition model comprises:
inputting the second speech sequence into the primary coding network so that the decoding network outputs a predicted text;
calculating a loss rate using the predicted text and the labeled text of the second speech sequence;
determining whether the loss rate is smaller than a preset threshold;
if so, stopping training the primary coding network and the decoding network;
if not, adjusting the network parameters of the primary coding network and the decoding network according to the loss rate, and returning to the step of inputting the second speech sequence into the primary coding network so that the decoding network outputs a predicted text.
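As a rough sketch of claims 8-10, not the claimed implementation, the construction and fine-tuning stage might look as follows in PyTorch; the class name, the cross-entropy criterion, and the threshold are illustrative assumptions.

```python
# Sketch of claims 8-10: the output layer of the pre-trained primary coding
# network feeds the input layer of the initialized decoding network, and both
# are trained on labeled second speech sequences (names illustrative).
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    def __init__(self, primary_coding_network: nn.Module, decoding_network: nn.Module):
        super().__init__()
        self.encoder = primary_coding_network    # pre-trained by contrast coding
        self.decoder = decoding_network          # freshly initialized

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))     # logits over text tokens

def finetune(model, labeled_pairs, optimizer, threshold=0.1, max_steps=10_000):
    criterion = nn.CrossEntropyLoss()            # assumed predicted-vs-labeled loss
    for _ in range(max_steps):
        for x, labeled_text in labeled_pairs:    # second speech sequence + labeled text
            logits = model(x)                    # predicted text distribution
            loss_rate = criterion(logits, labeled_text)
            if loss_rate.item() < threshold:     # claim 10's stopping criterion
                return model
            optimizer.zero_grad()
            loss_rate.backward()                 # adjust encoder and decoder jointly
            optimizer.step()
    return model
```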
11. A speech recognition method, comprising:
acquiring speech data to be recognized;
inputting the speech data to be recognized into a pre-trained speech recognition model to obtain a recognition text;
wherein the speech recognition model is trained by the speech recognition model training method of any one of claims 1-10.
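A minimal sketch of the recognition flow in claim 11, reusing the SpeechRecognitionModel sketched above; greedy argmax decoding is an illustrative assumption rather than the patent's decoding scheme.

```python
# Illustrative recognition flow (PyTorch assumed; greedy decoding is a guess).
import torch

@torch.no_grad()
def recognize(model: torch.nn.Module, speech_features: torch.Tensor) -> torch.Tensor:
    """speech_features: acoustic features of the speech data to be recognized."""
    model.eval()
    logits = model(speech_features)   # pre-trained speech recognition model
    return logits.argmax(dim=-1)      # recognition text as token ids
```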
12. A speech recognition model training apparatus, comprising:
a training data set acquisition module, configured to acquire a training data set, wherein the training data set comprises a first speech sequence without labeled text and a second speech sequence with labeled text;
a coding network coding module, configured to input the first speech sequence into an initialized coding network to obtain first coding features of speech units in the first speech sequence and a content feature of a specified speech unit;
a coding feature prediction module, configured to predict second coding features of speech units after the specified speech unit according to the content feature;
a coding network training module, configured to calculate a contrast coding loss from the first coding features and the second coding features of the speech units after the specified speech unit, so as to train the coding network;
and a coding network and decoding network training module, configured to input, after the coding network is trained, the second speech sequence into the coding network, so as to train the coding network and the initialized decoding network, the trained coding network and decoding network serving as a speech recognition model.
13. A speech recognition apparatus, comprising:
a speech data acquisition module, configured to acquire speech data to be recognized;
a speech recognition module, configured to input the speech data to be recognized into a pre-trained speech recognition model to obtain a recognition text;
wherein the speech recognition model is trained by the speech recognition model training method of any one of claims 1-10.
14. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech recognition model training method of any one of claims 1-10, and/or the speech recognition method of claim 11.
15. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out a speech recognition model training method as claimed in any one of claims 1 to 10 and/or a speech recognition method as claimed in claim 11.
CN202010961964.8A 2020-09-14 2020-09-14 Speech recognition model training method, speech recognition method and device Active CN112086087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010961964.8A CN112086087B (en) 2020-09-14 2020-09-14 Speech recognition model training method, speech recognition method and device

Publications (2)

Publication Number Publication Date
CN112086087A true CN112086087A (en) 2020-12-15
CN112086087B CN112086087B (en) 2024-03-12

Family

ID=73737784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010961964.8A Active CN112086087B (en) 2020-09-14 2020-09-14 Speech recognition model training method, speech recognition method and device

Country Status (1)

Country Link
CN (1) CN112086087B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336884A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Cold fusing sequence-to-sequence models with language models
US20190304480A1 (en) * 2018-03-29 2019-10-03 Ford Global Technologies, Llc Neural Network Generative Modeling To Transform Speech Utterances And Augment Training Data
CN110570845A (en) * 2019-08-15 2019-12-13 武汉理工大学 Voice recognition method based on domain invariant features
CN110648658A (en) * 2019-09-06 2020-01-03 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111128137A (en) * 2019-12-30 2020-05-08 广州市百果园信息技术有限公司 Acoustic model training method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113270090A (en) * 2021-05-19 2021-08-17 平安科技(深圳)有限公司 Combined model training method and device based on ASR model and TTS model
CN113539246A (en) * 2021-08-20 2021-10-22 北京房江湖科技有限公司 Speech recognition method and device
CN114171013A (en) * 2021-12-31 2022-03-11 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN114783446A (en) * 2022-06-15 2022-07-22 北京信工博特智能科技有限公司 Voice recognition method and system based on contrast predictive coding
CN114783446B (en) * 2022-06-15 2022-09-06 北京信工博特智能科技有限公司 Voice recognition method and system based on contrast predictive coding

Also Published As

Publication number Publication date
CN112086087B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN110941945B (en) Language model pre-training method and device
CN112086087B (en) Speech recognition model training method, speech recognition method and device
US20180357225A1 (en) Method for generating chatting data based on artificial intelligence, computer device and computer-readable storage medium
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN111428470B (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN112633007B (en) Semantic understanding model construction method and device and semantic understanding method and device
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN116578688A (en) Text processing method, device, equipment and storage medium based on multiple rounds of questions and answers
CN117876940B (en) Video language task execution and model training method, device, equipment and medium thereof
CN110377778A (en) Figure sort method, device and electronic equipment based on title figure correlation
CN113761946B (en) Model training and data processing method and device, electronic equipment and storage medium
CN110188158A (en) Keyword and topic label generating method, device, medium and electronic equipment
CN112307179A (en) Text matching method, device, equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN111475635A (en) Semantic completion method and device and electronic equipment
CN109829040B (en) Intelligent conversation method and device
CN111738791A (en) Text processing method, device, equipment and storage medium
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN116306663B (en) Semantic role labeling method, device, equipment and medium
CN112183062A (en) Spoken language understanding method based on alternate decoding, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant