CN112086087A - Speech recognition model training method, speech recognition method and device - Google Patents

Speech recognition model training method, speech recognition method and device

Info

Publication number
CN112086087A
Authority
CN
China
Prior art keywords
coding
network
voice
unit
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010961964.8A
Other languages
Chinese (zh)
Other versions
CN112086087B (en)
Inventor
唐浩雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202010961964.8A
Publication of CN112086087A
Application granted
Publication of CN112086087B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention discloses a speech recognition model training method, a speech recognition method and a speech recognition device, which include the following steps: acquiring a first voice sequence without labeled text and a second voice sequence with labeled text; inputting the first voice sequence into a coding network to obtain a first coding feature of each voice unit in the first voice sequence and a content feature of a specified voice unit; predicting a second coding feature of the voice unit subsequent to the specified voice unit based on the content feature; calculating a contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network; and, after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the decoding network together. Because the coding network is trained by computing the contrast coding loss from the first and second coding features, it can be trained with unlabeled data, which reduces the amount of labeled data required and the cost of acquiring training data.

Description

Speech recognition model training method, speech recognition method and device
Technical Field
Embodiments of the present invention relate to the field of speech recognition technologies, and in particular, to a speech recognition model training method, a speech recognition method, a speech recognition model training apparatus, a speech recognition apparatus, an electronic device, and a storage medium.
Background
On a live broadcast platform, the content of a large number of live broadcast rooms often needs to be supervised. The supervised objects include images and voice, and the voice in a live broadcast mainly comes from the host's speech. For the supervision of speech content, the speech is usually recognized as text, and the text is then screened.
In the prior art, speech is input into a trained speech recognition model to obtain the corresponding text. The speech recognition model includes a coding network and a decoding network: the coding network encodes the input speech to obtain speech features, and the decoding network decodes the encoded speech features to obtain text. When a speech recognition model is trained, both the encoding network and the decoding network need to be trained, and a loss function must be calculated for each. Specifically, labels are obtained by annotating speech data, and the decoding network and encoding network are trained with the labeled speech data: the loss must be calculated from the labels of the training data while training the encoding network, and again while training the decoding network and encoding network together. That is, the whole training process relies on a large amount of labeled speech data, so a large amount of unlabeled speech data cannot be utilized, which increases the cost of acquiring training data.
Disclosure of Invention
The embodiment of the invention provides a speech recognition model training method, a speech recognition method, a speech recognition device, an electronic device and a storage medium, and aims to solve the problem that training data is costly because the existing speech recognition model training process depends on labeled data throughout, so that a large amount of unlabeled data cannot be used.
In a first aspect, an embodiment of the present invention provides a method for training a speech recognition model, including:
acquiring a training data set, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text;
inputting the first voice sequence into an initialized coding network to obtain a first coding characteristic of a voice unit in the first voice sequence and a content characteristic of a specified voice unit;
predicting a second encoding characteristic of a speech unit subsequent to the specified speech unit based on the content characteristic;
calculating a contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network;
after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the initialized decoding network, wherein the trained coding network and decoding network are used as voice recognition models.
In a second aspect, an embodiment of the present invention provides a speech recognition method, including:
acquiring voice data to be recognized;
inputting the voice data to be recognized into a pre-trained voice recognition model to obtain a recognition text;
the speech recognition model is trained by the speech recognition model training method according to any embodiment of the invention.
In a third aspect, an embodiment of the present invention provides a speech recognition model training apparatus, including:
the training data set acquisition module is used for acquiring a training data set, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text;
the coding network coding module is used for inputting the first voice sequence into the initialized coding network to obtain a first coding characteristic of a voice unit in the first voice sequence and a content characteristic of a specified voice unit;
the coding feature prediction module is used for predicting a second coding feature of the voice unit after the specified voice unit according to the content feature;
the coding network training module is used for calculating contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network;
and the coding network and decoding network training module is used for inputting the second voice sequence into the coding network after the coding network is trained so as to train the coding network and the initialized decoding network, and the trained coding network and decoding network are used as voice recognition models.
In a fourth aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
the voice data to be recognized acquisition module is used for acquiring voice data to be recognized;
the voice recognition module is used for inputting the voice data to be recognized into a pre-trained voice recognition model to obtain a recognition text;
the speech recognition model is trained by the speech recognition model training method according to any embodiment of the invention.
In a fifth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the speech recognition model training method and/or the speech recognition method of any of the embodiments of the present invention.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a speech recognition model training method and/or a speech recognition method according to any embodiment of the present invention.
The speech recognition model includes a coding network and a decoding network. When the coding network is trained, a first speech sequence without labeled text is input into the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence and the content feature of a specified speech unit; after the second coding feature of a speech unit following the specified speech unit is predicted from the content feature, a contrast coding loss is calculated from the first coding feature and the second coding feature of that speech unit to train the coding network; finally, the coding network and the decoding network are trained with a second speech sequence with labeled text to obtain the final speech recognition model. Because the second coding feature is predicted from the content feature during coding-network training, and the network parameters of the coding network are adjusted by computing the contrast coding loss from the first and second coding features, the first speech sequence does not need labeled text. A large amount of speech data without labeled text can therefore be used as first speech sequences to train the coding network, reducing the amount of text-labeled training data required to train a speech recognition model and reducing the cost of training data.
Drawings
FIG. 1 is a schematic diagram of a speech recognition model in the prior art;
FIG. 2 is a flowchart illustrating steps of a method for training a speech recognition model according to an embodiment of the present invention;
FIG. 3A is a flowchart illustrating steps of a method for training a speech recognition model according to a second embodiment of the present invention;
FIG. 3B is a schematic diagram of an encoding network of an embodiment of the present invention;
FIG. 4 is a flowchart illustrating steps of a speech recognition method according to a third embodiment of the present invention;
FIG. 5 is a block diagram of a speech recognition model training apparatus according to a fourth embodiment of the present invention;
FIG. 6 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures. The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
FIG. 1 is a diagram of a speech recognition model in the prior art.
As shown in FIG. 1, the speech recognition model is an end-to-end neural network that generally comprises an encoding network (Encoder), an alignment network (CTC), and an attention decoder (Attention-Decoder). In FIG. 1, O_n is the input speech signal, and the blocks in the Encoder are its neural network layers. After passing through the Encoder, the speech signal O_n yields the hidden feature h_n. The hidden feature h_n serves as the input of the CTC branch to compute the CTC loss function together with the recognized character y_n; computing the CTC loss requires comparing the recognized character y_n with the labeled text of the speech signal O_n. The hidden feature h_n also serves as the input of the attention decoder, which decodes it into the character y_n, and the attention (ATT) loss function is likewise computed by comparing against the labeled text of O_n. Therefore, when a speech recognition model is trained in the prior art, both the encoding network and the decoding network compute their losses from the labeled text of the speech signal, which undoubtedly requires a large amount of speech data with labeled text and increases the cost of training data. To solve this problem, the first and second embodiments of the invention provide the following speech recognition model training method.
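For orientation, the following is a minimal sketch of how such a joint CTC/attention objective is typically computed; the weighting factor, tensor shapes, and module names are illustrative assumptions and are not taken from the patent.

```python
import torch
import torch.nn as nn

# Hedged sketch of the prior-art objective described above: both the CTC
# branch and the attention decoder compute their losses against the labeled
# text y_n of the speech signal O_n. The 0.3 weight is an assumption.
class HybridCTCAttentionLoss(nn.Module):
    def __init__(self, blank_id: int = 0, ctc_weight: float = 0.3):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.att = nn.CrossEntropyLoss()
        self.ctc_weight = ctc_weight

    def forward(self, ctc_log_probs, att_logits, labels, input_lens, label_lens):
        # ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
        # att_logits:    (B, L, V) logits from the attention decoder
        # labels:        (B, L)    the annotated text, required by BOTH losses
        loss_ctc = self.ctc(ctc_log_probs, labels, input_lens, label_lens)
        loss_att = self.att(att_logits.transpose(1, 2), labels)
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```

The point of the sketch is that every term depends on `labels`, which is exactly the dependence on annotated text that the following embodiments remove for the encoder.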
Example one
Fig. 2 is a flowchart of the steps of a method for training a speech recognition model according to an embodiment of the present invention. The method is applicable to training a speech recognition model and can be executed by the speech recognition model training apparatus according to an embodiment of the present invention, which can be implemented in hardware or software and integrated in the electronic device according to an embodiment of the present invention. Specifically, as shown in fig. 2, the method may include the following steps:
s201, a training data set is obtained, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text.
Specifically, a speech sequence may be an ordered sequence of speech units connected in time order. In practical application, any speech data can be acquired and divided into a plurality of speech units according to a preset duration to obtain a speech sequence. For each speech sequence, if a corresponding text is labeled, it is a second speech sequence with labeled text; if no text is labeled, it is a first speech sequence without labeled text, where the text is text that expresses the semantics of the speech sequence.
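As an illustration, a minimal sketch of dividing raw audio into fixed-duration speech units follows; the 16 kHz sample rate and 25 ms unit duration are illustrative assumptions, not values from the patent.

```python
import numpy as np

def split_into_units(waveform: np.ndarray, sample_rate: int = 16000,
                     unit_ms: int = 25) -> list[np.ndarray]:
    """Cut a 1-D waveform into consecutive fixed-duration speech units,
    kept in time order so the sequence structure is preserved."""
    unit_len = int(sample_rate * unit_ms / 1000)  # samples per speech unit
    n_units = len(waveform) // unit_len           # trailing remainder dropped
    return [waveform[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]
```

A sequence with a corresponding transcript would then serve as a second speech sequence; one without would serve as a first speech sequence.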
S202, inputting the first voice sequence into the initialized coding network to obtain the first coding feature of the voice unit in the first voice sequence and the content feature of the appointed voice unit.
In the embodiment of the invention, the voice recognition model comprises a coding network and a decoding network, wherein the coding network codes an input voice sequence, extracts the coding characteristics of the voice sequence, and the decoding network decodes the coding characteristics to obtain a corresponding text.
In particular, in the embodiment of the present invention, the coding network includes a primary coding network and a secondary coding network. The first speech sequence is input into the primary coding network to obtain the first coding feature of each speech unit in the first speech sequence. For each speech unit, the first coding feature of that unit and the state quantity of the previous speech unit can be input into the secondary coding network to obtain the content feature corresponding to the unit. The specified speech unit may be a speech unit other than the first and last speech units in the first speech sequence; the state quantity may be the state of the secondary coding network after it encodes a given speech unit; and the content feature of the specified speech unit may be the result of the secondary coding network encoding the first coding feature of the specified speech unit together with the first coding features of the speech units before it.
And S203, predicting the second coding characteristics of the voice unit behind the appointed voice unit according to the content characteristics.
In the embodiment of the invention, the primary coding network encodes the first speech sequence with a preset step size, so several speech units are encoded at a time. After the first speech sequence has been encoded for one step, the last speech unit within that step can be taken as the specified speech unit, and the second coding features of the speech units in the next step can be predicted from the content feature of the specified speech unit. In one example, a linear matrix may be set, and the content feature of the specified speech unit is multiplied by this matrix to obtain the second coding feature of a speech unit in the next step; that is, the second coding feature of a speech unit in the next step is the second coding feature of a speech unit after the specified speech unit.
S204, calculating contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network.
Specifically, for each first speech unit after the specified speech unit, its first coding feature is obtained by encoding it with the primary coding network, and its second coding feature can be predicted from the content feature of the specified speech unit. A positive sample pair can be constructed from the first and second coding features of the first speech unit, and a plurality of negative sample pairs can be constructed from the first coding features of a plurality of second speech units together with the second coding feature of the first speech unit. The similarity of the two coding features in each positive and negative sample pair is then calculated, and the contrast coding loss rate of the first speech unit is calculated from these similarities. Finally, the average of the contrast coding loss rates of the several first speech units after the specified speech unit is calculated to obtain the loss rate of the current round of training, and whether the loss rate is smaller than a preset threshold is judged. If so, training of the coding network stops; if not, the network parameters of the coding network are adjusted according to the loss rate, and the process returns to the step of inputting the first speech sequence into the primary coding network to obtain the first coding feature of each speech unit. The trained coding network is obtained once the loss rate is smaller than the preset threshold.
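A minimal sketch of this contrastive loss computation follows, assuming a feature dimension, a number of prediction steps, and dot-product similarity; these specifics are illustrative assumptions rather than the patent's prescription.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCodingLoss(nn.Module):
    """Hedged sketch: one linear matrix W_k per prediction offset k."""
    def __init__(self, feat_dim: int = 256, max_k: int = 3):
        super().__init__()
        self.predictors = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim, bias=False) for _ in range(max_k))

    def forward(self, h: torch.Tensor, c_t: torch.Tensor, t: int) -> torch.Tensor:
        # h:   (N, D) first coding features h_1..h_N from the primary network
        # c_t: (D,)   content feature of the specified speech unit x_t
        losses = []
        for k, w_k in enumerate(self.predictors, start=1):
            if t + k >= h.size(0):
                break
            pred = w_k(c_t)                 # second coding feature W_k c_t
            scores = h @ pred               # similarity of W_k c_t to every h_j
            # (h_{t+k}, W_k c_t) is the positive pair; every other unit forms a
            # negative pair, so this reduces to a softmax cross-entropy at t+k.
            target = torch.tensor([t + k])
            losses.append(F.cross_entropy(scores.unsqueeze(0), target))
        return torch.stack(losses).mean()   # average over the first speech units
```

Nothing in this loss touches a transcript, which is why the first speech sequence may remain unlabeled.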
S205, after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the initialized decoding network, wherein the trained coding network and decoding network are used as voice recognition models.
After the coding network is trained, the coding network and the decoding network are connected to form a speech recognition model. In one example, where the coding network includes a primary coding network and a secondary coding network, the output layer of the primary coding network can be connected to the input layer of the decoding network to form the speech recognition model. When the speech recognition model is trained, a second speech sequence is input to the input layer of the primary coding network to obtain the predicted text after the decoding network decodes, the loss rate is calculated from the predicted text and the labeled text of the second speech sequence, and the network parameters of the primary coding network and the decoding network are adjusted according to the loss rate until the loss rate is smaller than a preset value.
The speech recognition model includes a coding network and a decoding network. When the coding network is trained, a first speech sequence without labeled text is input into the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence and the content feature of a specified speech unit; after the second coding feature of a speech unit following the specified speech unit is predicted from the content feature, a contrast coding loss is calculated from the first and second coding features of that speech unit to train the coding network; finally, the coding network and the decoding network are trained with a second speech sequence with labeled text to obtain the final speech recognition model. Because the second coding feature is predicted from the content feature during coding-network training, and the coding network is trained by computing the contrast coding loss from the first and second coding features, the first speech sequence does not need labeled text; a large amount of speech data without labeled text can be used as first speech sequences, reducing the amount of text-labeled training data required to train a speech recognition model and reducing the cost of training data.
Example two
Fig. 3A is a flowchart of steps of a speech recognition model training method according to a second embodiment of the present invention, which is optimized based on the first embodiment of the present invention, and specifically, as shown in fig. 3A, the speech recognition model training method according to the second embodiment of the present invention may include the following steps:
s301, a training data set is obtained, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text.
Specifically, the first speech sequence is not labeled with text expressing its semantics, while the second speech sequence is labeled with text expressing its semantics; the training data set may contain multiple first speech sequences and multiple second speech sequences.
S302, inputting the first voice sequence into a primary coding network of the initialized coding network to obtain a first coding feature of each voice unit in the first voice sequence.
In the embodiment of the invention, the coding network comprises a primary coding network and a secondary coding network, wherein the primary coding network codes the voice units in the first voice sequence to obtain the first coding characteristics, and the secondary coding network codes the voice units by taking the first coding characteristics and the state quantity as input to obtain the content characteristics.
Fig. 3B is a schematic diagram of the coding network. In fig. 3B, X = {x_1, x_2, ..., x_N} is the first speech sequence, f_enc is the primary coding network, f_ar is the secondary coding network, h_t is the first coding feature, s_t is the state quantity, and c_t is the content feature.

When the first speech sequence X = {x_1, x_2, ..., x_N} is input into the primary coding network f_enc, the primary coding network f_enc encodes the speech units in the first speech sequence X with a preset step size to obtain the first coding feature of each speech unit, namely:

h_t = f_enc(x_t)
In one example, the primary coding network f_enc may be a VGGNet (Visual Geometry Group Network); of course, the primary coding network may also be another neural network, which is not limited in the embodiment of the present invention.
S303, aiming at each voice unit, inputting the first coding feature of the voice unit and the state quantity of the previous voice unit of the voice unit into a secondary coding network of the coding network to obtain the content feature of the specified voice unit.
Specifically, as shown in fig. 3B, for each speech unit x_t, the first coding feature h_t of x_t and the state quantity s_{t-1} are input into the secondary coding network f_ar to obtain the content feature c_t corresponding to the speech unit x_t, namely:

c_t = f_ar(h_t, s_{t-1})
As shown in fig. 3B, each speech unit x_t corresponds to a content feature c_t. The specified speech unit may be the last speech unit among those covered by the current step when the speech units in the first speech sequence X are encoded with the preset step size. As shown in fig. 3B, the step size is 4 and the primary coding network f_enc currently encodes the speech units x_{t-3}, x_{t-2}, x_{t-1}, x_t, so the specified speech unit is x_t; of course, the specified speech unit may be any speech unit in the first speech sequence.
In one example, the secondary coding network f_ar may be an RNN (Recurrent Neural Network), such as an LSTM (Long Short-Term Memory network); of course, it may also be another neural network, which is not limited in the embodiment of the present invention.
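A minimal sketch of the two-stage encoder follows; the feature sizes are assumptions, and a small fully connected stack stands in for the VGGNet mentioned above.

```python
import torch
import torch.nn as nn

class PrimaryEncoder(nn.Module):
    """f_enc: maps each speech unit x_t to its first coding feature h_t.
    A small MLP stands in here for the VGGNet named in the text."""
    def __init__(self, in_dim: int = 80, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, x):                  # x: (B, N, in_dim) speech units
        return self.net(x)                 # h: (B, N, feat_dim)

class SecondaryEncoder(nn.Module):
    """f_ar: consumes h_t together with the previous state s_{t-1} and
    emits the content feature c_t, realized here as an LSTM."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)

    def forward(self, h, state=None):
        c, state = self.rnn(h, state)      # c: (B, N, feat_dim) content features
        return c, state
```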
S304, multiplying the content feature by a preset linear matrix to obtain the second coding feature of a speech unit after the specified speech unit.

Specifically, a linear matrix W_k is initialized; the linear matrix W_k is a network parameter to be adjusted during training. The second coding feature of a speech unit after the specified speech unit is W_k c_t: with the specified speech unit x_t and the speech unit x_{t+k} after it, the predicted second coding feature of x_{t+k} is

ĥ_{t+k} = W_k c_t
S305, calculating the contrast coding loss rate of each first voice unit after the specified voice unit by using the first coding characteristics and the second coding characteristics of the first voice unit and the first coding characteristics of a plurality of second voice units except the first voice unit.
In an optional embodiment of the present invention, for each first speech unit after the specified speech unit, the first and second coding features of the first speech unit form a positive sample pair, and the second coding feature of the first speech unit together with the first coding features of a plurality of second speech units other than the first speech unit form a plurality of negative sample pairs. The similarity between the two coding features in the positive sample pair is calculated to obtain a first similarity, the similarities between the two coding features in the negative sample pairs are calculated to obtain second similarities, and the contrast coding loss rate of the first speech unit is calculated from the first similarity and the plurality of second similarities.
Specifically, for the specified speech unit x_t, each first speech unit after it is x_{t+k}; its first coding feature after encoding by the primary coding network f_enc is h_{t+k}, and its second coding feature predicted from the content feature c_t is ĥ_{t+k} = W_k c_t.

Then, for each first speech unit x_{t+k}, the positive sample pair is (h_{t+k}, W_k c_t) and the negative sample pairs are (h_j, W_k c_t), where h_j is the first coding feature of a second speech unit x_j other than the first speech unit x_{t+k}. The contrast coding loss rate of the first speech unit x_{t+k} can then be calculated by the following formula:

L_N = -E_X [ log( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]    (1)

where L_N is the contrast coding loss rate, the first speech sequence is X = {x_1, x_2, ..., x_N}, t is the sequence number of the specified speech unit, and t + k is the sequence number of the first speech unit after the specified speech unit.

In formula (1), f_k(x_{t+k}, c_t) = exp(h_{t+k}^T W_k c_t) represents the similarity between the first coding feature h_{t+k}, obtained after the primary coding network f_enc encodes the first speech unit x_{t+k}, and the second coding feature ĥ_{t+k} = W_k c_t predicted from the content feature c_t of the specified speech unit x_t, with W_k being the linear matrix; f_k(x_j, c_t) = exp(h_j^T W_k c_t) represents the similarity between the first coding feature h_j, obtained after the primary coding network encodes the second speech unit x_j, and the predicted second coding feature, where x_j is a second speech unit in the first speech sequence X other than the first speech unit x_{t+k}.
S306, calculating the average value of the contrast coding loss rates of the first voice units to obtain the loss rate.
In practical application, there may be a plurality of first speech units after the specified speech unit. After the contrast coding loss rate of each first speech unit is calculated, the average of these contrast coding loss rates may be taken as the loss rate of the current training iteration of the coding network.
In order that those skilled in the art may more clearly understand the process of calculating the loss rate according to the embodiment of the present invention, an example is given below in conjunction with fig. 3B:
As shown in fig. 3B, the step size of the primary coding network f_enc is 4, i.e. 4 speech units can be encoded at a time. The current step encodes the speech units x_{t-3}, x_{t-2}, x_{t-1}, x_t, so the specified speech unit is determined to be x_t. The first coding features corresponding to x_{t-3}, x_{t-2}, x_{t-1}, x_t are h_{t-3}, h_{t-2}, h_{t-1}, h_t, and the corresponding content features are c_{t-3}, c_{t-2}, c_{t-1}, c_t. The first speech units after the specified speech unit x_t are x_{t+1}, x_{t+2}, x_{t+3}, with first coding features h_{t+1}, h_{t+2}, h_{t+3}; from the content feature c_t, the second coding features ĥ_{t+1} = W_1 c_t, ĥ_{t+2} = W_2 c_t, ĥ_{t+3} = W_3 c_t of x_{t+1}, x_{t+2}, x_{t+3} are predicted respectively. Then, for each first speech unit (x_{t+1}, x_{t+2}, x_{t+3}), the positive and negative sample pairs are as follows:

For the first speech unit x_{t+1}: the positive sample pair is (h_{t+1}, W_1 c_t), and the negative sample pairs are (h_{t+2}, W_1 c_t), (h_{t+3}, W_1 c_t), etc.;

For the first speech unit x_{t+2}: the positive sample pair is (h_{t+2}, W_2 c_t), and the negative sample pairs are (h_{t+1}, W_2 c_t), (h_{t+3}, W_2 c_t), etc.;

For the first speech unit x_{t+3}: the positive sample pair is (h_{t+3}, W_3 c_t), and the negative sample pairs are (h_{t+1}, W_3 c_t), (h_{t+2}, W_3 c_t), etc.

For the first speech unit x_{t+1}, the loss rate is calculated as follows: the similarity of the two coding features in the positive sample pair (h_{t+1}, W_1 c_t) is calculated to obtain the first similarity, and the similarities of the two coding features in each negative sample pair are calculated to obtain a plurality of second similarities; the sum of the second similarities gives the second-similarity sum, the first similarity is used as the numerator of formula (1) and the second-similarity sum as its denominator, yielding the contrast coding loss rate of the first speech unit x_{t+1}. Averaging the contrast coding loss rates of x_{t+1}, x_{t+2}, x_{t+3} then gives the loss rate of the current round of training.

As can be seen from formula (1), the closer the two coding features in a positive sample pair are, the larger the value of f_k(x_{t+k}, c_t); and the smaller the value of the negative sample pair term f_k(x_j, c_t), the smaller the contrast coding loss rate L_N. The purpose of training the coding network is to optimize the network parameters of the primary coding network and the secondary coding network and the linear matrix used for prediction, so that the predicted second coding feature of each speech unit is close to its first coding feature, and the contrast coding loss rate is finally minimized.
And S307, judging whether the loss rate is smaller than a preset threshold value.
Specifically, after each round of training is completed, whether the precision of the coding network is sufficient is determined from the loss rate. If the loss rate is smaller than the preset threshold, the precision of the coding network is sufficient, S308 may be executed, and training of the coding network stops; otherwise, the coding network needs further training, and S309 is executed.
And S308, stopping training the coding network.
That is, when the loss rate is smaller than the preset threshold, the network parameters of the primary coding network in the coding network are saved, and then S310-S311 are executed.
S309, adjusting the network parameters of the coding network according to the loss rate.
Specifically, when the loss rate is not smaller than the preset threshold, a gradient is calculated from the loss rate, gradient descent is performed on the network parameters of the primary and secondary coding networks and on the preset linear matrix, and the process returns to S302 to continue training the coding network until the loss rate is smaller than the preset threshold.
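A hedged sketch of this pre-training loop follows, reusing the hypothetical modules sketched above; the optimizer, learning rate, stopping threshold, and batching are all illustrative assumptions.

```python
import torch

def pretrain_encoder(primary_enc, secondary_enc, loss_fn, unlabeled_batches,
                     threshold: float = 0.1, lr: float = 1e-4, step: int = 4):
    # loss_fn is the ContrastiveCodingLoss sketched earlier; its W_k matrices
    # are trained jointly with both coding networks, matching S309.
    params = (list(primary_enc.parameters()) + list(secondary_enc.parameters())
              + list(loss_fn.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    while True:
        total, n = 0.0, 0
        for x in unlabeled_batches():          # first speech sequences, no labels
            h = primary_enc(x)                 # (B, N, D) first coding features
            c, _ = secondary_enc(h)            # (B, N, D) content features
            t = h.size(1) - 1 - step           # specified unit: last of a step
            loss = loss_fn(h[0], c[0, t], t)   # contrast coding loss rate
            opt.zero_grad(); loss.backward(); opt.step()   # gradient descent
            total, n = total + loss.item(), n + 1
        if total / n < threshold:              # stop once the loss rate is small
            return
```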
According to the embodiment of the invention, pre-training the coding network in this way lets it learn temporal information, and the loss rate is calculated without labels, so the training data need not be annotated. A large amount of unlabeled data can thus be used to train the coding network, reducing the amount of labeled data needed to train the speech recognition model and reducing the cost of training data.
S310, adopting a primary coding network of the coding network and an initialized decoding network to construct a voice recognition model.
After the coding network is trained, the primary coding network in the coding network can be connected with the initialized decoding network to form a speech recognition model. Specifically, the output layer of the primary coding network can be connected to the input layer of the initialized decoding network to obtain the speech recognition model, so that the first coding features output by the primary coding network are input into the decoding network and decoded to obtain the predicted text.
S311, inputting the second voice sequence into the voice recognition model, so as to train the primary coding network and the decoding network to obtain a trained voice recognition model.
After the primary coding network and the decoding network are connected to form a speech recognition model, the speech recognition model is trained globally. Specifically, a second speech sequence can be input into the primary coding network so that the decoding network outputs a predicted text; the loss rate is calculated from the predicted text and the labeled text of the second speech sequence, and whether the loss rate is smaller than a preset threshold is judged. If so, training of the primary coding network and the decoding network stops; if not, the network parameters of the primary coding network and the decoding network are adjusted according to the loss rate, and the process returns to the step of inputting the second speech sequence into the primary coding network to output the predicted text at the decoding network. Once the loss rate is smaller than the preset threshold, the finally trained speech recognition model, formed by the trained primary coding network and decoding network, is obtained.
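A hedged sketch of this global training stage follows; the decoder interface, cross-entropy loss, and threshold are illustrative assumptions, since the patent only requires comparing the predicted text against the labeled text and adjusting both networks.

```python
import torch
import torch.nn.functional as F

def train_recognizer(primary_enc, decoder, labeled_batches,
                     threshold: float = 0.1, lr: float = 1e-4):
    # primary_enc carries over its pre-trained parameters; decoder is the
    # initialized decoding network connected to the encoder's output layer.
    params = list(primary_enc.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    while True:
        total, n = 0.0, 0
        for x, labels in labeled_batches():    # second speech sequences + text
            feats = primary_enc(x)             # first coding features
            logits = decoder(feats)            # (B, L, V) predicted-text logits
            loss = F.cross_entropy(logits.transpose(1, 2), labels)
            opt.zero_grad(); loss.backward(); opt.step()
            total, n = total + loss.item(), n + 1
        if total / n < threshold:              # loss rate below preset threshold
            return
```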
In the embodiment of the invention, a first speech sequence without labeled text is input into the primary coding network of the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence; for each speech unit, the first coding feature of that unit and the state quantity of the previous speech unit are input into the secondary coding network of the coding network to obtain the content feature of the specified speech unit; the content feature is multiplied by a preset linear matrix to obtain the second coding feature of a speech unit after the specified speech unit; for each first speech unit after the specified speech unit, the contrast coding loss rate of the first speech unit is calculated using the first and second coding features of the first speech unit and the first coding features of a plurality of second speech units other than the first speech unit; the average of the contrast coding loss rates is calculated to obtain the loss rate, and the network parameters of the coding network are adjusted according to the loss rate. After the primary coding network is trained, the primary coding network and the decoding network form a speech recognition model, which is trained with a second speech sequence with labeled text. Because the second coding feature is predicted from the content feature during coding-network training, and the coding network is trained by computing the contrast coding loss from the first and second coding features, the first speech sequence needs no labeled text; a large amount of speech data without labeled text can be used as first speech sequences to train the coding network, reducing the amount of text-labeled training data required to train a speech recognition model and reducing the cost of training data.
EXAMPLE III
Fig. 4 is a flowchart of the steps of a speech recognition method according to a third embodiment of the present invention. The speech recognition method is applicable to recognizing speech as text and may be executed by the speech recognition apparatus according to an embodiment of the present invention, which may be implemented in hardware or software and integrated in the electronic device according to an embodiment of the present invention. Specifically, as shown in fig. 4, the method may include the following steps:
s401, voice data to be recognized are obtained.
In the embodiment of the present invention, the voice data to be recognized may be any voice data that needs to be recognized as text, such as voice data from a short video or live broadcast platform, or from a movie or television show. The language of the voice data to be recognized may be Chinese, English or another language, or even a local dialect.
S402, inputting the voice data to be recognized into a pre-trained voice recognition model to obtain a recognition text.
The embodiment of the present invention may train the speech recognition model by using the speech recognition model training method provided in the first embodiment or the second embodiment, where the speech recognition model may obtain the recognition text corresponding to the speech data after inputting the speech data, and the speech recognition model training method may refer to the first embodiment and the second embodiment, and will not be described in detail herein.
After the voice data to be recognized is obtained, preprocessing such as denoising and enhancement can be performed on it; the preprocessed voice data is divided into a plurality of voice segments to obtain a voice sequence, and the voice sequence is input into the trained voice recognition model. The voice sequence is encoded by the coding network of the voice recognition model to obtain coding features, which are then decoded by the decoding network to obtain the recognized text. The recognized text can then be screened to determine whether violating content exists, thereby achieving voice supervision.
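A minimal sketch of this inference flow follows; the placeholder preprocessing, direct use of raw units as features, and greedy decoding are illustrative assumptions reusing the hypothetical helpers sketched earlier.

```python
import numpy as np
import torch

def denoise(waveform: np.ndarray) -> np.ndarray:
    return waveform  # placeholder for real denoising/enhancement preprocessing

def recognize(waveform: np.ndarray, primary_enc, decoder, vocab: list) -> str:
    # split_into_units, primary_enc and decoder are the hypothetical helpers
    # sketched earlier in this document, not names from the patent.
    units = split_into_units(denoise(waveform))          # speech sequence
    # Feature extraction (e.g. filterbanks) is omitted here for brevity.
    x = torch.tensor(np.stack(units), dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        feats = primary_enc(x)                           # encoding network
        logits = decoder(feats)                          # decoding network
    ids = logits.argmax(dim=-1).squeeze(0).tolist()      # greedy decoding
    return "".join(vocab[i] for i in ids)                # recognized text
```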
In the embodiment of the invention, when the speech recognition model required for speech recognition is trained, a first speech sequence without labeled text is input into the initialized coding network to obtain the first coding feature of each speech unit in the first speech sequence and the content feature of a specified speech unit; after the second coding feature of a speech unit following the specified speech unit is predicted from the content feature, a contrast coding loss is calculated from the first and second coding features of that speech unit to train the coding network; finally, the coding network and decoding network are trained with a second speech sequence with labeled text to obtain the final speech recognition model. Because the second coding feature is predicted from the content feature during coding-network training, and the coding network is trained by computing the contrast coding loss from the first and second coding features, the first speech sequence needs no labeled text; a large amount of speech data without labeled text can be used as first speech sequences to train the coding network, reducing the amount of text-labeled training data required and the cost of training data.
Example four
Fig. 5 is a block diagram of a structure of a speech recognition model training apparatus according to a fourth embodiment of the present invention, and as shown in fig. 5, the speech recognition model training apparatus according to the fourth embodiment of the present invention may specifically include the following modules:
a training data set obtaining module 501, configured to obtain a training data set, where the training data set includes a first speech sequence without a labeled text and a second speech sequence with a labeled text;
a coding network coding module 502, configured to input the first voice sequence into an initialized coding network, so as to obtain a first coding feature of a voice unit in the first voice sequence and a content feature of an assigned voice unit;
an encoding feature prediction module 503, configured to predict, according to the content feature, a second encoding feature of the speech unit after the specified speech unit;
a coding network training module 504, configured to calculate a contrast coding loss according to the first coding feature and the second coding feature of the speech unit after the specified speech unit, so as to train the coding network;
and an encoding network and decoding network training module 505, configured to, after the encoding network is trained, input the second speech sequence into the encoding network to train the encoding network and the initialized decoding network, where the trained encoding network and decoding network serve as speech recognition models.
The speech recognition model training device provided by the embodiment of the invention can execute the speech recognition model training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE five
Fig. 6 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention, and as shown in fig. 6, the speech recognition apparatus according to the fifth embodiment of the present invention may specifically include the following modules:
a to-be-recognized voice data acquisition module 601, configured to acquire to-be-recognized voice data;
a speech recognition module 602, configured to input the speech data to be recognized into a pre-trained speech recognition model to obtain a recognition text;
the speech recognition model is trained by the speech recognition model training method according to any embodiment of the invention.
The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by the third embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE six
Referring to fig. 7, a schematic structural diagram of an electronic device in one example of the invention is shown. As shown in fig. 7, the electronic device may specifically include: a processor 701, a storage device 702, a display screen 703 with touch functionality, an input device 704, an output device 705, and a communication device 706. The number of the processors 701 in the electronic device may be one or more, and one processor 701 is taken as an example in fig. 7. The processor 701, the storage device 702, the display 703, the input device 704, the output device 705, and the communication device 706 of the electronic apparatus may be connected by a bus or other means, and fig. 7 illustrates an example of connection by a bus. The electronic device is used for executing the speech recognition model training method provided by any embodiment of the invention and/or the speech recognition method.
Embodiments of the present invention also provide a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the device to perform a speech recognition model training method and/or a speech recognition method as described in the above method embodiments.
It should be noted that, as for the embodiments of the apparatus, the electronic device, and the storage medium, since they are basically similar to the embodiments of the method, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the embodiments of the method.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious modifications, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (15)

1. A method for training a speech recognition model, comprising:
acquiring a training data set, wherein the training data set comprises a first voice sequence without a labeled text and a second voice sequence with a labeled text;
inputting the first voice sequence into an initialized coding network to obtain a first coding characteristic of a voice unit in the first voice sequence and a content characteristic of a specified voice unit;
predicting a second encoding characteristic of a speech unit subsequent to the specified speech unit based on the content characteristic;
calculating a contrast coding loss according to the first coding feature and the second coding feature of the voice unit after the specified voice unit so as to train the coding network;
after the coding network is trained, inputting the second voice sequence into the coding network to train the coding network and the initialized decoding network, wherein the trained coding network and decoding network are used as voice recognition models.
2. The method of claim 1, wherein the coding network comprises a primary coding network and a secondary coding network, and the inputting the first speech sequence into the initialized coding network to obtain the first coding characteristics of the speech units in the first speech sequence and the content characteristics of the specified speech units comprises:
inputting the first voice sequence into a primary coding network of the initialized coding network to obtain a first coding characteristic of each voice unit in the first voice sequence;
and for each voice unit, inputting the first coding characteristic of the voice unit and the state quantity of the previous voice unit of the voice unit into a secondary coding network of the coding network to obtain the content characteristic of the specified voice unit.
3. The method of claim 1, wherein predicting the second coding feature of the speech unit subsequent to the specified speech unit based on the content feature comprises:
and multiplying the content characteristics by a preset linear matrix to obtain a second coding characteristic of the voice unit after the appointed voice unit.
4. The method according to any of claims 1-3, wherein the coding network comprises a primary coding network and a secondary coding network, and wherein the calculating a contrast coding loss from the first coding features and the second coding features of the phonetic units after the specified phonetic unit to train the coding network comprises:
for each first voice unit after the specified voice unit, calculating a contrast coding loss rate of the first voice unit by using the first coding feature of the first voice unit, the second coding feature and first coding features of a plurality of second voice units except the first voice unit;
calculating the average value of the contrast coding loss rates of the first voice units to obtain a loss rate;
judging whether the loss rate is smaller than a preset threshold value or not;
if so, stopping training the coding network;
if not, adjusting the network parameters of the coding network according to the loss rate, and returning to the step of inputting the first voice sequence into the primary coding network of the coding network to obtain the first coding feature of each voice unit in the first voice sequence.
5. The method of claim 4, wherein the calculating, for each first speech unit subsequent to the specified speech unit, the contrast coding loss ratio for the first speech unit using the first coding feature of the first speech unit, the second coding feature, and the first coding features of a plurality of second speech units other than the first speech unit comprises:
for each first voice unit, adopting the first coding characteristics and the second coding characteristics of the first voice unit to form a positive sample opposite example;
adopting the second coding characteristics of the first voice unit and the first coding characteristics of a plurality of second voice units except the first voice unit to form a plurality of negative sample pairs;
calculating the similarity of the first coding feature and the second coding feature in the positive sample opposite example to obtain a first similarity;
calculating the similarity of the first coding feature and the second coding feature in the negative sample pair examples to obtain a second similarity;
and calculating the contrast coding loss rate of the first speech unit according to the first similarity and the plurality of second similarities.
6. The method of claim 5, wherein said calculating a contrast coding loss rate for the first speech unit based on the first similarity and a plurality of second similarities comprises:
calculating the contrast coding loss rate of the first speech unit by the following formula:

L_N = -E_X [ log( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]

wherein L_N is the contrast coding loss rate, and the first speech sequence is X = {x_1, x_2, ..., x_N}; t is the sequence number of the specified speech unit, and t + k is the sequence number of the speech unit following the specified speech unit; f_k(x_{t+k}, c_t) represents the similarity between the first coding feature h_{t+k}, obtained after the primary coding network encodes the first speech unit x_{t+k}, and the second coding feature of the first speech unit x_{t+k} obtained from the content feature c_t of the specified speech unit, W_k being the linear matrix; x_j is a second speech unit in the first speech sequence X other than the first speech unit x_{t+k}, and f_k(x_j, c_t) represents the similarity between the first coding feature obtained after the primary coding network encodes the second speech unit x_j and the second coding feature of the first speech unit x_{t+k} obtained from the content feature c_t of the specified speech unit.
7. The method of claim 4, wherein the adjusting the network parameters of the coding network according to the loss rate comprises:
adjusting the network parameters of the primary coding network and the secondary coding network, and the preset linear matrix, according to the loss rate.
8. The method according to any one of claims 1-3, wherein the coding network comprises a primary coding network, and wherein the inputting, after the coding network is trained, the second speech sequence into the coding network to train the coding network and the initialized decoding network, the trained coding network and decoding network serving as a speech recognition model, comprises:
constructing a speech recognition model from the primary coding network of the coding network and the initialized decoding network;
and inputting the second speech sequence into the speech recognition model to train the primary coding network and the decoding network, obtaining a trained speech recognition model.
9. The method of claim 8, wherein the constructing a speech recognition model from the primary coding network of the coding network and the initialized decoding network comprises:
connecting the output layer of the primary coding network to the input layer of the initialized decoding network to obtain the speech recognition model.
10. The method of claim 9, wherein inputting the second speech sequence into the speech recognition model to train the primary coding network and the decoding network to obtain a trained speech recognition model comprises:
inputting the second speech sequence into the primary coding network so that the decoding network outputs a predicted text;
calculating a loss rate using the predicted text and the labeled text of the second speech sequence;
determining whether the loss rate is smaller than a preset threshold;
if so, stopping training the primary coding network and the decoding network;
if not, adjusting the network parameters of the primary coding network and the decoding network according to the loss rate, and returning to the step of inputting the second speech sequence into the primary coding network so that the decoding network outputs a predicted text.
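As a rough sketch of claims 8-10, not the claimed implementation, the construction and fine-tuning stage might look as follows in PyTorch; the class name, the cross-entropy criterion, and the threshold are illustrative assumptions.

```python
# Sketch of claims 8-10: the output layer of the pre-trained primary coding
# network feeds the input layer of the initialized decoding network, and both
# are trained on labeled second speech sequences (names illustrative).
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    def __init__(self, primary_coding_network: nn.Module, decoding_network: nn.Module):
        super().__init__()
        self.encoder = primary_coding_network    # pre-trained by contrast coding
        self.decoder = decoding_network          # freshly initialized

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))     # logits over text tokens

def finetune(model, labeled_pairs, optimizer, threshold=0.1, max_steps=10_000):
    criterion = nn.CrossEntropyLoss()            # assumed predicted-vs-labeled loss
    for _ in range(max_steps):
        for x, labeled_text in labeled_pairs:    # second speech sequence + labeled text
            logits = model(x)                    # predicted text distribution
            loss_rate = criterion(logits, labeled_text)
            if loss_rate.item() < threshold:     # claim 10's stopping criterion
                return model
            optimizer.zero_grad()
            loss_rate.backward()                 # adjust encoder and decoder jointly
            optimizer.step()
    return model
```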
11. A speech recognition method, comprising:
acquiring speech data to be recognized;
inputting the speech data to be recognized into a pre-trained speech recognition model to obtain a recognition text;
wherein the speech recognition model is trained by the speech recognition model training method of any one of claims 1-10.
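A minimal sketch of the recognition flow in claim 11, reusing the SpeechRecognitionModel sketched above; greedy argmax decoding is an illustrative assumption rather than the patent's decoding scheme.

```python
# Illustrative recognition flow (PyTorch assumed; greedy decoding is a guess).
import torch

@torch.no_grad()
def recognize(model: torch.nn.Module, speech_features: torch.Tensor) -> torch.Tensor:
    """speech_features: acoustic features of the speech data to be recognized."""
    model.eval()
    logits = model(speech_features)   # pre-trained speech recognition model
    return logits.argmax(dim=-1)      # recognition text as token ids
```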
12. A speech recognition model training apparatus, comprising:
a training data set acquisition module, configured to acquire a training data set, wherein the training data set comprises a first speech sequence without labeled text and a second speech sequence with labeled text;
a coding network coding module, configured to input the first speech sequence into an initialized coding network to obtain first coding features of speech units in the first speech sequence and a content feature of a specified speech unit;
a coding feature prediction module, configured to predict second coding features of speech units after the specified speech unit according to the content feature;
a coding network training module, configured to calculate a contrast coding loss from the first coding features and the second coding features of the speech units after the specified speech unit, so as to train the coding network;
and a coding network and decoding network training module, configured to input, after the coding network is trained, the second speech sequence into the coding network, so as to train the coding network and the initialized decoding network, the trained coding network and decoding network serving as a speech recognition model.
13. A speech recognition apparatus, comprising:
a speech data acquisition module, configured to acquire speech data to be recognized;
a speech recognition module, configured to input the speech data to be recognized into a pre-trained speech recognition model to obtain a recognition text;
wherein the speech recognition model is trained by the speech recognition model training method of any one of claims 1-10.
14. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech recognition model training method of any one of claims 1-10, and/or the speech recognition method of claim 11.
15. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out a speech recognition model training method as claimed in any one of claims 1 to 10 and/or a speech recognition method as claimed in claim 11.
CN202010961964.8A 2020-09-14 2020-09-14 Speech recognition model training method, speech recognition method and device Active CN112086087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010961964.8A CN112086087B (en) 2020-09-14 2020-09-14 Speech recognition model training method, speech recognition method and device

Publications (2)

Publication Number Publication Date
CN112086087A true CN112086087A (en) 2020-12-15
CN112086087B CN112086087B (en) 2024-03-12

Family

ID=73737784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010961964.8A Active CN112086087B (en) 2020-09-14 2020-09-14 Speech recognition model training method, speech recognition method and device

Country Status (1)

Country Link
CN (1) CN112086087B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180336884A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Cold fusing sequence-to-sequence models with language models
US20190304480A1 (en) * 2018-03-29 2019-10-03 Ford Global Technologies, Llc Neural Network Generative Modeling To Transform Speech Utterances And Augment Training Data
CN110570845A (en) * 2019-08-15 2019-12-13 武汉理工大学 Voice recognition method based on domain invariant features
CN110648658A (en) * 2019-09-06 2020-01-03 北京达佳互联信息技术有限公司 Method and device for generating voice recognition model and electronic equipment
CN111063336A (en) * 2019-12-30 2020-04-24 天津中科智能识别产业技术研究院有限公司 End-to-end voice recognition system based on deep learning
CN111128137A (en) * 2019-12-30 2020-05-08 广州市百果园信息技术有限公司 Acoustic model training method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113270090A (en) * 2021-05-19 2021-08-17 平安科技(深圳)有限公司 Combined model training method and device based on ASR model and TTS model
CN113539246A (en) * 2021-08-20 2021-10-22 北京房江湖科技有限公司 Speech recognition method and device
CN114171013A (en) * 2021-12-31 2022-03-11 西安讯飞超脑信息科技有限公司 Voice recognition method, device, equipment and storage medium
CN114783446A (en) * 2022-06-15 2022-07-22 北京信工博特智能科技有限公司 Voice recognition method and system based on contrast predictive coding
CN114783446B (en) * 2022-06-15 2022-09-06 北京信工博特智能科技有限公司 Voice recognition method and system based on contrast predictive coding

Also Published As

Publication number Publication date
CN112086087B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN110941945B (en) Language model pre-training method and device
CN112086087B (en) Speech recognition model training method, speech recognition method and device
US20180357225A1 (en) Method for generating chatting data based on artificial intelligence, computer device and computer-readable storage medium
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114676234A (en) Model training method and related equipment
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
CN111428470B (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN112633007B (en) Semantic understanding model construction method and device and semantic understanding method and device
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN116578688A (en) Text processing method, device, equipment and storage medium based on multiple rounds of questions and answers
CN117876940B (en) Video language task execution and model training method, device, equipment and medium thereof
CN110377778A (en) Figure sort method, device and electronic equipment based on title figure correlation
CN113761946B (en) Model training and data processing method and device, electronic equipment and storage medium
CN110188158A (en) Keyword and topic label generating method, device, medium and electronic equipment
CN112307179A (en) Text matching method, device, equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN115810068A (en) Image description generation method and device, storage medium and electronic equipment
CN111475635A (en) Semantic completion method and device and electronic equipment
CN109829040B (en) Intelligent conversation method and device
CN111738791A (en) Text processing method, device, equipment and storage medium
CN110717316B (en) Topic segmentation method and device for subtitle dialog flow
CN116306663B (en) Semantic role labeling method, device, equipment and medium
CN112183062A (en) Spoken language understanding method based on alternate decoding, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant