CN114783446A - Voice recognition method and system based on contrast predictive coding - Google Patents

Voice recognition method and system based on contrast predictive coding

Info

Publication number
CN114783446A
CN114783446A (application CN202210670592.2A)
Authority
CN
China
Prior art keywords
data
voice
segment
time sequence
sequence data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210670592.2A
Other languages
Chinese (zh)
Other versions
CN114783446B (en)
Inventor
戴亦斌 (DAI Yibin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Technology Bote Intelligent Technology Co ltd
Original Assignee
Beijing Information Technology Bote Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Technology Bote Intelligent Technology Co ltd filed Critical Beijing Information Technology Bote Intelligent Technology Co ltd
Priority to CN202210670592.2A priority Critical patent/CN114783446B/en
Publication of CN114783446A publication Critical patent/CN114783446A/en
Application granted granted Critical
Publication of CN114783446B publication Critical patent/CN114783446B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voice recognition method and system based on contrast predictive coding, belonging to the technical field of voiceprint recognition and comprising the following steps: S1, collecting A voice files for each voice category and preprocessing each voice file to obtain PCM-coded voice time sequence data; S2, constructing a pairing data set of the voice time sequence data; S3, constructing a pairing fragment data set; S4, constructing an artificial neural network; S5, training a voice recognition network consisting of a first converter, a second converter and a one-dimensional convolutional neural network; and S6, performing voice recognition through the voice recognition network. The invention makes full use of the large amount of background-collected voice data that is insufficient within individual categories, treats the voice data as time sequence data, and performs end-to-end conversion directly, without extracting features from the voice time sequence data along the way.

Description

Voice recognition method and system based on contrast predictive coding
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a voice recognition method and system based on contrast predictive coding.
Background
As is well known, speech recognition usually requires collecting a large amount of speech data: for every background environment and for every semantic condition of the speech to be recognized (including different utterances and dialects), the number of data items must be sufficient. If not enough data is collected for speech uttered in a particular dialect (or with particular text semantics) against a particular background, the recognition model may fail under that condition, for example through reduced detection accuracy or an inability to recognize the speech at all. Traditional techniques address the problem as follows: features are first extracted with one of various feature-extraction methods similar to MFCC extraction, and the features are then classified to obtain a classification result. In that setting it is critical that the data in every category be sufficient and representative; if the data are insufficient or atypical, some of the features associated with a given category may be missing or distorted, which degrades the final classification result.
Disclosure of Invention
The technical purpose is as follows: the invention provides a voice recognition method and system based on contrast predictive coding. The method makes full use of the large amount of background-collected voice data that is insufficient within individual categories and treats the voice data as time sequence data, converting end to end directly without extracting intermediate spectral features. A certain number of fixed-duration segments are extracted at random from each voice, and each segment is divided into front data and rear data; when the front data is input, a first converter predicts the coding of the rear time sequence data, and when the rear data is input, a second converter predicts the front time sequence data. After the predicted data are combined, they are compared end to end, in pairs, directly against data to be detected of the same category (or of a different category), and finally end-to-end voice recognition is achieved according to the voice category labels.
Technical scheme
The first purpose of the invention is to provide a speech recognition method based on contrast predictive coding, which comprises the following steps:
s1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
S2, constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
s3, constructing a pairing fragment data set; the method specifically comprises the following steps:
for a first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and N*M quadruples (Sp, Ss, S', Y) are obtained, forming a pairing fragment data set of N*M items;
s4, constructing an artificial neural network; the method comprises the following specific steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
s5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network;
and S6, performing voice recognition through the voice recognition network.
Preferably, M0 is 128 or 256.
Preferably, S5 is specifically:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, leading the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
Preferably, S6 is specifically: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
It is a second object of the present invention to provide a speech recognition system based on contrast predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain voice time sequence data of PCM codes; a is a natural number greater than 1;
the paired data set construction module is used for constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
the pairing fragment data set construction module is used for constructing a pairing fragment data set; the method specifically comprises the following steps:
for the first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and N*M quadruples (Sp, Ss, S', Y) are obtained, forming the pairing fragment data set;
an artificial neural network construction module; the construction process comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains a voice recognition network consisting of the first converter, the second converter and a one-dimensional convolutional neural network;
and the recognition module is used for carrying out voice recognition through a voice recognition network.
Preferably, M0 is 128 or 256.
Preferably, the training process of the training module is as follows:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, leading the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
Preferably, the recognition process of the recognition module is as follows: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
A third object of the present invention is to provide an information data processing terminal for implementing the above-mentioned speech recognition method based on the comparative predictive coding.
It is a fourth object of the present invention to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above-described method for speech recognition based on contrast predictive coding.
The invention has the advantages and positive effects that:
the invention fully utilizes a large amount of insufficient voice data acquired by a background, takes the voice data as time sequence data, directly converts end to end without extracting the characteristics of the voice sequence data in the middle, randomly extracts a certain number of fragments with fixed time length from each voice, divides each fragment into front part data and back part data, realizes coding prediction of the back part sequence data through a first converter when the front data is input, realizes the sequence prediction of the front data through a second converter when the back part data is input, directly compares the predicted data with the data to be detected of the same type (or different types) end to end in pairs after combining, and finally realizes end to end voice recognition according to the requirements of voice category labels.
Drawings
FIG. 1 is a flow chart of the construction of a data set in a preferred embodiment of the present invention;
FIG. 2 is a flow chart of the construction of an artificial neural network in a preferred embodiment of the present invention;
FIG. 3 is a flow chart of a Transformer (Transformer) in a preferred embodiment of the present invention;
fig. 4 is a flow chart of speech recognition in a preferred embodiment of the present invention.
Detailed Description
For a further understanding of the contents, features and effects of the invention, reference will now be made to the following examples, which are to be read in connection with the accompanying drawings.
Referring to fig. 1 to 4, a speech recognition method based on contrast prediction coding includes:
s1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
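By way of illustration, the following is a minimal Python sketch of the preprocessing in step S1, assuming mono 16-bit PCM WAV files; the function name load_pcm_timeseries and the normalization to [-1, 1] are illustrative choices rather than part of the described method.

```python
import wave
import numpy as np

def load_pcm_timeseries(path):
    """Read a mono WAV file and return its PCM samples as a 1-D float array in [-1, 1].

    Assumes 16-bit PCM; any resampling or denoising preprocessing is omitted here.
    """
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    return samples / 32768.0
```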
S2, constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories; referring to fig. 1, the construction process of the pairing data set is specifically as follows:
in fig. 1, for the purpose of explaining the problem, three voice categories are taken as an example for detailed description, where the original voice time series data sample on the left side includes voice time series data samples of three voice categories, and the voice time series data of each voice category is distinguished by different padding;
firstly, extracting two pieces of voice time sequence data from voice time sequence data of the same voice category to pair to obtain similar pairs, and defining a label Y of the similar pairs as 0; extracting one piece of voice time sequence data from two pieces of voice time sequence data of different voice categories respectively to pair to obtain heterogeneous pairing, and defining a label Y of the heterogeneous pairing as 1;
then, according to the total pairing number N, the number of voice categories is defined as the category number k and the homogeneous sampling ratio is defined as α; to make the sampling fair, i.e. to give every voice category the same probability of being drawn, the constraint S1 + S2 = N/k is set, and the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are calculated:
S1 = α·N/k, S2 = (1 − α)·N/k
Finally, the homogeneous pairs, the heterogeneous pairs and the labels Y are combined into a data set comprising N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, and Y is the label.
The present invention constructs the data set from these voice time sequence data by pairing with with-replacement sampling:
Within the voice time sequence data samples of one voice category, two samples are drawn each time to complete one pairing, and the pair is labelled Y = 0.
For heterogeneous pairing, one voice category is chosen at random and one sample is drawn from its voice time sequence data, another sample is drawn at random from the voice time sequence data of a different voice category, one pairing is completed, and the pair is labelled Y = 1.
Homogeneous pairing is drawn for S1 rounds and heterogeneous pairing for S2 rounds, so that S1 + S2 pairings are obtained to form the data set used for training and testing. Because the sampling is done with replacement, there is no problem of the voice time sequence data of any voice category being insufficient in number.
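The with-replacement pairing described above can be sketched as follows in Python, assuming S1 and S2 are interpreted as per-category round counts computed beforehand from N, k and α; the function and variable names are illustrative only.

```python
import random

def build_pair_dataset(samples_by_class, s1, s2):
    """Build (X1, X2, Y) triples by sampling with replacement.

    samples_by_class: dict mapping voice-category id -> list of PCM time-series arrays.
    s1: number of homogeneous pairs drawn per category (label Y = 0).
    s2: number of heterogeneous pairs drawn per category (label Y = 1).
    """
    categories = list(samples_by_class)
    triples = []
    for c in categories:
        for _ in range(s1):                                   # homogeneous pairing rounds
            x1, x2 = random.choices(samples_by_class[c], k=2)
            triples.append((x1, x2, 0))
        for _ in range(s2):                                   # heterogeneous pairing rounds
            other = random.choice([o for o in categories if o != c])
            x1 = random.choice(samples_by_class[c])
            x2 = random.choice(samples_by_class[other])
            triples.append((x1, x2, 1))
    return triples
```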
S3, constructing a matching fragment data set; the method specifically comprises the following steps:
for a first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
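A minimal sketch of the pairing fragment construction in step S3 follows, assuming the fixed length m is even and each X1 is at least m samples long; names are illustrative.

```python
import random

def build_segment_dataset(triples, m, segments_per_pair):
    """Turn each (X1, X2, Y) triple into quadruples (Sp, Ss, S_cmp, Y).

    m: fixed segment length in samples (assumed even and no longer than X1).
    segments_per_pair: how many fixed-length segments to intercept at random from X1.
    Sp / Ss are the front / rear halves of each intercepted segment;
    S_cmp is X2, kept unchanged as the segment to be compared.
    """
    quads = []
    for x1, x2, y in triples:
        for _ in range(segments_per_pair):
            start = random.randint(0, len(x1) - m)     # random intercept position
            segment = x1[start:start + m]
            quads.append((segment[: m // 2], segment[m // 2:], x2, y))
    return quads
```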
s4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, establishing a first converter corresponding to the front data Sp of the segment, whose processing result is Sps, and a second converter corresponding to the rear data Ss of the segment, whose processing result is Ssp;
S4012, combining the two parts (Sps, Ssp) of the segment into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, for each paired segment (S, S'): the segment S is divided into its front and rear parts, the first and second converters output Ssp and Sps, these are combined into the segment Sf, and Sf is passed through the one-dimensional convolutional neural network to output Z; the segment to be compared S' is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
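A minimal PyTorch-style sketch of the network of step S4 and the loss of S4015 follows. It assumes the loss takes the standard contrastive form L = (1 − Y)·d² + Y·max(margin − d, 0)², that the two converters are small Transformer encoders, and that segments have already been framed into (batch, time, feature) tensors; all layer sizes, hyperparameters and module names are illustrative assumptions, not the claimed architecture itself.

```python
import torch
import torch.nn as nn

class SpeechRecognitionNet(nn.Module):
    """First/second converters predict the missing half of a segment; a 1-D CNN embeds segments."""

    def __init__(self, feat_dim=64, embed_dim=64):
        super().__init__()
        def converter():
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.first_converter = converter()    # Sp -> Sps (predicts the rear part)
        self.second_converter = converter()   # Ss -> Ssp (predicts the front part)
        self.cnn = nn.Sequential(             # one-dimensional CNN producing Z / Z'
            nn.Conv1d(feat_dim, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )

    def embed(self, segment):
        # segment: (batch, time, feature) -> (batch, embed_dim)
        return self.cnn(segment.transpose(1, 2))

    def forward(self, sp, ss, s_cmp):
        sps = self.first_converter(sp)                 # predicted coding of the rear data
        ssp = self.second_converter(ss)                # predicted coding of the front data
        sf = torch.cat([ssp, sps], dim=1)              # recombined complete segment Sf
        return self.embed(sf), self.embed(s_cmp)       # Z, Z'


def contrastive_loss(z, z_cmp, y, margin=1.0):
    """L = (1 - Y) * d^2 + Y * max(margin - d, 0)^2, with d = ||Z' - Z||2."""
    d = torch.norm(z_cmp - z, dim=1)
    return ((1 - y) * d.pow(2) + y * torch.clamp(margin - d, min=0).pow(2)).mean()
```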
s5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolution neural network; the method comprises the following specific steps:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, importing the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed (M0 is a user-defined natural number, 128 or 256 being suggested) is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number;
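A minimal training-loop sketch for step S5, reusing the SpeechRecognitionNet and contrastive_loss sketched under S4 and assuming the pairing fragment data set is wrapped in a standard PyTorch Dataset/DataLoader that yields fixed-shape (Sp, Ss, S', Y) tensors; the batch size M0 and epoch count K are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train(model, paired_segment_dataset, m0=128, k_epochs=10, lr=1e-3):
    """Step S5 sketch: ADAM updates over K epochs, M0 quadruples (Sp, Ss, S', Y) per batch."""
    loader = DataLoader(paired_segment_dataset, batch_size=m0, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(k_epochs):                      # one pass over all training data = one epoch
        for sp, ss, s_cmp, y in loader:            # one group of M0 items = one batch
            z, z_cmp = model(sp, ss, s_cmp)
            loss = contrastive_loss(z, z_cmp, y.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```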
S6, carrying out voice recognition through the voice recognition network:
One reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, and one-to-many pairs can be formed in this way; the pairs are input into the voice recognition network trained in S5, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
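A minimal inference sketch for step S6 follows. It assumes one reference segment per voice category, each split into its front and rear halves and passed through the two converters, while the user slice Sw takes the place of the comparison segment S'; this is one reading of the pairing described above, and all names are illustrative.

```python
import torch

@torch.no_grad()
def recognize(model, reference_segments, sw):
    """Step S6 sketch: return the index of the category whose reference is nearest to the user slice Sw.

    reference_segments: one fixed-length reference segment per voice category,
        each shaped (1, time, feature); split into front/rear halves for the converters.
    sw: the slice taken from the voice to be recognized, used as the comparison segment.
    """
    model.eval()
    distances = []
    for ref in reference_segments:
        half = ref.shape[1] // 2                              # split along the time axis
        z, z_w = model(ref[:, :half], ref[:, half:], sw)
        distances.append(torch.norm(z_w - z, dim=1).item())   # d = ||Z' - Z||2
    return min(range(len(distances)), key=distances.__getitem__)
```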
A speech recognition system based on contrast predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain voice time sequence data of PCM codes; a is a natural number greater than 1;
the paired data set construction module is used for constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories; referring to fig. 1, the construction process of the pairing data set is specifically as follows:
in fig. 1, for the purpose of explaining the problem, three speech categories are taken as an example for detailed description, the original speech time series data sample on the left side includes speech time series data samples of three speech categories, and the speech time series data of each speech category are respectively distinguished by different padding;
firstly, extracting two pieces of voice time sequence data from voice time sequence data of the same voice category to pair to obtain similar pairing, and defining a label Y of the similar pairing as 0; extracting one piece of voice time sequence data from two pieces of voice time sequence data of different voice categories respectively to pair to obtain heterogeneous pairing, and defining a label Y of the heterogeneous pairing as 1;
then, according to the total pairing number N, the number of voice categories is defined as the category number k and the homogeneous sampling ratio is defined as α; to satisfy the purpose of fair sampling, i.e. that every voice category has the same probability of being drawn, the constraint S1 + S2 = N/k is set, and the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are calculated:
S1 = α·N/k, S2 = (1 − α)·N/k
Finally, the homogeneous pairs, the heterogeneous pairs and the labels Y are combined into a data set comprising N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, and Y is the label.
The present invention constructs the data set from these voice time sequence data by pairing with with-replacement sampling:
Within the voice time sequence data samples of one voice category, two samples are drawn each time to complete one pairing, and the pair is labelled Y = 0.
For heterogeneous pairing, one voice category is chosen at random and one sample is drawn from its voice time sequence data, another sample is drawn at random from the voice time sequence data of a different voice category, one pairing is completed, and the pair is labelled Y = 1.
Homogeneous pairing is drawn for S1 rounds and heterogeneous pairing for S2 rounds, so that S1 + S2 pairings are obtained to form the data set used for training and testing. Because the sampling is done with replacement, there is no problem of the voice time sequence data of any voice category being insufficient in number.
The pairing fragment data set construction module is used for constructing a pairing fragment data set; the method specifically comprises the following steps:
for a first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
the artificial neural network constructing module is used for constructing an artificial neural network; the method comprises the following specific steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, establishing a first converter corresponding to the front data Sp of the segment, whose processing result is Sps, and a second converter corresponding to the rear data Ss of the segment, whose processing result is Ssp;
S4012, combining the two parts (Sps, Ssp) of the segment into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, for each paired segment (S, S'): the segment S is divided into its front and rear parts, the first and second converters output Ssp and Sps, these are combined into the segment Sf, and Sf is passed through the one-dimensional convolutional neural network to output Z; the segment to be compared S' is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated:
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains a voice recognition network formed by the first converter, the second converter and the one-dimensional convolution neural network;
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, importing the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed (M0 is a user-defined natural number, 128 or 256 being recommended) is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number;
the voice recognition module is used for carrying out voice recognition through a voice recognition network;
from the reference voice library of each category, one reference voice is taken to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, and one-to-many pairs can be formed in this way; the pairs are input into the voice recognition network trained by the training module, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
An information data processing terminal is used for realizing the voice recognition method based on the contrast prediction coding.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above-described method for speech recognition based on contrast prediction coding.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used in whole or in part, the implementation takes the form of a computer program product that includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims (10)

1. A speech recognition method based on contrast predictive coding, comprising the steps of:
s1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
S2, constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
s3, constructing a pairing fragment data set; the method comprises the following specific steps:
for the first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
s4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0;
s5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network;
and S6, performing voice recognition through the voice recognition network.
2. The method of claim 1, wherein M0 is 128 or 256.
3. The speech recognition method based on the contrast prediction coding according to claim 2, wherein S5 specifically comprises:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, importing the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
4. The speech recognition method based on contrast predictive coding according to claim 3, wherein S6 is specifically: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
5. A speech recognition system based on contrast predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
the paired data set construction module is used for constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
the pairing fragment data set construction module is used for constructing a pairing fragment data set; the method comprises the following specific steps:
for the first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
an artificial neural network construction module; the construction process comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0;
the training module trains a voice recognition network formed by the first converter, the second converter and the one-dimensional convolutional neural network;
and the recognition module is used for carrying out voice recognition through a voice recognition network.
6. The contrast-predictive coding-based speech recognition system of claim 5, wherein M0 is 128 or 256.
7. The system of claim 6, wherein the training module performs the training process by:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, leading the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
8. The system of claim 7, wherein the recognition process of the recognition module is as follows: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
9. An information data processing terminal for implementing the speech recognition method based on the contrast predictive coding according to any one of claims 1 to 4.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method for speech recognition based on contrast prediction coding according to any one of claims 1 to 4.
CN202210670592.2A 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding Active CN114783446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210670592.2A CN114783446B (en) 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210670592.2A CN114783446B (en) 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding

Publications (2)

Publication Number Publication Date
CN114783446A true CN114783446A (en) 2022-07-22
CN114783446B CN114783446B (en) 2022-09-06

Family

ID=82420424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210670592.2A Active CN114783446B (en) 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding

Country Status (1)

Country Link
CN (1) CN114783446B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
CN112086087A (en) * 2020-09-14 2020-12-15 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112767922A (en) * 2021-01-21 2021-05-07 中国科学技术大学 Speech recognition method for contrast predictive coding self-supervision structure joint training
CN112861976A (en) * 2021-02-11 2021-05-28 温州大学 Sensitive image identification method based on twin graph convolution hash network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
CN112086087A (en) * 2020-09-14 2020-12-15 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112767922A (en) * 2021-01-21 2021-05-07 中国科学技术大学 Speech recognition method for contrast predictive coding self-supervision structure joint training
CN112861976A (en) * 2021-02-11 2021-05-28 温州大学 Sensitive image identification method based on twin graph convolution hash network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG Deju et al.: "End-to-end speech recognition based on gated convolutional networks and CTC", Computer Engineering and Design *

Also Published As

Publication number Publication date
CN114783446B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
Nagwani et al. SMS spam filtering and thread identification using bi-level text classification and clustering techniques
EP3610420A1 (en) Neural networks for information extraction from transaction data
CN107229627B (en) Text processing method and device and computing equipment
CN113076739A (en) Method and system for realizing cross-domain Chinese text error correction
US20200293528A1 (en) Systems and methods for automatically generating structured output documents based on structural rules
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
Zadeh Preliminary draft notes on a similarity‐based analysis of time‐series with applications to prediction, decision and diagnostics
CN114783446B (en) Voice recognition method and system based on contrast predictive coding
CN113139558A (en) Method and apparatus for determining a multi-level classification label for an article
CN114707517B (en) Target tracking method based on open source data event extraction
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN115828888A (en) Method for semantic analysis and structurization of various weblogs
CN112185572B (en) Tumor specific disease database construction system, method, electronic equipment and medium
CN113011174B (en) Method for identifying purse string based on text analysis
CN114881053A (en) Sentence granularity disintegration test method for neural machine translation system
CN114879945A (en) Long-tail distribution characteristic-oriented diversified API sequence recommendation method and device
CN113934833A (en) Training data acquisition method, device and system and storage medium
CN111859896B (en) Formula document detection method and device, computer readable medium and electronic equipment
CN114491033A (en) Method for building user interest model based on word vector and topic model
CN113807920A (en) Artificial intelligence based product recommendation method, device, equipment and storage medium
CA3092332A1 (en) System and method for machine learning architecture for interdependence detection
Kameswari et al. Predicting Election Results using NLTK
CN116136866B (en) Knowledge graph-based correction method and device for Chinese news abstract factual knowledge
CN114117034B (en) Method and device for pushing texts of different styles based on intelligent model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 01, B03, 4th Floor, No. 17 Guangshun North Street, Chaoyang District, Beijing, 100102

Patentee after: Beijing Information Technology Bote Intelligent Technology Co.,Ltd.

Address before: 100089 602-4, 6th floor, building 3, 11 Changchun Bridge Road, Haidian District, Beijing

Patentee before: Beijing Information Technology Bote Intelligent Technology Co.,Ltd.