CN114783446A - Voice recognition method and system based on contrast predictive coding - Google Patents

Voice recognition method and system based on contrast predictive coding

Info

Publication number
CN114783446A
CN114783446A (application CN202210670592.2A)
Authority
CN
China
Prior art keywords
data
voice
segment
time sequence
sequence data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210670592.2A
Other languages
Chinese (zh)
Other versions
CN114783446B (en)
Inventor
戴亦斌 (DAI Yibin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Technology Bote Intelligent Technology Co ltd
Original Assignee
Beijing Information Technology Bote Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Technology Bote Intelligent Technology Co ltd filed Critical Beijing Information Technology Bote Intelligent Technology Co ltd
Priority to CN202210670592.2A priority Critical patent/CN114783446B/en
Publication of CN114783446A publication Critical patent/CN114783446A/en
Application granted granted Critical
Publication of CN114783446B publication Critical patent/CN114783446B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a voice recognition method and system based on contrast predictive coding, belonging to the technical field of voiceprint recognition and comprising the following steps: S1, collecting A voice files for each voice category and preprocessing each voice file to obtain PCM-coded voice time sequence data; S2, constructing a pairing data set of the voice time sequence data; S3, constructing a pairing fragment data set; S4, constructing an artificial neural network; S5, training a voice recognition network consisting of a first converter, a second converter and a one-dimensional convolutional neural network; and S6, performing voice recognition through the voice recognition network. The invention makes full use of the large amount of background-collected voice data that is insufficient within individual categories, treats the voice data as time sequence data, and performs end-to-end conversion directly, without extracting features from the voice time sequence data along the way.

Description

Voice recognition method and system based on contrast predictive coding
Technical Field
The invention belongs to the technical field of voiceprint recognition, and particularly relates to a voice recognition method and system based on contrast predictive coding.
Background
As is well known, speech recognition usually requires collecting a large amount of speech data: for every background environment and for every semantic condition of the speech to be recognized (including different utterances and dialects), the number of data items must be sufficient. If not enough data is collected for speech uttered in a particular dialect (or with particular text semantics) against a particular background, the recognition model may fail under that condition, for example through reduced detection accuracy or an inability to recognize the speech at all. Traditional techniques address the problem as follows: features are first extracted with one of various feature-extraction methods similar to MFCC extraction, and the features are then classified to obtain a classification result. In that setting it is critical that the data in every category be sufficient and representative; if the data are insufficient or atypical, some of the features associated with a given category may be missing or distorted, which degrades the final classification result.
Disclosure of Invention
The technical purpose is as follows: the invention provides a voice recognition method and system based on contrast predictive coding. The method makes full use of the large amount of background-collected voice data that is insufficient within individual categories and treats the voice data as time sequence data, converting end to end directly without extracting intermediate spectral features. A certain number of fixed-duration segments are extracted at random from each voice, and each segment is divided into front data and rear data; when the front data is input, a first converter predicts the coding of the rear time sequence data, and when the rear data is input, a second converter predicts the front time sequence data. After the predicted data are combined, they are compared end to end, in pairs, directly against data to be detected of the same category (or of a different category), and finally end-to-end voice recognition is achieved according to the voice category labels.
Technical scheme
The first purpose of the invention is to provide a speech recognition method based on contrast predictive coding, which comprises the following steps:
s1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
S2, constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
s3, constructing a pairing fragment data set; the method specifically comprises the following steps:
for a first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and N*M quadruples (Sp, Ss, S', Y) are obtained, forming a pairing fragment data set of N*M items;
s4, constructing an artificial neural network; the method comprises the following specific steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
s5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network;
and S6, performing voice recognition through the voice recognition network.
Preferably, M0 is 128 or 256.
Preferably, S5 is specifically:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, leading the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
Preferably, S6 is specifically: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
It is a second object of the present invention to provide a speech recognition system based on contrast predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain voice time sequence data of PCM codes; a is a natural number greater than 1;
the paired data set construction module is used for constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
the pairing fragment data set construction module is used for constructing a pairing fragment data set; the method specifically comprises the following steps:
for the first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and N*M quadruples (Sp, Ss, S', Y) are obtained, forming the pairing fragment data set;
an artificial neural network construction module; the construction process comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains a voice recognition network consisting of the first converter, the second converter and a one-dimensional convolutional neural network;
and the recognition module is used for carrying out voice recognition through a voice recognition network.
Preferably, M0 is 128 or 256.
Preferably, the training process of the training module is as follows:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, leading the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
Preferably, the recognition process of the recognition module is as follows: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
A third object of the present invention is to provide an information data processing terminal for implementing the above-mentioned speech recognition method based on the comparative predictive coding.
It is a fourth object of the present invention to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above-described method for speech recognition based on contrast predictive coding.
The invention has the advantages and positive effects that:
the invention fully utilizes a large amount of insufficient voice data acquired by a background, takes the voice data as time sequence data, directly converts end to end without extracting the characteristics of the voice sequence data in the middle, randomly extracts a certain number of fragments with fixed time length from each voice, divides each fragment into front part data and back part data, realizes coding prediction of the back part sequence data through a first converter when the front data is input, realizes the sequence prediction of the front data through a second converter when the back part data is input, directly compares the predicted data with the data to be detected of the same type (or different types) end to end in pairs after combining, and finally realizes end to end voice recognition according to the requirements of voice category labels.
Drawings
FIG. 1 is a flow chart of the construction of a data set in a preferred embodiment of the present invention;
FIG. 2 is a flow chart of the construction of an artificial neural network in a preferred embodiment of the present invention;
FIG. 3 is a flow chart of a Transformer (Transformer) in a preferred embodiment of the present invention;
fig. 4 is a flow chart of speech recognition in a preferred embodiment of the present invention.
Detailed Description
For a further understanding of the contents, features and effects of the invention, reference will now be made to the following examples, which are to be read in connection with the accompanying drawings.
Referring to fig. 1 to 4, a speech recognition method based on contrast prediction coding includes:
s1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
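By way of illustration, the following is a minimal Python sketch of the preprocessing in step S1, assuming mono 16-bit PCM WAV files; the function name load_pcm_timeseries and the normalization to [-1, 1] are illustrative choices rather than part of the described method.

```python
import wave
import numpy as np

def load_pcm_timeseries(path):
    """Read a mono WAV file and return its PCM samples as a 1-D float array in [-1, 1].

    Assumes 16-bit PCM; any resampling or denoising preprocessing is omitted here.
    """
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32)
    return samples / 32768.0
```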
S2, constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories; referring to fig. 1, the construction process of the pairing data set is specifically as follows:
in fig. 1, for the purpose of explaining the problem, three voice categories are taken as an example for detailed description, where the original voice time series data sample on the left side includes voice time series data samples of three voice categories, and the voice time series data of each voice category is distinguished by different padding;
firstly, extracting two pieces of voice time sequence data from voice time sequence data of the same voice category to pair to obtain similar pairs, and defining a label Y of the similar pairs as 0; extracting one piece of voice time sequence data from two pieces of voice time sequence data of different voice categories respectively to pair to obtain heterogeneous pairing, and defining a label Y of the heterogeneous pairing as 1;
then, according to the total pairing number N, the number of voice categories is defined as the category number k and the homogeneous sampling ratio is defined as α; to make the sampling fair, i.e. to give every voice category the same probability of being drawn, the constraint S1 + S2 = N/k is set, and the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are calculated:
S1 = α·N/k, S2 = (1 − α)·N/k
Finally, the homogeneous pairs, the heterogeneous pairs and the labels Y are combined into a data set comprising N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, and Y is the label.
The present invention constructs the data set from these voice time sequence data by pairing with with-replacement sampling:
Within the voice time sequence data samples of one voice category, two samples are drawn each time to complete one pairing, and the pair is labelled Y = 0.
For heterogeneous pairing, one voice category is chosen at random and one sample is drawn from its voice time sequence data, another sample is drawn at random from the voice time sequence data of a different voice category, one pairing is completed, and the pair is labelled Y = 1.
Homogeneous pairing is drawn for S1 rounds and heterogeneous pairing for S2 rounds, so that S1 + S2 pairings are obtained to form the data set used for training and testing. Because the sampling is done with replacement, there is no problem of the voice time sequence data of any voice category being insufficient in number.
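The with-replacement pairing described above can be sketched as follows in Python, assuming S1 and S2 are interpreted as per-category round counts computed beforehand from N, k and α; the function and variable names are illustrative only.

```python
import random

def build_pair_dataset(samples_by_class, s1, s2):
    """Build (X1, X2, Y) triples by sampling with replacement.

    samples_by_class: dict mapping voice-category id -> list of PCM time-series arrays.
    s1: number of homogeneous pairs drawn per category (label Y = 0).
    s2: number of heterogeneous pairs drawn per category (label Y = 1).
    """
    categories = list(samples_by_class)
    triples = []
    for c in categories:
        for _ in range(s1):                                   # homogeneous pairing rounds
            x1, x2 = random.choices(samples_by_class[c], k=2)
            triples.append((x1, x2, 0))
        for _ in range(s2):                                   # heterogeneous pairing rounds
            other = random.choice([o for o in categories if o != c])
            x1 = random.choice(samples_by_class[c])
            x2 = random.choice(samples_by_class[other])
            triples.append((x1, x2, 1))
    return triples
```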
S3, constructing a matching fragment data set; the method specifically comprises the following steps:
for a first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
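A minimal sketch of the pairing fragment construction in step S3 follows, assuming the fixed length m is even and each X1 is at least m samples long; names are illustrative.

```python
import random

def build_segment_dataset(triples, m, segments_per_pair):
    """Turn each (X1, X2, Y) triple into quadruples (Sp, Ss, S_cmp, Y).

    m: fixed segment length in samples (assumed even and no longer than X1).
    segments_per_pair: how many fixed-length segments to intercept at random from X1.
    Sp / Ss are the front / rear halves of each intercepted segment;
    S_cmp is X2, kept unchanged as the segment to be compared.
    """
    quads = []
    for x1, x2, y in triples:
        for _ in range(segments_per_pair):
            start = random.randint(0, len(x1) - m)     # random intercept position
            segment = x1[start:start + m]
            quads.append((segment[: m // 2], segment[m // 2:], x2, y))
    return quads
```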
s4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, establishing a first converter corresponding to the front data Sp of the segment, whose processing result is Sps, and a second converter corresponding to the rear data Ss of the segment, whose processing result is Ssp;
S4012, combining the two parts (Sps, Ssp) of the segment into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, for each paired segment (S, S'): the segment S is divided into its front and rear parts, the first and second converters output Ssp and Sps, these are combined into the segment Sf, and Sf is passed through the one-dimensional convolutional neural network to output Z; the segment to be compared S' is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
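A minimal PyTorch-style sketch of the network of step S4 and the loss of S4015 follows. It assumes the loss takes the standard contrastive form L = (1 − Y)·d² + Y·max(margin − d, 0)², that the two converters are small Transformer encoders, and that segments have already been framed into (batch, time, feature) tensors; all layer sizes, hyperparameters and module names are illustrative assumptions, not the claimed architecture itself.

```python
import torch
import torch.nn as nn

class SpeechRecognitionNet(nn.Module):
    """First/second converters predict the missing half of a segment; a 1-D CNN embeds segments."""

    def __init__(self, feat_dim=64, embed_dim=64):
        super().__init__()
        def converter():
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.first_converter = converter()    # Sp -> Sps (predicts the rear part)
        self.second_converter = converter()   # Ss -> Ssp (predicts the front part)
        self.cnn = nn.Sequential(             # one-dimensional CNN producing Z / Z'
            nn.Conv1d(feat_dim, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )

    def embed(self, segment):
        # segment: (batch, time, feature) -> (batch, embed_dim)
        return self.cnn(segment.transpose(1, 2))

    def forward(self, sp, ss, s_cmp):
        sps = self.first_converter(sp)                 # predicted coding of the rear data
        ssp = self.second_converter(ss)                # predicted coding of the front data
        sf = torch.cat([ssp, sps], dim=1)              # recombined complete segment Sf
        return self.embed(sf), self.embed(s_cmp)       # Z, Z'


def contrastive_loss(z, z_cmp, y, margin=1.0):
    """L = (1 - Y) * d^2 + Y * max(margin - d, 0)^2, with d = ||Z' - Z||2."""
    d = torch.norm(z_cmp - z, dim=1)
    return ((1 - y) * d.pow(2) + y * torch.clamp(margin - d, min=0).pow(2)).mean()
```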
s5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolution neural network; the method comprises the following specific steps:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, importing the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed (M0 is a user-defined natural number, 128 or 256 being suggested) is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number;
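A minimal training-loop sketch for step S5, reusing the SpeechRecognitionNet and contrastive_loss sketched under S4 and assuming the pairing fragment data set is wrapped in a standard PyTorch Dataset/DataLoader that yields fixed-shape (Sp, Ss, S', Y) tensors; the batch size M0 and epoch count K are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train(model, paired_segment_dataset, m0=128, k_epochs=10, lr=1e-3):
    """Step S5 sketch: ADAM updates over K epochs, M0 quadruples (Sp, Ss, S', Y) per batch."""
    loader = DataLoader(paired_segment_dataset, batch_size=m0, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(k_epochs):                      # one pass over all training data = one epoch
        for sp, ss, s_cmp, y in loader:            # one group of M0 items = one batch
            z, z_cmp = model(sp, ss, s_cmp)
            loss = contrastive_loss(z, z_cmp, y.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```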
S6, carrying out voice recognition through the voice recognition network:
One reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, and one-to-many pairs can be formed in this way; the pairs are input into the voice recognition network trained in S5, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
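A minimal inference sketch for step S6 follows. It assumes one reference segment per voice category, each split into its front and rear halves and passed through the two converters, while the user slice Sw takes the place of the comparison segment S'; this is one reading of the pairing described above, and all names are illustrative.

```python
import torch

@torch.no_grad()
def recognize(model, reference_segments, sw):
    """Step S6 sketch: return the index of the category whose reference is nearest to the user slice Sw.

    reference_segments: one fixed-length reference segment per voice category,
        each shaped (1, time, feature); split into front/rear halves for the converters.
    sw: the slice taken from the voice to be recognized, used as the comparison segment.
    """
    model.eval()
    distances = []
    for ref in reference_segments:
        half = ref.shape[1] // 2                              # split along the time axis
        z, z_w = model(ref[:, :half], ref[:, half:], sw)
        distances.append(torch.norm(z_w - z, dim=1).item())   # d = ||Z' - Z||2
    return min(range(len(distances)), key=distances.__getitem__)
```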
A speech recognition system based on contrast predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain voice time sequence data of PCM codes; a is a natural number greater than 1;
the paired data set construction module is used for constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories; referring to fig. 1, the construction process of the pairing data set is specifically as follows:
in fig. 1, for the purpose of explaining the problem, three speech categories are taken as an example for detailed description, the original speech time series data sample on the left side includes speech time series data samples of three speech categories, and the speech time series data of each speech category are respectively distinguished by different padding;
firstly, extracting two pieces of voice time sequence data from voice time sequence data of the same voice category to pair to obtain similar pairing, and defining a label Y of the similar pairing as 0; extracting one piece of voice time sequence data from two pieces of voice time sequence data of different voice categories respectively to pair to obtain heterogeneous pairing, and defining a label Y of the heterogeneous pairing as 1;
then, according to the total pairing number N, the number of voice categories is defined as the category number k and the homogeneous sampling ratio is defined as α; to satisfy the purpose of fair sampling, i.e. that every voice category has the same probability of being drawn, the constraint S1 + S2 = N/k is set, and the number of homogeneous pairs S1 and the number of heterogeneous pairs S2 to be extracted are calculated:
S1 = α·N/k, S2 = (1 − α)·N/k
Finally, the homogeneous pairs, the heterogeneous pairs and the labels Y are combined into a data set comprising N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, and Y is the label.
The present invention constructs the data set from these voice time sequence data by pairing with with-replacement sampling:
Within the voice time sequence data samples of one voice category, two samples are drawn each time to complete one pairing, and the pair is labelled Y = 0.
For heterogeneous pairing, one voice category is chosen at random and one sample is drawn from its voice time sequence data, another sample is drawn at random from the voice time sequence data of a different voice category, one pairing is completed, and the pair is labelled Y = 1.
Homogeneous pairing is drawn for S1 rounds and heterogeneous pairing for S2 rounds, so that S1 + S2 pairings are obtained to form the data set used for training and testing. Because the sampling is done with replacement, there is no problem of the voice time sequence data of any voice category being insufficient in number.
The pairing fragment data set construction module is used for constructing a pairing fragment data set; the method specifically comprises the following steps:
for a first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
the artificial neural network constructing module is used for constructing an artificial neural network; the method comprises the following specific steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, establishing a first converter corresponding to the front data Sp of the segment, whose processing result is Sps, and a second converter corresponding to the rear data Ss of the segment, whose processing result is Ssp;
S4012, combining the two parts (Sps, Ssp) of the segment into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network that receives any segment as input; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, for each paired segment (S, S'): the segment S is divided into its front and rear parts, the first and second converters output Ssp and Sps, these are combined into the segment Sf, and Sf is passed through the one-dimensional convolutional neural network to output Z; the segment to be compared S' is passed directly through the one-dimensional convolutional neural network to output Z'; the distance d is then calculated:
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0 and is usually set to 1;
the training module trains a voice recognition network formed by the first converter, the second converter and the one-dimensional convolution neural network;
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, importing the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed (M0 is a user-defined natural number, 128 or 256 being recommended) is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number;
the voice recognition module is used for carrying out voice recognition through a voice recognition network;
from the reference voice library of each category, one reference voice is taken to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, and one-to-many pairs can be formed in this way; the pairs are input into the voice recognition network trained by the training module, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
An information data processing terminal is used for realizing the voice recognition method based on the contrast prediction coding.
A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the above-described method for speech recognition based on contrast prediction coding.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used in whole or in part, the implementation takes the form of a computer program product that includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modifications, equivalent variations and modifications made to the above embodiment according to the technical spirit of the present invention are within the scope of the technical solution of the present invention.

Claims (10)

1. A speech recognition method based on contrast predictive coding, comprising the steps of:
s1, collecting A voice files of each voice category, and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
S2, constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
s3, constructing a pairing fragment data set; the method comprises the following specific steps:
for the first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
s4, constructing an artificial neural network; the method specifically comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0;
s5, training a voice recognition network consisting of the first converter, the second converter and the one-dimensional convolutional neural network;
and S6, performing voice recognition through the voice recognition network.
2. The method of claim 1, wherein M0 is 128 or 256.
3. The speech recognition method based on the contrast prediction coding according to claim 2, wherein S5 specifically comprises:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, importing the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
4. The speech recognition method based on contrast predictive coding according to claim 3, wherein S6 is specifically: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
5. A speech recognition system based on contrast predictive coding, comprising:
the preprocessing module is used for acquiring A voice files of each voice category and preprocessing each voice file to obtain PCM coded voice time sequence data; a is a natural number greater than 1;
the paired data set construction module is used for constructing a pairing data set of the voice time sequence data; the pairing data set comprises N triples (X1, X2, Y), wherein: X1 is the first piece of voice time sequence data of the triple, X2 is the second piece of voice time sequence data of the triple, the homogeneous pairing label Y is defined as 0, and the heterogeneous pairing label Y is defined as 1; each item of the homogeneous pairing set and each item of the heterogeneous pairing set consists of two pieces of voice time sequence data; the two pieces of voice time sequence data of each item of the homogeneous pairing set are voice time sequence data of the same voice category, and the two pieces of voice time sequence data of each item of the heterogeneous pairing set are voice time sequence data of different voice categories;
the pairing fragment data set construction module is used for constructing a pairing fragment data set; the method comprises the following specific steps:
for the first piece of voice time sequence data X1 in the pairing data set: first, according to a fixed length m, M segments S are randomly intercepted from X1, each segment S keeping the fixed length m; then, for every segment S of fixed length m, the front half is defined as the front data of the segment and denoted Sp, and the remaining part is defined as the rear data of the segment and denoted Ss; finally, for each segment S, the second piece of voice time sequence data X2 and the label Y corresponding to the first piece of voice time sequence data X1 of that segment are copied, the second piece of voice time sequence data X2 is renamed the segment to be compared S', and a pairing fragment data set composed of N*M quadruples (Sp, Ss, S', Y) is obtained;
an artificial neural network construction module; the construction process comprises the following steps:
S401, establishing an adversarial generation model combined with variational auto-encoding conditions, used for extracting the implicit features of the voice time sequence data;
S4011, processing the front data Sp of the segment through a first converter to obtain Sps, and processing the rear data Ss of the segment through a second converter to obtain Ssp;
S4012, combining (Sps, Ssp) into a complete segment Sf;
S4013, creating a one-dimensional convolutional neural network; when the input is the complete segment Sf the output is recorded as Z, and when the input is the segment to be compared S' the output is recorded as Z'; every time a complete segment Sf is input, a segment to be compared S' must be input immediately afterwards;
S4014, calculating the distance d from (Z, Z'):
d = ‖Z' − Z‖2
S4015, calculating the loss from the distance d and the label Y:
L = (1 − Y)·d² + Y·[max(margin − d, 0)]²
margin is a user-defined real number greater than 0;
the training module trains a voice recognition network formed by the first converter, the second converter and the one-dimensional convolutional neural network;
and the recognition module is used for carrying out voice recognition through a voice recognition network.
6. The contrast-predictive coding-based speech recognition system of claim 5, wherein M0 is 128 or 256.
7. The system of claim 6, wherein the training module performs the training process by:
s501, initializing a first converter, a second converter and a one-dimensional convolution neural network;
s502, training data are M × N matched data segments and labels;
s503, leading the training data into a voice recognition network as input one by one;
s504, calculating loss by taking L as a loss function;
s505, updating the weight of the voice recognition network by using an ADAM optimization method;
S506, each group of M0 pieces of data processed is counted as one batch, and one full pass over all the training data is counted as one epoch;
s507, training K epochs; k is a natural number.
8. The system of claim 7, wherein the recognition process of the recognition module is as follows: one reference voice is taken from the reference voice library of each category to form S'; a voice to be recognized is taken from the user and sliced to obtain a slice Sw; the slice Sw replaces the segment to be compared S' of S4013, forming one-to-many pairs in this way; the pairs are input into the voice recognition network, Z and Z' are calculated from each pair, the distance d is obtained from Z and Z', and finally a list {dw} is formed; the subscript corresponding to the minimum value in the list is found, and that subscript is the voice category number.
9. An information data processing terminal for implementing the speech recognition method based on the contrast predictive coding according to any one of claims 1 to 4.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method for speech recognition based on contrast prediction coding according to any one of claims 1 to 4.
CN202210670592.2A 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding Active CN114783446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210670592.2A CN114783446B (en) 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210670592.2A CN114783446B (en) 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding

Publications (2)

Publication Number Publication Date
CN114783446A true CN114783446A (en) 2022-07-22
CN114783446B CN114783446B (en) 2022-09-06

Family

ID=82420424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210670592.2A Active CN114783446B (en) 2022-06-15 2022-06-15 Voice recognition method and system based on contrast predictive coding

Country Status (1)

Country Link
CN (1) CN114783446B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
CN112086087A (en) * 2020-09-14 2020-12-15 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112767922A (en) * 2021-01-21 2021-05-07 中国科学技术大学 Speech recognition method for contrast predictive coding self-supervision structure joint training
CN112861976A (en) * 2021-02-11 2021-05-28 温州大学 Sensitive image identification method based on twin graph convolution hash network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
CN112086087A (en) * 2020-09-14 2020-12-15 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and device
CN112767922A (en) * 2021-01-21 2021-05-07 中国科学技术大学 Speech recognition method for contrast predictive coding self-supervision structure joint training
CN112861976A (en) * 2021-02-11 2021-05-28 温州大学 Sensitive image identification method based on twin graph convolution hash network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG Deju et al.: "End-to-end speech recognition based on gated convolutional networks and CTC", Computer Engineering and Design *

Also Published As

Publication number Publication date
CN114783446B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
Nagwani et al. SMS spam filtering and thread identification using bi-level text classification and clustering techniques
EP3610420A1 (en) Neural networks for information extraction from transaction data
CN107229627B (en) Text processing method and device and computing equipment
CN113076739A (en) Method and system for realizing cross-domain Chinese text error correction
US20200293528A1 (en) Systems and methods for automatically generating structured output documents based on structural rules
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
Zadeh Preliminary draft notes on a similarity‐based analysis of time‐series with applications to prediction, decision and diagnostics
CN114783446B (en) Voice recognition method and system based on contrast predictive coding
CN113139558A (en) Method and apparatus for determining a multi-level classification label for an article
CN114707517B (en) Target tracking method based on open source data event extraction
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN115828888A (en) Method for semantic analysis and structurization of various weblogs
CN112185572B (en) Tumor specific disease database construction system, method, electronic equipment and medium
CN113011174B (en) Method for identifying purse string based on text analysis
CN114881053A (en) Sentence granularity disintegration test method for neural machine translation system
CN114879945A (en) Long-tail distribution characteristic-oriented diversified API sequence recommendation method and device
CN113934833A (en) Training data acquisition method, device and system and storage medium
CN111859896B (en) Formula document detection method and device, computer readable medium and electronic equipment
CN114491033A (en) Method for building user interest model based on word vector and topic model
CN113807920A (en) Artificial intelligence based product recommendation method, device, equipment and storage medium
CA3092332A1 (en) System and method for machine learning architecture for interdependence detection
Kameswari et al. Predicting Election Results using NLTK
CN116136866B (en) Knowledge graph-based correction method and device for Chinese news abstract factual knowledge
CN114117034B (en) Method and device for pushing texts of different styles based on intelligent model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 01, B03, 4th Floor, No. 17 Guangshun North Street, Chaoyang District, Beijing, 100102

Patentee after: Beijing Information Technology Bote Intelligent Technology Co.,Ltd.

Address before: 100089 602-4, 6th floor, building 3, 11 Changchun Bridge Road, Haidian District, Beijing

Patentee before: Beijing Information Technology Bote Intelligent Technology Co.,Ltd.