CN115294985A - Multi-classification voice command recognition method and system based on contrastive learning - Google Patents

Multi-classification voice command recognition method and system based on contrastive learning

Info

Publication number
CN115294985A
Authority
CN
China
Prior art keywords: command, data, voice, voice command, full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211219831.9A
Other languages
Chinese (zh)
Other versions
CN115294985B (en)
Inventor
戴亦斌 (Dai Yibin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Technology Bote Intelligent Technology Co., Ltd.
Original Assignee
Beijing Information Technology Bote Intelligent Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Technology Bote Intelligent Technology Co., Ltd.
Priority to CN202211219831.9A
Publication of CN115294985A
Application granted
Publication of CN115294985B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 — Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a multi-classification voice command recognition method and system based on contrastive learning, belonging to the technical field of voice recognition and characterized by comprising the following steps: S1, constructing a full command example data set X_2; S2, training a feature extraction network based on contrastive learning, specifically: S201, constructing single pieces of training input data; S202, processing each single piece of training input data in the feature extraction network to be trained; S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function; and S3, performing voice command recognition with the feature extraction network. Through this multi-classification voice command contrastive learning method, the invention can improve the recognition accuracy of multi-classification voice commands.

Description

Multi-classification voice command recognition method and system based on contrastive learning
Technical Field
The invention belongs to the technical field of voice recognition and particularly relates to a multi-classification voice command recognition method and system based on contrastive learning.
Background
It is well known that multi-class voice command recognition typically requires collecting a large amount of voice data, and that the number of data items for each class — voice commands issued by every type of speaker, under varied semantic conditions (tone, dialect, emotion), in varied background environments — needs to be balanced. Typically, a voice command system has a fixed set of voice commands; assume the number of command semantics to be recognized is k. To recognize these commands, the voices of different speakers issuing each command must be collected, and the diversity of the data must be considered during collection. Because of factors such as the speakers' gender, age, accent, dialect and emotion, massive data often must be collected for a single command class, and the classes frequently end up unbalanced; the structure of such a collected voice command data set is shown, in simplified form, in FIG. 1. If not enough data can be acquired for a command issued by a certain type of speaker under a particular background condition, a recognition model used under that condition can fail: detection accuracy drops, or recognition becomes impossible.
Disclosure of the Invention
Technical purpose: the invention provides a multi-classification voice command recognition method and system based on contrastive learning. The method makes full use of the unbalanced voice command semantic data collected in the background; through a multi-classification voice command contrastive learning method it extracts, from the collected background data, features related only to the general differences between voice commands and not to any specific voice command, and it uses a specially designed generated-data sampling strategy to achieve data balance and voice command comparison, thereby improving the recognition accuracy of multi-classification voice commands.
Technical scheme
The invention provides a multi-classification voice command recognition method based on contrastive learning, comprising the following steps:
S1, constructing a full command example data set X_2; specifically:
S101, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
S102, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
S103, repeating S102 for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
S2, training a feature extraction network based on contrastive learning; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and S3, performing voice command recognition with the feature extraction network.
Preferably, S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k and z_a^{(j)} denotes the j-th component of z_a;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
preferably, S203 is specifically: let M 0 PCM data count for a batch of voice commands, M per transaction 0 When the processing of each voice command or all voice commands is finished, updating the weight once for the one-dimensional convolution neural network; voice command data set X 1 One pass of all command processing is defined as an epoch, and training is terminated when the number of epochs trained reaches a threshold E.
Preferably, S3 is specifically:
S301, establishing the full command example feature set Z_C: the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories; finding the minimum determines its subscript C; the output C is the recognized command.
It is a second object of the present invention to provide a multi-classification voice command recognition system based on contrastive learning, comprising:
a construction module, which builds the full command example data set X_2; the construction process is:
first, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
then, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
and repeating the random extraction with replacement for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
a training module, which trains the feature extraction network based on contrastive learning; the specific process is:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and a recognition module, which performs voice command recognition with the feature extraction network.
Preferably, S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k and z_a^{(j)} denotes the j-th component of z_a;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
Preferably, S203 is specifically: let M_0 be the number of voice command PCM data items in one batch; every time M_0 voice commands (or all remaining voice commands) have been processed, the weights of the one-dimensional convolutional neural network are updated once; one full pass over all commands in the voice command data set X_1 is defined as an epoch, and training terminates when the number of trained epochs reaches a threshold E.
Preferably, S3 is specifically:
S301, establishing the full command example feature set Z_C: the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories; finding the minimum determines its subscript C; the output C is the recognized command.
A third object of the invention is to provide an information data processing terminal for implementing the above multi-classification voice command recognition method based on contrastive learning.
A fourth object of the invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above multi-classification voice command recognition method based on contrastive learning.
The invention has the following advantages and positive effects:
the invention uses deep learning and makes full use of the unbalanced voice command semantic data collected in the background; through a distinctive multi-classification voice command contrastive learning method it extracts, from the collected background data, features related only to the general differences between voice commands and not to any specific voice command, and it uses a specially designed generated-data sampling strategy to achieve data balance and voice command comparison, thereby improving the recognition accuracy of multi-classification voice commands.
Drawings
FIG. 1 shows the structure of the voice command data set X_1 in a preferred embodiment of the invention;
FIG. 2 is the construction flow chart of the full command example data set X_2 in a preferred embodiment of the invention;
FIG. 3 shows the structure of the full command example data set X_2 in a preferred embodiment of the invention;
FIG. 4 is the training flow chart of the feature extraction network in a preferred embodiment of the invention;
FIG. 5 is the construction flow chart of the full command example feature set Z_C in a preferred embodiment of the invention;
FIG. 6 is the operation flow chart of the feature extraction network in a preferred embodiment of the invention.
Detailed Description
For a further understanding of the contents, features and effects of the invention, reference should be made to the following examples, which are set forth in the following detailed description and are to be read in conjunction with the accompanying drawings.
The invention mainly solves a key problem in the field of speech recognition: voice command recognition typically requires collecting a large amount of voice data, and the number of data items for each class — voice commands issued by every type of speaker, under varied semantic conditions (tone, dialect, emotion), in varied background environments — needs to be balanced. Accurately recognizing command semantics when only a few voice command examples cover all command categories has always been a major difficulty in speech recognition.
Referring to FIGS. 1 to 6, a multi-classification voice command recognition method based on contrastive learning comprises:
S1, constructing the full command example data set X_2; as shown in FIG. 2, specifically:
S101, first determining the voice command categories to be recognized according to the requirements of the specific application, and collecting the corresponding voice command PCM data per category; each piece of voice command PCM data is collected under its corresponding category, forming the per-category sets. The number of data items per category set may differ. The number of voice command categories is assumed here to be k, defined by the user. Together these per-category sets form the voice command data set X_1.
S102, under the k category sets, one piece of voice command PCM data is randomly extracted from each voice command data set X_1 to form one piece of full command example data, i.e. full command example data containing k pieces of voice command PCM data;
S103, S102 is repeated for N rounds, k and N being natural numbers greater than 1, yielding N pieces of full command example data, which together form the full command example data set X_2 containing N × k pieces of voice command PCM data, as shown in FIG. 3.
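As a concrete illustration of S101 to S103, the sketch below builds X_2 from per-category PCM lists; the function and variable names are hypothetical, since the text prescribes no particular API.

```python
import random

def build_full_command_examples(X1: dict[int, list], N: int) -> list[list]:
    """S102-S103: draw one PCM clip per category, N rounds, with replacement.

    X1 maps each of the k command categories to its (possibly unbalanced)
    list of PCM clips; the result X2 holds N full command examples, each a
    list of k clips ordered by category index.
    """
    categories = sorted(X1)  # the k command categories, in a fixed order
    return [[random.choice(X1[c]) for c in categories] for _ in range(N)]
```

Because each example draws exactly one clip per category, X_2 is balanced by construction even when the per-category lists in X_1 are not.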
S2, training the feature extraction network based on contrastive learning with the full command example data set X_2 established in S1, as shown in FIG. 4; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, one voice command data set X_1 is taken as the overall target training data; assume the set X_1 comprises S voice commands in Y categories. Each piece of voice command PCM data in X_1 is treated as data to be trained, called anchor data x_a; each piece of anchor data belongs to one command category, and the category corresponding to anchor x_a is denoted y_a.
S2012, one piece of full command example data is randomly extracted, with replacement, from X_2 and denoted x_e; evidently x_e contains k command examples, denoted in order x_e1, x_e2, …, x_ek;
S2013, the vector x = (x_a, x_e1, x_e2, …, x_ek) is taken as the x input of the neural network to be trained;
S2014, y_a is taken as the y input of the neural network to be trained;
S2015, S2012 to S2014 are repeated a fixed number of times; the size of this fixed number is decided by the user and is denoted s here. Once the repetitions are complete, s single pieces of training input data with anchor x_a have been obtained;
S2016, with all S voice commands of the data set X_1 as objects, S2015 is executed for each command, giving S × s single pieces of training input data, which fall into the Y categories.
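Under the same hypothetical naming, S2012 to S2016 reduce to pairing every anchor in X_1 with s independently drawn full command examples:

```python
import random

def build_training_inputs(X1_flat: list[tuple], X2: list[list],
                          s: int) -> list[tuple]:
    """S2012-S2016: pair each anchor with s full command examples.

    X1_flat lists all S anchors as (pcm, label) pairs; the result holds
    S * s items, each an (x, y) pair with x = (x_a, x_e1, ..., x_ek).
    """
    inputs = []
    for x_a, y_a in X1_flat:                   # every anchor (S2016)
        for _ in range(s):                     # s draws per anchor (S2015)
            x_e = random.choice(X2)            # with replacement (S2012)
            inputs.append(((x_a, *x_e), y_a))  # x input (S2013), y input (S2014)
    return inputs
```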
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, the feature extraction network to be trained takes the vector x = (x_a, x_e1, x_e2, …, x_ek) as input and, after processing, outputs the standardized single-command voice features z.
S2022, after the vector x is input, each data item in it (x_a, x_e1, x_e2, …, x_ek) is converted by cubic spline interpolation into a one-dimensional array of identical dimension, recorded as the standardized command voice data x' = (x'_a, x'_e1, x'_e2, …, x'_ek). The design of the cubic spline interpolation computation and its hyper-parameter choices are decided by the user.
S2023, x'_a, x'_e1, x'_e2, …, x'_ek are input in turn into a one-dimensional convolutional neural network, yielding in turn the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension l, and i is an integer between 1 and k. The design of the one-dimensional convolutional neural network and its hyper-parameter choices are decided by the user; it is only required that its input dimension match the output dimension of the cubic spline interpolation.
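One way S2022 and S2023 could look, assuming SciPy's CubicSpline for the resampling and PyTorch for the network; since the text leaves both designs to the user, the target length, channel widths and feature dimension l below are illustrative only.

```python
import numpy as np
from scipy.interpolate import CubicSpline
import torch
import torch.nn as nn

def resample_pcm(pcm: np.ndarray, target_len: int = 16000) -> np.ndarray:
    """S2022: map variable-length PCM onto a fixed grid by cubic spline."""
    src = np.linspace(0.0, 1.0, num=len(pcm))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return CubicSpline(src, pcm)(dst)

class FeatureExtractor(nn.Module):
    """S2023: an illustrative 1-D CNN mapping resampled PCM to an
    l-dimensional feature vector."""
    def __init__(self, l: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, l),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, target_len) -> features: (batch, l)
        return self.net(x.unsqueeze(1))
```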
S2024, the loss L is computed with z_a, z_ei and y_a as input, as follows:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k and z_a^{(j)} denotes the j-th component of z_a;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1, decided by the user.
Thus, with z_e as the general term for z_e1, z_e2, …, z_ek, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
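Read this way, the loss is a temperature-scaled softmax over the k example similarities with the matching-category example as the positive; a minimal PyTorch sketch of S2024 under that reading:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a: torch.Tensor, z_e: torch.Tensor,
                     y_a: int, tau: float = 0.1) -> torch.Tensor:
    """S2024: z_a is the (l,) anchor feature, z_e the (k, l) stacked
    example features, y_a the anchor's category index, tau the temperature."""
    sims = F.cosine_similarity(z_a.unsqueeze(0), z_e, dim=1) / tau  # (k,)
    # -log softmax at the matching category, written stably via logsumexp
    return -sims[y_a] + torch.logsumexp(sims, dim=0)
```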
S203, the weights of the one-dimensional convolutional neural network are updated by gradient descent with L as the loss function.
Let M_0 be the batch size, decided by the user and typically set to 64 or 128. Every time M_0 commands (or all remaining commands) have been processed, the weights of the one-dimensional convolutional neural network are updated once. One full pass over all commands in the voice command data set X_1 is defined as an epoch. The user may define an epoch threshold E; when the number of trained epochs reaches E, training terminates.
S3, performing voice command recognition with the feature extraction network; specifically:
S301, as shown in FIG. 5: establishing the full command example feature set Z_C.
The commands of the full command example data set X_2 established in S1 are input one by one into the trained feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C.
S302, as shown in FIG. 6: identifying the command category C with the full command example feature set Z_C and the trained feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T (note that, per S301, each full command example feature contains k commands, and within one full command example feature the command example features cover each command category exactly once, with no repetition and no omission).
The voice command issued by the end user (client) is the command to be recognized, x_m0; after processing by the trained feature extraction network, the result is the feature z_m0 of the command to be recognized. Comparing z_m0 with the n × k features of the comparison set Z_T (i.e. computing distances) yields D, the set of the n × k distances d_(ii,jj) between z_m0 and each feature of Z_T. The distance between z_m0 and any example feature z_(ii,jj) taken from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n inclusive, jj is a natural number between 1 and k inclusive, l is the dimension, and p is a natural number between 1 and l inclusive.
Averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories. The minimum can then be found and its subscript C determined.
The output C is the recognized command.
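A sketch of S302 under the same assumed shapes, with the comparison set Z_T stacked as an (n, k, l) tensor: Euclidean distances are averaged per category and the argmin gives the subscript C.

```python
import torch

def recognize(z_m0: torch.Tensor, Z_T: torch.Tensor) -> int:
    """S302: z_m0 is the (l,) feature of the command to be recognized,
    Z_T the (n, k, l) comparison set; returns the recognized category C."""
    d = torch.linalg.norm(Z_T - z_m0, dim=2)  # (n, k) Euclidean distances
    d_mean = d.mean(dim=0)                    # k per-category average distances
    return int(d_mean.argmin())               # subscript C of the minimum
```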
A multi-classification voice command recognition system based on contrastive learning comprises:
a construction module, which builds the full command example data set X_2; as shown in FIG. 2, specifically:
S101, first determining the voice command categories to be recognized according to the requirements of the specific application, and collecting the corresponding voice command PCM data per category; each piece of voice command PCM data is collected under its corresponding category, forming the per-category sets. The number of data items per category set may differ. The number of voice command categories is assumed here to be k, defined by the user. Together these per-category sets form the voice command data set X_1.
S102, under the k category sets, one piece of voice command PCM data is randomly extracted from each voice command data set X_1 to form one piece of full command example data, i.e. full command example data containing k pieces of voice command PCM data;
S103, S102 is repeated for N rounds, k and N being natural numbers greater than 1, yielding N pieces of full command example data, which together form the full command example data set X_2 containing N × k pieces of voice command PCM data, as shown in FIG. 3;
a training module, which trains the feature extraction network based on contrastive learning with the full command example data set X_2 built in the construction module, as shown in FIG. 4; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, one voice command data set X_1 is taken as the overall target training data; assume the set X_1 comprises S voice commands in Y categories. Each piece of voice command PCM data in X_1 is treated as data to be trained, called anchor data x_a; each piece of anchor data belongs to one command category, and the category corresponding to anchor x_a is denoted y_a.
S2012, one piece of full command example data is randomly extracted, with replacement, from X_2 and denoted x_e; evidently x_e contains k command examples, denoted in order x_e1, x_e2, …, x_ek;
S2013, the vector x = (x_a, x_e1, x_e2, …, x_ek) is taken as the x input of the neural network to be trained;
S2014, y_a is taken as the y input of the neural network to be trained;
S2015, S2012 to S2014 are repeated a fixed number of times; the size of this fixed number is decided by the user and is denoted s here. Once the repetitions are complete, s single pieces of training input data with anchor x_a have been obtained;
S2016, with all S voice commands of the data set X_1 as objects, S2015 is executed for each command, giving S × s single pieces of training input data, which fall into the Y categories.
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, the feature extraction network to be trained takes the vector x = (x_a, x_e1, x_e2, …, x_ek) as input and, after processing, outputs the standardized single-command voice features z.
S2022, after the vector x is input, each data item in it (x_a, x_e1, x_e2, …, x_ek) is converted by cubic spline interpolation into a one-dimensional array of identical dimension, recorded as the standardized command voice data x' = (x'_a, x'_e1, x'_e2, …, x'_ek). The design of the cubic spline interpolation computation and its hyper-parameter choices are decided by the user.
S2023, x'_a, x'_e1, x'_e2, …, x'_ek are input in turn into a one-dimensional convolutional neural network, yielding in turn the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension l, and i is an integer between 1 and k. The design of the one-dimensional convolutional neural network and its hyper-parameter choices are decided by the user; it is only required that its input dimension match the output dimension of the cubic spline interpolation.
S2024, the loss L is computed with z_a, z_ei and y_a as input, as follows:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k, z_a^{(j)} denotes the j-th component of z_a, and l is the dimension;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1, decided by the user.
Thus, with z_e as the general term for z_e1, z_e2, …, z_ek, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
S203, the weights of the one-dimensional convolutional neural network are updated by gradient descent with L as the loss function.
Let M_0 be the batch size, decided by the user and typically set to 64 or 128. Every time M_0 commands (or all remaining commands) have been processed, the weights of the one-dimensional convolutional neural network are updated once. One full pass over all commands in the voice command data set X_1 is defined as an epoch. The user may define an epoch threshold E; when the number of trained epochs reaches E, training terminates.
a recognition module, which performs voice command recognition with the feature extraction network; specifically:
S301, as shown in FIG. 5: establishing the full command example feature set Z_C.
The commands of the full command example data set X_2 established in the construction module are input one by one into the trained feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C.
S302, as shown in FIG. 6: identifying the command category C with the full command example feature set Z_C and the trained feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T (note that, per S301, each full command example feature contains k commands, and within one full command example feature the command example features cover each command category exactly once, with no repetition and no omission).
The voice command issued by the end user (client) is the command to be recognized, x_m0; after processing by the trained feature extraction network, the result is the feature z_m0 of the command to be recognized. Comparing z_m0 with the n × k features of the comparison set Z_T (i.e. computing distances) yields D, the set of the n × k distances d_(ii,jj) between z_m0 and each feature of Z_T. The distance between z_m0 and any example feature z_(ii,jj) taken from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n inclusive, jj is a natural number between 1 and k inclusive, l is the dimension, and p is a natural number between 1 and l inclusive.
Averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories. The minimum can then be found and its subscript C determined.
The output C is the recognized command.
An information data processing terminal is used to implement the above multi-classification voice command recognition method based on contrastive learning.
A computer-readable storage medium comprises instructions which, when executed on a computer, cause the computer to perform the above multi-classification voice command recognition method based on contrastive learning.
In the above embodiments, implementation may be realized wholly or partially in software, hardware, firmware, or any combination thereof. When software is used, implementation may take the form, wholly or partially, of a computer program product comprising one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present invention are generated wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.
The above description covers only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; all simple modifications, equivalent changes, and variations made to the above embodiment in accordance with the technical spirit of the present invention fall within the scope of the technical solution of the present invention.

Claims (10)

1. A multi-classification voice command recognition method based on contrastive learning, characterized by comprising the following steps:
S1, constructing a full command example data set X_2; specifically:
S101, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
S102, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
S103, repeating S102 for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
S2, training a feature extraction network based on contrastive learning; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and S3, performing voice command recognition with the feature extraction network.
2. The multi-classification voice command recognition method based on contrastive learning of claim 1, wherein S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k, z_a^{(j)} denotes the j-th component of z_a, and l is the dimension;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
3. the method for multi-class voice command recognition based on comparative learning according to claim 2, wherein S203 specifically comprises: let M 0 PCM data count for a batch of voice commands, M per transaction 0 Individual voice command or all voice commands for a one-dimensional volumeUpdating the primary weight by the product neural network; voice command data set X 1 One pass of all command processing is defined as an epoch, and training is terminated when the number of epochs trained reaches a threshold E.
4. The multi-classification voice command recognition method based on contrastive learning of claim 3, wherein S3 is specifically:
S301, establishing the full command example feature set Z_C:
the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k; finding the minimum determines its subscript C; the output C is the recognized command.
5. A multi-classification voice command recognition system based on contrastive learning, characterized by comprising:
a construction module, which builds the full command example data set X_2; the construction process is:
first, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
then, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
and repeating the random extraction with replacement for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
a training module, which trains the feature extraction network based on contrastive learning; the specific process is:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and a recognition module, which performs voice command recognition with the feature extraction network.
6. The multi-classification voice command recognition system based on contrastive learning of claim 5, wherein S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k, z_a^{(j)} denotes the j-th component of z_a, and l is the dimension;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
7. the system for multi-class voice command recognition based on comparative learning according to claim 6, wherein S203 is specifically: let M 0 PCM data count for a batch of voice commands, M per transaction 0 Individual voice command or all voice commands for a one-dimensional volumeUpdating the primary weight by the product neural network; voice command data set X 1 One pass of all command processing is defined as an epoch, and training is terminated when the number of epochs trained reaches a threshold E.
8. The multi-classification voice command recognition system based on contrastive learning of claim 7, wherein S3 is specifically:
S301, establishing the full command example feature set Z_C:
the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k; finding the minimum determines its subscript C; the output C is the recognized command.
9. An information data processing terminal, characterized in that it is used to implement the multi-classification voice command recognition method based on contrastive learning of any one of claims 1 to 4.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the multi-classification voice command recognition method based on contrastive learning of any one of claims 1 to 4.
CN202211219831.9A 2022-10-08 2022-10-08 Multi-classification voice command recognition method and system based on contrastive learning Active CN115294985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211219831.9A CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211219831.9A CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning

Publications (2)

Publication Number Publication Date
CN115294985A true CN115294985A (en) 2022-11-04
CN115294985B CN115294985B (en) 2022-12-09

Family

ID=83834177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211219831.9A Active CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning

Country Status (1)

Country Link
CN (1) CN115294985B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089724A1 (en) * 2019-09-25 2021-03-25 Google Llc Contrastive Pre-Training for Language Tasks
US20220191305A1 (en) * 2020-12-11 2022-06-16 International Business Machines Corporation Identifying a voice command boundary
CN113239903A (en) * 2021-07-08 2021-08-10 中国人民解放军国防科技大学 Cross-modal lip reading antagonism dual-contrast self-supervision learning method
CN113593611A (en) * 2021-07-26 2021-11-02 平安科技(深圳)有限公司 Voice classification network training method and device, computing equipment and storage medium
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Wenlin et al.: "Self-supervised speech representation learning based on positive-sample contrast and masked reconstruction", Journal on Communications *
ZHAO Caiguang et al.: "GRBM speech recognition based on improved contrastive divergence", Computer Engineering *

Also Published As

Publication number Publication date
CN115294985B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
Lozano-Diez et al. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition
US10592607B2 (en) Iterative alternating neural attention for machine reading
CN111930914B (en) Problem generation method and device, electronic equipment and computer readable storage medium
CN109767787A (en) Emotion identification method, equipment and readable storage medium storing program for executing
CN111862984A (en) Signal input method and device, electronic equipment and readable storage medium
WO2022252636A1 (en) Artificial intelligence-based answer generation method and apparatus, device, and storage medium
JP7332024B2 (en) Recognition device, learning device, method thereof, and program
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
KR20200041199A (en) Method, apparatus and computer-readable medium for operating chatbot
CN114022192A (en) Data modeling method and system based on intelligent marketing scene
JP7329393B2 (en) Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program
Zheng et al. Contrastive auto-encoder for phoneme recognition
CN116775873A (en) Multi-mode dialogue emotion recognition method
CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning
CN114882888A (en) Voiceprint recognition method and system based on variational self-coding and countermeasure generation network
CN109872721A (en) Voice authentication method, information processing equipment and storage medium
TW202314579A (en) Machine reading comprehension apparatus and method
JP6728083B2 (en) Intermediate feature amount calculation device, acoustic model learning device, speech recognition device, intermediate feature amount calculation method, acoustic model learning method, speech recognition method, program
CN114328923A (en) Citation intention classification method based on multi-task bilateral branch network
Widhi et al. Implementation Of Deep Learning For Fake News Classification In Bahasa Indonesia
Lee et al. Improved model adaptation approach for recognition of reduced-frame-rate continuous speech
Rajendran et al. RETRACTED ARTICLE: Preserving learnability and intelligibility at the point of care with assimilation of different speech recognition techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 01, B03, 4th Floor, No. 17 Guangshun North Street, Chaoyang District, Beijing, 100102

Patentee after: Beijing Information Technology Bote Intelligent Technology Co., Ltd.

Address before: 100089, 602-4, 6th Floor, Building 3, No. 11 Changchun Bridge Road, Haidian District, Beijing

Patentee before: Beijing Information Technology Bote Intelligent Technology Co., Ltd.