CN115294985A - Multi-classification voice command recognition method and system based on contrastive learning - Google Patents

Multi-classification voice command recognition method and system based on contrastive learning

Info

Publication number
CN115294985A
Authority
CN
China
Prior art keywords: command, data, voice, voice command, full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211219831.9A
Other languages
Chinese (zh)
Other versions
CN115294985B (en)
Inventor
戴亦斌 (Dai Yibin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Technology Bote Intelligent Technology Co., Ltd.
Original Assignee
Beijing Information Technology Bote Intelligent Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Technology Bote Intelligent Technology Co., Ltd.
Priority to CN202211219831.9A
Publication of CN115294985A
Application granted
Publication of CN115294985B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 — Training
    • G10L 15/08 — Speech classification or search
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 — Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a multi-classification voice command recognition method and system based on contrastive learning, belonging to the technical field of voice recognition and characterized by comprising the following steps: S1, constructing a full command example data set X_2; S2, training a feature extraction network based on contrastive learning, specifically: S201, constructing single pieces of training input data; S202, processing each single piece of training input data in the feature extraction network to be trained; S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function; and S3, performing voice command recognition with the feature extraction network. Through this multi-classification voice command contrastive learning method, the invention can improve the recognition accuracy of multi-classification voice commands.

Description

Multi-classification voice command recognition method and system based on contrastive learning
Technical Field
The invention belongs to the technical field of voice recognition and particularly relates to a multi-classification voice command recognition method and system based on contrastive learning.
Background
It is well known that multi-class voice command recognition typically requires collecting a large amount of voice data, and that the number of data items for each class — voice commands issued by every type of speaker, under varied semantic conditions (tone, dialect, emotion), in varied background environments — needs to be balanced. Typically, a voice command system has a fixed set of voice commands; assume the number of command semantics to be recognized is k. To recognize these commands, the voices of different speakers issuing each command must be collected, and the diversity of the data must be considered during collection. Because of factors such as the speakers' gender, age, accent, dialect and emotion, massive data often must be collected for a single command class, and the classes frequently end up unbalanced; the structure of such a collected voice command data set is shown, in simplified form, in FIG. 1. If not enough data can be acquired for a command issued by a certain type of speaker under a particular background condition, a recognition model used under that condition can fail: detection accuracy drops, or recognition becomes impossible.
Disclosure of the Invention
Technical purpose: the invention provides a multi-classification voice command recognition method and system based on contrastive learning. The method makes full use of the unbalanced voice command semantic data collected in the background; through a multi-classification voice command contrastive learning method it extracts, from the collected background data, features related only to the general differences between voice commands and not to any specific voice command, and it uses a specially designed generated-data sampling strategy to achieve data balance and voice command comparison, thereby improving the recognition accuracy of multi-classification voice commands.
Technical scheme
The invention provides a multi-classification voice command recognition method based on contrastive learning, comprising the following steps:
S1, constructing a full command example data set X_2; specifically:
S101, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
S102, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
S103, repeating S102 for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
S2, training a feature extraction network based on contrastive learning; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and S3, performing voice command recognition with the feature extraction network.
Preferably, S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k and z_a^{(j)} denotes the j-th component of z_a;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
preferably, S203 is specifically: let M 0 PCM data count for a batch of voice commands, M per transaction 0 When the processing of each voice command or all voice commands is finished, updating the weight once for the one-dimensional convolution neural network; voice command data set X 1 One pass of all command processing is defined as an epoch, and training is terminated when the number of epochs trained reaches a threshold E.
Preferably, S3 is specifically:
S301, establishing the full command example feature set Z_C: the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories; finding the minimum determines its subscript C; the output C is the recognized command.
It is a second object of the present invention to provide a multi-classification voice command recognition system based on contrastive learning, comprising:
a construction module, which builds the full command example data set X_2; the construction process is:
first, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
then, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
and repeating the random extraction with replacement for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
a training module, which trains the feature extraction network based on contrastive learning; the specific process is:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and a recognition module, which performs voice command recognition with the feature extraction network.
Preferably, S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k and z_a^{(j)} denotes the j-th component of z_a;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
Preferably, S203 is specifically: let M_0 be the number of voice command PCM data items in one batch; every time M_0 voice commands (or all remaining voice commands) have been processed, the weights of the one-dimensional convolutional neural network are updated once; one full pass over all commands in the voice command data set X_1 is defined as an epoch, and training terminates when the number of trained epochs reaches a threshold E.
Preferably, S3 is specifically:
S301, establishing the full command example feature set Z_C: the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories; finding the minimum determines its subscript C; the output C is the recognized command.
A third object of the invention is to provide an information data processing terminal for implementing the above multi-classification voice command recognition method based on contrastive learning.
A fourth object of the invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the above multi-classification voice command recognition method based on contrastive learning.
The invention has the following advantages and positive effects:
the invention uses deep learning and makes full use of the unbalanced voice command semantic data collected in the background; through a distinctive multi-classification voice command contrastive learning method it extracts, from the collected background data, features related only to the general differences between voice commands and not to any specific voice command, and it uses a specially designed generated-data sampling strategy to achieve data balance and voice command comparison, thereby improving the recognition accuracy of multi-classification voice commands.
Drawings
FIG. 1 shows the structure of the voice command data set X_1 in a preferred embodiment of the invention;
FIG. 2 is the construction flow chart of the full command example data set X_2 in a preferred embodiment of the invention;
FIG. 3 shows the structure of the full command example data set X_2 in a preferred embodiment of the invention;
FIG. 4 is the training flow chart of the feature extraction network in a preferred embodiment of the invention;
FIG. 5 is the construction flow chart of the full command example feature set Z_C in a preferred embodiment of the invention;
FIG. 6 is the operation flow chart of the feature extraction network in a preferred embodiment of the invention.
Detailed Description
For a further understanding of the contents, features and effects of the invention, reference should be made to the following examples, which are set forth in the following detailed description and are to be read in conjunction with the accompanying drawings.
The invention mainly solves a key problem in the field of speech recognition: voice command recognition typically requires collecting a large amount of voice data, and the number of data items for each class — voice commands issued by every type of speaker, under varied semantic conditions (tone, dialect, emotion), in varied background environments — needs to be balanced. Accurately recognizing command semantics when only a few voice command examples cover all command categories has always been a major difficulty in speech recognition.
Referring to FIGS. 1 to 6, a multi-classification voice command recognition method based on contrastive learning comprises:
S1, constructing the full command example data set X_2; as shown in FIG. 2, specifically:
S101, first determining the voice command categories to be recognized according to the requirements of the specific application, and collecting the corresponding voice command PCM data per category; each piece of voice command PCM data is collected under its corresponding category, forming the per-category sets. The number of data items per category set may differ. The number of voice command categories is assumed here to be k, defined by the user. Together these per-category sets form the voice command data set X_1.
S102, under the k category sets, one piece of voice command PCM data is randomly extracted from each voice command data set X_1 to form one piece of full command example data, i.e. full command example data containing k pieces of voice command PCM data;
S103, S102 is repeated for N rounds, k and N being natural numbers greater than 1, yielding N pieces of full command example data, which together form the full command example data set X_2 containing N × k pieces of voice command PCM data, as shown in FIG. 3.
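As a concrete illustration of S101 to S103, the sketch below builds X_2 from per-category PCM lists; the function and variable names are hypothetical, since the text prescribes no particular API.

```python
import random

def build_full_command_examples(X1: dict[int, list], N: int) -> list[list]:
    """S102-S103: draw one PCM clip per category, N rounds, with replacement.

    X1 maps each of the k command categories to its (possibly unbalanced)
    list of PCM clips; the result X2 holds N full command examples, each a
    list of k clips ordered by category index.
    """
    categories = sorted(X1)  # the k command categories, in a fixed order
    return [[random.choice(X1[c]) for c in categories] for _ in range(N)]
```

Because each example draws exactly one clip per category, X_2 is balanced by construction even when the per-category lists in X_1 are not.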
S2, training the feature extraction network based on contrastive learning with the full command example data set X_2 established in S1, as shown in FIG. 4; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, one voice command data set X_1 is taken as the overall target training data; assume the set X_1 comprises S voice commands in Y categories. Each piece of voice command PCM data in X_1 is treated as data to be trained, called anchor data x_a; each piece of anchor data belongs to one command category, and the category corresponding to anchor x_a is denoted y_a.
S2012, one piece of full command example data is randomly extracted, with replacement, from X_2 and denoted x_e; evidently x_e contains k command examples, denoted in order x_e1, x_e2, …, x_ek;
S2013, the vector x = (x_a, x_e1, x_e2, …, x_ek) is taken as the x input of the neural network to be trained;
S2014, y_a is taken as the y input of the neural network to be trained;
S2015, S2012 to S2014 are repeated a fixed number of times; the size of this fixed number is decided by the user and is denoted s here. Once the repetitions are complete, s single pieces of training input data with anchor x_a have been obtained;
S2016, with all S voice commands of the data set X_1 as objects, S2015 is executed for each command, giving S × s single pieces of training input data, which fall into the Y categories.
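Under the same hypothetical naming, S2012 to S2016 reduce to pairing every anchor in X_1 with s independently drawn full command examples:

```python
import random

def build_training_inputs(X1_flat: list[tuple], X2: list[list],
                          s: int) -> list[tuple]:
    """S2012-S2016: pair each anchor with s full command examples.

    X1_flat lists all S anchors as (pcm, label) pairs; the result holds
    S * s items, each an (x, y) pair with x = (x_a, x_e1, ..., x_ek).
    """
    inputs = []
    for x_a, y_a in X1_flat:                   # every anchor (S2016)
        for _ in range(s):                     # s draws per anchor (S2015)
            x_e = random.choice(X2)            # with replacement (S2012)
            inputs.append(((x_a, *x_e), y_a))  # x input (S2013), y input (S2014)
    return inputs
```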
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, the feature extraction network to be trained takes the vector x = (x_a, x_e1, x_e2, …, x_ek) as input and, after processing, outputs the standardized single-command voice features z.
S2022, after the vector x is input, each data item in it (x_a, x_e1, x_e2, …, x_ek) is converted by cubic spline interpolation into a one-dimensional array of identical dimension, recorded as the standardized command voice data x' = (x'_a, x'_e1, x'_e2, …, x'_ek). The design of the cubic spline interpolation computation and its hyper-parameter choices are decided by the user.
S2023, x'_a, x'_e1, x'_e2, …, x'_ek are input in turn into a one-dimensional convolutional neural network, yielding in turn the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension l, and i is an integer between 1 and k. The design of the one-dimensional convolutional neural network and its hyper-parameter choices are decided by the user; it is only required that its input dimension match the output dimension of the cubic spline interpolation.
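One way S2022 and S2023 could look, assuming SciPy's CubicSpline for the resampling and PyTorch for the network; since the text leaves both designs to the user, the target length, channel widths and feature dimension l below are illustrative only.

```python
import numpy as np
from scipy.interpolate import CubicSpline
import torch
import torch.nn as nn

def resample_pcm(pcm: np.ndarray, target_len: int = 16000) -> np.ndarray:
    """S2022: map variable-length PCM onto a fixed grid by cubic spline."""
    src = np.linspace(0.0, 1.0, num=len(pcm))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return CubicSpline(src, pcm)(dst)

class FeatureExtractor(nn.Module):
    """S2023: an illustrative 1-D CNN mapping resampled PCM to an
    l-dimensional feature vector."""
    def __init__(self, l: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, l),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, target_len) -> features: (batch, l)
        return self.net(x.unsqueeze(1))
```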
S2024, the loss L is computed with z_a, z_ei and y_a as input, as follows:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k and z_a^{(j)} denotes the j-th component of z_a;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1, decided by the user.
Thus, with z_e as the general term for z_e1, z_e2, …, z_ek, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
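Read this way, the loss is a temperature-scaled softmax over the k example similarities with the matching-category example as the positive; a minimal PyTorch sketch of S2024 under that reading:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a: torch.Tensor, z_e: torch.Tensor,
                     y_a: int, tau: float = 0.1) -> torch.Tensor:
    """S2024: z_a is the (l,) anchor feature, z_e the (k, l) stacked
    example features, y_a the anchor's category index, tau the temperature."""
    sims = F.cosine_similarity(z_a.unsqueeze(0), z_e, dim=1) / tau  # (k,)
    # -log softmax at the matching category, written stably via logsumexp
    return -sims[y_a] + torch.logsumexp(sims, dim=0)
```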
S203, the weights of the one-dimensional convolutional neural network are updated by gradient descent with L as the loss function.
Let M_0 be the batch size, decided by the user and typically set to 64 or 128. Every time M_0 commands (or all remaining commands) have been processed, the weights of the one-dimensional convolutional neural network are updated once. One full pass over all commands in the voice command data set X_1 is defined as an epoch. The user may define an epoch threshold E; when the number of trained epochs reaches E, training terminates.
S3, performing voice command recognition with the feature extraction network; specifically:
S301, as shown in FIG. 5: establishing the full command example feature set Z_C.
The commands of the full command example data set X_2 established in S1 are input one by one into the trained feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C.
S302, as shown in FIG. 6: identifying the command category C with the full command example feature set Z_C and the trained feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T (note that, per S301, each full command example feature contains k commands, and within one full command example feature the command example features cover each command category exactly once, with no repetition and no omission).
The voice command issued by the end user (client) is the command to be recognized, x_m0; after processing by the trained feature extraction network, the result is the feature z_m0 of the command to be recognized. Comparing z_m0 with the n × k features of the comparison set Z_T (i.e. computing distances) yields D, the set of the n × k distances d_(ii,jj) between z_m0 and each feature of Z_T. The distance between z_m0 and any example feature z_(ii,jj) taken from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n inclusive, jj is a natural number between 1 and k inclusive, l is the dimension, and p is a natural number between 1 and l inclusive.
Averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories. The minimum can then be found and its subscript C determined.
The output C is the recognized command.
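A sketch of S302 under the same assumed shapes, with the comparison set Z_T stacked as an (n, k, l) tensor: Euclidean distances are averaged per category and the argmin gives the subscript C.

```python
import torch

def recognize(z_m0: torch.Tensor, Z_T: torch.Tensor) -> int:
    """S302: z_m0 is the (l,) feature of the command to be recognized,
    Z_T the (n, k, l) comparison set; returns the recognized category C."""
    d = torch.linalg.norm(Z_T - z_m0, dim=2)  # (n, k) Euclidean distances
    d_mean = d.mean(dim=0)                    # k per-category average distances
    return int(d_mean.argmin())               # subscript C of the minimum
```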
A multi-classification voice command recognition system based on contrastive learning comprises:
a construction module, which builds the full command example data set X_2; as shown in FIG. 2, specifically:
S101, first determining the voice command categories to be recognized according to the requirements of the specific application, and collecting the corresponding voice command PCM data per category; each piece of voice command PCM data is collected under its corresponding category, forming the per-category sets. The number of data items per category set may differ. The number of voice command categories is assumed here to be k, defined by the user. Together these per-category sets form the voice command data set X_1.
S102, under the k category sets, one piece of voice command PCM data is randomly extracted from each voice command data set X_1 to form one piece of full command example data, i.e. full command example data containing k pieces of voice command PCM data;
S103, S102 is repeated for N rounds, k and N being natural numbers greater than 1, yielding N pieces of full command example data, which together form the full command example data set X_2 containing N × k pieces of voice command PCM data, as shown in FIG. 3;
a training module, which trains the feature extraction network based on contrastive learning with the full command example data set X_2 built in the construction module, as shown in FIG. 4; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, one voice command data set X_1 is taken as the overall target training data; assume the set X_1 comprises S voice commands in Y categories. Each piece of voice command PCM data in X_1 is treated as data to be trained, called anchor data x_a; each piece of anchor data belongs to one command category, and the category corresponding to anchor x_a is denoted y_a.
S2012, one piece of full command example data is randomly extracted, with replacement, from X_2 and denoted x_e; evidently x_e contains k command examples, denoted in order x_e1, x_e2, …, x_ek;
S2013, the vector x = (x_a, x_e1, x_e2, …, x_ek) is taken as the x input of the neural network to be trained;
S2014, y_a is taken as the y input of the neural network to be trained;
S2015, S2012 to S2014 are repeated a fixed number of times; the size of this fixed number is decided by the user and is denoted s here. Once the repetitions are complete, s single pieces of training input data with anchor x_a have been obtained;
S2016, with all S voice commands of the data set X_1 as objects, S2015 is executed for each command, giving S × s single pieces of training input data, which fall into the Y categories.
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, the feature extraction network to be trained takes the vector x = (x_a, x_e1, x_e2, …, x_ek) as input and, after processing, outputs the standardized single-command voice features z.
S2022, after the vector x is input, each data item in it (x_a, x_e1, x_e2, …, x_ek) is converted by cubic spline interpolation into a one-dimensional array of identical dimension, recorded as the standardized command voice data x' = (x'_a, x'_e1, x'_e2, …, x'_ek). The design of the cubic spline interpolation computation and its hyper-parameter choices are decided by the user.
S2023, x'_a, x'_e1, x'_e2, …, x'_ek are input in turn into a one-dimensional convolutional neural network, yielding in turn the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension l, and i is an integer between 1 and k. The design of the one-dimensional convolutional neural network and its hyper-parameter choices are decided by the user; it is only required that its input dimension match the output dimension of the cubic spline interpolation.
S2024, the loss L is computed with z_a, z_ei and y_a as input, as follows:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k, z_a^{(j)} denotes the j-th component of z_a, and l is the dimension;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1, decided by the user.
Thus, with z_e as the general term for z_e1, z_e2, …, z_ek, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
S203, the weights of the one-dimensional convolutional neural network are updated by gradient descent with L as the loss function.
Let M_0 be the batch size, decided by the user and typically set to 64 or 128. Every time M_0 commands (or all remaining commands) have been processed, the weights of the one-dimensional convolutional neural network are updated once. One full pass over all commands in the voice command data set X_1 is defined as an epoch. The user may define an epoch threshold E; when the number of trained epochs reaches E, training terminates.
a recognition module, which performs voice command recognition with the feature extraction network; specifically:
S301, as shown in FIG. 5: establishing the full command example feature set Z_C.
The commands of the full command example data set X_2 established in the construction module are input one by one into the trained feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C.
S302, as shown in FIG. 6: identifying the command category C with the full command example feature set Z_C and the trained feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T (note that, per S301, each full command example feature contains k commands, and within one full command example feature the command example features cover each command category exactly once, with no repetition and no omission).
The voice command issued by the end user (client) is the command to be recognized, x_m0; after processing by the trained feature extraction network, the result is the feature z_m0 of the command to be recognized. Comparing z_m0 with the n × k features of the comparison set Z_T (i.e. computing distances) yields D, the set of the n × k distances d_(ii,jj) between z_m0 and each feature of Z_T. The distance between z_m0 and any example feature z_(ii,jj) taken from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n inclusive, jj is a natural number between 1 and k inclusive, l is the dimension, and p is a natural number between 1 and l inclusive.
Averaging within each category yields the k average distances d_1, d_2, …, d_k for the k categories. The minimum can then be found and its subscript C determined.
The output C is the recognized command.
An information data processing terminal is used to implement the above multi-classification voice command recognition method based on contrastive learning.
A computer-readable storage medium comprises instructions which, when executed on a computer, cause the computer to perform the above multi-classification voice command recognition method based on contrastive learning.
In the above embodiments, implementation may be realized wholly or partially in software, hardware, firmware, or any combination thereof. When software is used, implementation may take the form, wholly or partially, of a computer program product comprising one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present invention are generated wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)), among others.
The above description covers only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; all simple modifications, equivalent changes, and variations made to the above embodiment in accordance with the technical spirit of the present invention fall within the scope of the technical solution of the present invention.

Claims (10)

1. A multi-classification voice command recognition method based on contrastive learning, characterized by comprising the following steps:
S1, constructing a full command example data set X_2; specifically:
S101, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
S102, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
S103, repeating S102 for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
S2, training a feature extraction network based on contrastive learning; specifically:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and S3, performing voice command recognition with the feature extraction network.
2. The multi-classification voice command recognition method based on contrastive learning of claim 1, wherein S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k, z_a^{(j)} denotes the j-th component of z_a, and l is the dimension;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
3. the method for multi-class voice command recognition based on comparative learning according to claim 2, wherein S203 specifically comprises: let M 0 PCM data count for a batch of voice commands, M per transaction 0 Individual voice command or all voice commands for a one-dimensional volumeUpdating the primary weight by the product neural network; voice command data set X 1 One pass of all command processing is defined as an epoch, and training is terminated when the number of epochs trained reaches a threshold E.
4. The multi-classification voice command recognition method based on contrastive learning of claim 3, wherein S3 is specifically:
S301, establishing the full command example feature set Z_C:
the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k; finding the minimum determines its subscript C; the output C is the recognized command.
5. A multi-classification voice command recognition system based on contrastive learning, characterized by comprising:
a construction module, which builds the full command example data set X_2; the construction process is:
first, determining the k voice command categories to be recognized according to application requirements, collecting the corresponding voice command PCM data per category, and forming the voice command data sets X_1;
then, randomly extracting, with replacement, one piece of voice command PCM data from each voice command data set X_1 to form one piece of full command example data containing k pieces of voice command PCM data;
and repeating the random extraction with replacement for N rounds, k and N being natural numbers greater than 1, to obtain N pieces of full command example data, and creating from them a full command example data set X_2 containing N × k pieces of voice command PCM data;
a training module, which trains the feature extraction network based on contrastive learning; the specific process is:
S201, constructing single pieces of training input data, as follows:
S2011, taking one voice command data set X_1 as the overall target training data; the set X_1 comprises S voice commands in Y categories; each piece of voice command PCM data in X_1 is called anchor data x_a, and the voice command category corresponding to each anchor x_a is denoted y_a;
S2012, randomly extracting, with replacement, one piece of full command example data from X_2, denoted x_e; the k command examples in x_e are denoted, in order, x_e1, x_e2, …, x_ek;
S2013, taking the vector x = (x_a, x_e1, x_e2, …, x_ek) as the x input of the neural network to be trained;
S2014, taking y_a as the y input of the neural network to be trained;
S2015, repeating S2012 to S2014 s times, obtaining s single pieces of training input data;
S2016, taking all S voice commands of the data set X_1 as objects and executing S2015 for each command, obtaining S × s single pieces of training input data belonging to the Y categories;
S202, processing each single piece of training input data in the feature extraction network to be trained, as follows:
S2021, inputting the vector x = (x_a, x_e1, x_e2, …, x_ek) into the feature extraction network to be trained, which outputs the standardized single-command voice features z;
S2022, performing cubic spline interpolation on the data items (x_a, x_e1, x_e2, …, x_ek) to obtain standardized command voice data of identical dimension, x' = (x'_a, x'_e1, x'_e2, …, x'_ek);
S2023, inputting x'_a, x'_e1, x'_e2, …, x'_ek in turn into a one-dimensional convolutional neural network to obtain the outputs z_a, z_e1, z_e2, …, z_ek, where z_a and z_ei are vectors of the same dimension and i is an integer between 1 and k;
S2024, computing the loss L with z_a, z_ei and y_a as input;
S203, updating the weights of the one-dimensional convolutional neural network by gradient descent with L as the loss function;
and a recognition module, which performs voice command recognition with the feature extraction network.
6. The multi-classification voice command recognition system based on contrastive learning of claim 5, wherein S2024 is specifically:
first compute the similarity sim(z_a, z_ei) between z_a and z_ei:

$$\mathrm{sim}(z_a, z_{ei}) = \frac{\sum_{j=1}^{l} z_a^{(j)} z_{ei}^{(j)}}{\sqrt{\sum_{j=1}^{l} (z_a^{(j)})^2}\sqrt{\sum_{j=1}^{l} (z_{ei}^{(j)})^2}}$$

where i is an integer between 1 and k, z_a^{(j)} denotes the j-th component of z_a, and l is the dimension;
then compute the contrastive loss L(z_a, z_ei) between z_a and z_ei:

$$L(z_a, z_{ei}) = -\mathbb{1}_{[i = y_a]}\,\log\frac{\exp(\mathrm{sim}(z_a, z_{ei})/\tau)}{\sum_{m=1}^{k}\exp(\mathrm{sim}(z_a, z_{em})/\tau)}$$

where the indicator $\mathbb{1}_{[i = y_a]}$ means the term is computed if and only if i = y_a, and the temperature coefficient $\tau$ is a constant between 0 and 1;
defining z_e as the general term for z_e1, z_e2, …, z_ek in a single piece of training data, the total contrastive loss L between z_a and z_e of a single piece of training data is:

$$L = \sum_{i=1}^{k} L(z_a, z_{ei})$$
7. the system for multi-class voice command recognition based on comparative learning according to claim 6, wherein S203 is specifically: let M 0 PCM data count for a batch of voice commands, M per transaction 0 Individual voice command or all voice commands for a one-dimensional volumeUpdating the primary weight by the product neural network; voice command data set X 1 One pass of all command processing is defined as an epoch, and training is terminated when the number of epochs trained reaches a threshold E.
8. The multi-classification voice command recognition system based on contrastive learning of claim 7, wherein S3 is specifically:
S301, establishing the full command example feature set Z_C:
the commands of the full command example data set X_2 are input one by one into the feature extraction network, and the outputs are collected in the original order, giving the full command example feature set Z_C;
S302, identifying the command category C with the full command example feature set Z_C and the feature extraction network; specifically:
n pieces of data are randomly selected from the full command example feature set Z_C to form the comparison set Z_T;
the command to be recognized, x_m0, is processed by the trained feature extraction network, which outputs the feature z_m0 of the command to be recognized; comparing z_m0 with the n × k features of the comparison set Z_T yields D, the set of the n × k distances between z_m0 and the features of Z_T; the distance between z_m0 and an example feature z_(ii,jj) taken at random from the comparison set Z_T is computed as:

$$d_{(ii,jj)} = \sqrt{\sum_{p=1}^{l}\left(z_{m0}^{(p)} - z_{(ii,jj)}^{(p)}\right)^2}$$

where ii is a natural number between 1 and n, jj is a natural number between 1 and k, l is the dimension, and p is a natural number between 1 and l;
averaging within each category yields the k average distances d_1, d_2, …, d_k; finding the minimum determines its subscript C; the output C is the recognized command.
9. An information data processing terminal, characterized in that it is used to implement the multi-classification voice command recognition method based on contrastive learning of any one of claims 1 to 4.
10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the multi-classification voice command recognition method based on contrastive learning of any one of claims 1 to 4.
CN202211219831.9A 2022-10-08 2022-10-08 Multi-classification voice command recognition method and system based on contrastive learning Active CN115294985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211219831.9A CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211219831.9A CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning

Publications (2)

Publication Number Publication Date
CN115294985A true CN115294985A (en) 2022-11-04
CN115294985B CN115294985B (en) 2022-12-09

Family

ID=83834177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211219831.9A Active CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning

Country Status (1)

Country Link
CN (1) CN115294985B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210089724A1 (en) * 2019-09-25 2021-03-25 Google Llc Contrastive Pre-Training for Language Tasks
US20220191305A1 (en) * 2020-12-11 2022-06-16 International Business Machines Corporation Identifying a voice command boundary
CN113239903A (en) * 2021-07-08 2021-08-10 中国人民解放军国防科技大学 Cross-modal lip reading antagonism dual-contrast self-supervision learning method
CN113593611A (en) * 2021-07-26 2021-11-02 平安科技(深圳)有限公司 Voice classification network training method and device, computing equipment and storage medium
CN114648982A (en) * 2022-05-24 2022-06-21 四川大学 Controller voice recognition method and device based on comparative learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Wenlin et al.: "Self-supervised speech representation learning based on positive-sample contrast and masked reconstruction", Journal on Communications *
ZHAO Caiguang et al.: "GRBM speech recognition based on improved contrastive divergence", Computer Engineering *

Also Published As

Publication number Publication date
CN115294985B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
Lozano-Diez et al. An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition
US10592607B2 (en) Iterative alternating neural attention for machine reading
CN111930914B (en) Problem generation method and device, electronic equipment and computer readable storage medium
CN109767787A (en) Emotion identification method, equipment and readable storage medium storing program for executing
CN111862984A (en) Signal input method and device, electronic equipment and readable storage medium
WO2022252636A1 (en) Artificial intelligence-based answer generation method and apparatus, device, and storage medium
JP7332024B2 (en) Recognition device, learning device, method thereof, and program
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
KR20200041199A (en) Method, apparatus and computer-readable medium for operating chatbot
CN114022192A (en) Data modeling method and system based on intelligent marketing scene
JP7329393B2 (en) Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program
Zheng et al. Contrastive auto-encoder for phoneme recognition
CN116775873A (en) Multi-mode dialogue emotion recognition method
CN115294985B (en) Multi-classification voice command recognition method and system based on contrastive learning
CN114882888A (en) Voiceprint recognition method and system based on variational self-coding and countermeasure generation network
CN109872721A (en) Voice authentication method, information processing equipment and storage medium
TW202314579A (en) Machine reading comprehension apparatus and method
JP6728083B2 (en) Intermediate feature amount calculation device, acoustic model learning device, speech recognition device, intermediate feature amount calculation method, acoustic model learning method, speech recognition method, program
CN114328923A (en) Citation intention classification method based on multi-task bilateral branch network
Widhi et al. Implementation Of Deep Learning For Fake News Classification In Bahasa Indonesia
Lee et al. Improved model adaptation approach for recognition of reduced-frame-rate continuous speech
Rajendran et al. RETRACTED ARTICLE: Preserving learnability and intelligibility at the point of care with assimilation of different speech recognition techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 01, B03, 4th Floor, No. 17 Guangshun North Street, Chaoyang District, Beijing, 100102

Patentee after: Beijing Information Technology Bote Intelligent Technology Co., Ltd.

Address before: 100089, 602-4, 6th Floor, Building 3, No. 11 Changchun Bridge Road, Haidian District, Beijing

Patentee before: Beijing Information Technology Bote Intelligent Technology Co., Ltd.