CN114495935A - Voice control method and system of intelligent device and electronic device - Google Patents
- Publication number
- CN114495935A (application number CN202210235784.0A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- voice
- vector
- information
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The application relates to the field of voice control of intelligent devices, and specifically discloses a voice control method, a voice control system and an electronic device for an intelligent device, in which physiological signals are used to assist emotion recognition. The voice information is subjected to word segmentation and word embedding and, together with the emotion information, is passed through a context-based semantic understanding model to obtain respective feature vectors; the emotion feature vector is then taken as a reference vector and, by applying a decoder attention mechanism, the sequence of voice feature vectors is processed into context feature vectors based on the emotion feature vector, so that the subsequently classified control instructions are more accurate. In this way, intelligent household appliances can be controlled more intelligently, bringing a more comfortable use experience to users.
Description
Technical Field
The present invention relates to the field of voice control of smart devices, and more particularly, to a voice control method and system for a smart device and an electronic device.
Background
With rapid economic development and the continuous improvement of living standards, more and more electronic devices, typified by household appliances, have entered ordinary households. These devices meet people's various needs, enrich daily life and greatly improve quality of life.
However, operating these appliances still requires frequent manual switching, which is inconvenient and falls short of what a smart home should offer.
Some manufacturers address smart-home control through voice control, that is, by recognizing a voice instruction given by the user and then executing the corresponding control. However, some problems remain. One is that existing voice control uses only the semantic information of the speech and ignores the manner in which it is expressed, for example its emotion information: when a user asks a smart speaker to increase the volume, the speaker raises the volume only by a default amount and cannot adjust how much to raise it based on the emotion in the user's voice.
Another problem is that it is difficult to extract emotion information from voice control commands, because users have different expression habits and timbre characteristics and there is interference from other external sounds. Therefore, in order to control household appliances more intelligently and bring a better experience to users, a voice control scheme for intelligent devices is desired.
Disclosure of Invention
The present application is proposed to solve the above technical problems. The embodiments of the application provide a voice control method and system for an intelligent device, and an electronic device, in which a physiological signal is used to assist emotion recognition: the voice information is subjected to word segmentation and word embedding and, together with the emotion information, is passed through a context-based semantic understanding model to obtain respective feature vectors; the emotion feature vector is then used as a reference vector and, by applying a decoder attention mechanism, the sequence of voice feature vectors is processed into context feature vectors based on the emotion feature vector, so that the subsequently classified control instructions are more accurate. In this way, intelligent household appliances can be controlled more intelligently, bringing a more comfortable use experience to users.
According to an aspect of the present application, there is provided a voice control method of a smart device, including:
acquiring voice information used by a user for controlling intelligent equipment and emotion information of the user when applying the voice information;
performing word segmentation on the voice information, enabling each word after word segmentation to pass through a word embedding model to obtain a sequence of input vectors, and enabling the emotion information to pass through the word embedding model to obtain emotion vectors;
respectively passing the sequence of input vectors and the emotion vector through a converter-based encoder model to obtain a sequence of speech feature vectors and an emotion feature vector;
calculating a similarity between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors to obtain an emotion context vector;
inputting each voice feature vector in the sequence of voice feature vectors into a decoder model to obtain emotion label classification probability vectors consisting of emotion label classification probability values corresponding to each voice feature vector;
calculating a masked converter value vector based on a converter structure of the encoder model between the emotion context vector and the emotion tag classification probability vector;
taking the mask converter value of each position in the mask converter value vector as a weighting coefficient, and weighting each voice feature vector in the voice feature vector sequence to obtain a weighted voice feature vector sequence;
splicing each weighted voice characteristic vector in the sequence of the weighted voice characteristic vectors into a classification characteristic vector;
passing the classified feature vectors through a classifier to obtain emotion labels of the voice information; and
determining the type of the control instruction for the intelligent device based on the emotion label and the voice information.
In the voice control method of the intelligent device, acquiring the voice information used by the user to control the intelligent device and the emotion information of the user when applying the voice information includes: acquiring voice data used by the user for controlling the intelligent equipment; and performing voice recognition on the voice data to obtain the voice information.
In the voice control method of the intelligent device, acquiring the voice information used by the user to control the intelligent device and the emotion information of the user when applying the voice information includes: obtaining a physiological signal of the user when the voice data is sent out; performing Fourier transform-based feature extraction on the physiological signal to obtain a physiological vector; encoding the physiological vector using an encoder comprising a fully-connected layer and a one-dimensional convolutional layer to obtain a physiological feature vector; and passing the physiological feature vector through a classifier to obtain emotion category information as the emotion information.
In the above voice control method of an intelligent device, calculating a similarity between the emotion feature vector and each voice feature vector in the sequence of voice feature vectors to obtain an emotion context vector, includes: calculating an L2 distance between the emotional feature vector and each speech feature vector in the sequence of speech feature vectors as the similarity to obtain the emotional context vector.
In the above voice control method of an intelligent device, calculating a similarity between the emotion feature vector and each voice feature vector in the sequence of voice feature vectors to obtain an emotion context vector, includes: calculating a cosine distance between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors as the similarity to obtain the emotion context vector.
In the above method for controlling speech of a smart device, calculating a vector of masked converter values based on a converter structure of the encoder model between the emotion context vector and the emotion label classification probability vector includes: calculating a masked converter value vector based on the encoder model's converter structure between the emotion context vector and the emotion label classification probability vector in the following formula;
the formula is:
where Vc is the emotion context vector, Vi is the emotion label classification probability vector, d is the distance between Vc and Vi, and M indicates whether a mask exists in the encoding process of each voice feature vector: if a mask exists, M takes the value α, otherwise it takes the value −α.
In the above voice control method of the smart device, passing the classification feature vector through a classifier to obtain an emotion label of the voice information, includes: inputting the classification feature vector into a Softmax classification function of the classifier to obtain probability values that the classification feature vector belongs to respective emotion labels; and determining the emotion label with the maximum probability value as the emotion label of the voice information.
According to another aspect of the present application, there is provided a voice control system of a smart device, including:
the intelligent device comprises an information acquisition unit, a processing unit and a processing unit, wherein the information acquisition unit is used for acquiring voice information used by a user for controlling the intelligent device and emotion information of the user when the voice information is applied;
the word embedding model processing unit is used for carrying out word segmentation on the voice information obtained by the information obtaining unit, enabling each word after word segmentation to pass through a word embedding model to obtain a sequence of input vectors, and enabling the emotion information obtained by the information obtaining unit to pass through the word embedding model to obtain emotion vectors;
an encoder processing unit, configured to pass the sequence of input vectors obtained by the word embedding model processing unit and the emotion vectors obtained by the word embedding model processing unit through a converter-based encoder model to obtain a sequence of speech feature vectors and emotion feature vectors, respectively;
a similarity calculation unit configured to calculate a similarity between the emotion feature vector obtained by the encoder processing unit and each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit to obtain an emotion context vector;
a decoder processing unit, configured to input each voice feature vector in the sequence of voice feature vectors obtained by the encoder processing unit into a decoder model to obtain an emotion tag classification probability vector composed of emotion tag classification probability values corresponding to each voice feature vector;
a mask converter value vector generating unit for calculating a mask converter value vector of the converter structure based on the encoder model between the emotion context vector obtained by the similarity calculation unit and the emotion label classification probability vector obtained by the decoder processing unit;
a weighting unit configured to weight each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit with a mask converter value at each position in the mask converter value vector obtained by the mask converter value vector generating unit as a weighting coefficient to obtain a sequence of weighted speech feature vectors;
the splicing unit is used for splicing each weighted voice feature vector in the sequence of the weighted voice feature vectors obtained by the weighting unit into a classified feature vector;
the classification unit is used for enabling the classification characteristic vectors obtained by the splicing unit to pass through a classifier so as to obtain emotion labels of the voice information; and
and the instruction type determining unit is used for determining the type of the control instruction for the intelligent equipment based on the emotion label and the voice information obtained by the classifying unit.
According to yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory in which are stored computer program instructions which, when executed by the processor, cause the processor to perform the method of voice control of a smart device as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the method of voice control of a smart device as described above.
Compared with the prior art, the voice control method, voice control system and electronic device of the intelligent device provided by the application use physiological signals to assist emotion recognition: the voice information is subjected to word segmentation and word embedding and, together with the emotion information, is passed through a context-based semantic understanding model to obtain respective feature vectors; the emotion feature vector is then used as a reference vector and, by applying a decoder attention mechanism, the sequence of voice feature vectors is processed into context feature vectors based on the emotion feature vector, so that the subsequently classified control instructions are more accurate. In this way, intelligent household appliances can be controlled more intelligently, bringing a more comfortable use experience to users.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is an application scenario diagram of a voice control method of an intelligent device according to an embodiment of the present application;
FIG. 2 is a flow chart of a voice control method of a smart device according to an embodiment of the present application;
fig. 3 is a system architecture diagram illustrating a voice control method of an intelligent device according to an embodiment of the present application;
fig. 4 is a flowchart of acquiring voice information used by a user to control an intelligent device and emotion information of the user when applying the voice information in a voice control method of the intelligent device according to an embodiment of the present application;
FIG. 5 is a block diagram of a voice control system of a smart device according to an embodiment of the present application;
fig. 6 is a block diagram of an information obtaining unit in a voice control system of an intelligent device according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Overview of a scene
As described above, with rapid economic development and the continuous improvement of living standards, more and more electronic devices, typified by household appliances, have entered ordinary households. These devices meet people's various needs, enrich daily life and greatly improve quality of life.
However, operating these appliances still requires frequent manual switching, which is inconvenient and falls short of what a smart home should offer.
Some manufacturers address smart-home control through voice control, that is, by recognizing a voice instruction given by the user and then executing the corresponding control. However, some problems remain. One is that existing voice control uses only the semantic information of the speech and ignores the manner in which it is expressed, for example its emotion information: when a user asks a smart speaker to increase the volume, the speaker raises the volume only by a default amount and cannot adjust how much to raise it based on the emotion in the user's voice.
Another problem is that it is difficult to extract emotion information from voice control commands, because users have different expression habits and timbre characteristics and there is interference from other external sounds. Therefore, in order to control household appliances more intelligently and bring a better experience to users, a voice control scheme for intelligent devices is desired.
Based on this, in the technical scheme of the application, first, voice information used by a user to control an intelligent device and emotion information of the user when the voice information is applied are obtained, then, word segmentation and word embedding are performed on the voice information to obtain a sequence of input vectors, then, emotion vectors are obtained based on the emotion information, and the sequence of voice feature vectors and the emotion feature vectors are obtained through a context-based semantic understanding model, for example, a Bert model based on a converter (transformer).
Then, in a decoder structure in which emotion labels are obtained from the sequence of speech feature vectors, the sequence of speech feature vectors is processed into context feature vectors based on emotion feature vectors using the emotion feature vectors as reference vectors and applying a decoder attention mechanism.
That is, first, the similarity between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors, for example, the euclidean distance, is calculated to obtain an emotion context vector. Then inputting each voice feature vector in the sequence of semantic feature vectors into a decoder to obtain an emotion label classification probability vector corresponding to each semantic feature vector (namely, a vector consisting of probability values belonging to each emotion label corresponding to each semantic feature vector), and calculating a mask converter value of the emotion context vector and the emotion label classification probability vector of each voice feature vector based on a converter structure of the encoder:
where Vc is the emotion context vector, Vi is the emotion label classification probability vector of each speech feature vector, d is the distance between Vc and Vi, and M indicates whether a mask exists in the encoding process of the speech feature vector: if a mask exists, M takes the value α, otherwise it takes the value −α.
In this way, a weighting coefficient for each speech feature vector can be obtained, and the speech feature vectors are then weighted and spliced based on these coefficients to obtain the classification feature vector. The weighting coefficients represent the decoder input, clipped based on the encoder states and the decoder's hidden state, and the classification feature vector represents a weighted sum of the encoder states based on the decoder's previous hidden state.
Then, classification is carried out based on the classification feature vector to obtain the emotion label of the voice information.
Based on this, the present application provides a voice control method for an intelligent device, which includes: acquiring voice information used by a user for controlling intelligent equipment and emotion information of the user when applying the voice information; performing word segmentation on the voice information, enabling each word after word segmentation to pass through a word embedding model to obtain a sequence of input vectors, and enabling the emotion information to pass through the word embedding model to obtain emotion vectors; respectively passing the sequence of input vectors and the emotion vector through a converter-based encoder model to obtain a sequence of speech feature vectors and an emotion feature vector; calculating a similarity between the emotion feature vector and each voice feature vector in the sequence of voice feature vectors to obtain an emotion context vector; inputting each voice feature vector in the sequence of voice feature vectors into a decoder model to obtain emotion label classification probability vectors consisting of emotion label classification probability values corresponding to each voice feature vector; calculating a masked converter value vector based on a converter structure of the encoder model between the emotion context vector and the emotion tag classification probability vector; taking the mask converter value of each position in the mask converter value vector as a weighting coefficient, and weighting each voice feature vector in the voice feature vector sequence to obtain a weighted voice feature vector sequence; splicing each weighted voice characteristic vector in the sequence of the weighted voice characteristic vectors into a classification characteristic vector; passing the classified feature vectors through a classifier to obtain emotion labels of the voice information; and determining the type of the control instruction for the intelligent equipment based on the emotion label and the voice information.
Fig. 1 illustrates an application scenario of a voice control method of a smart device according to an embodiment of the present application. As shown in fig. 1, in this application scenario, first, voice data used by a user (e.g., P as illustrated in fig. 1) to control a smart device is acquired by a voice receiving end of the smart device (e.g., T as illustrated in fig. 1), and a physiological signal of the user when the voice data is emitted is acquired by a wearable electronic device (e.g., H as illustrated in fig. 1) worn by the user. The intelligent device includes, but is not limited to, intelligent electronic devices such as an intelligent sound box and an intelligent electric lamp, the wearable electronic device includes, but is not limited to, an intelligent bracelet, an intelligent neck ring, and the like, and the physiological signal includes, but is not limited to, electrocardiograph signal data, pulse wave signal data, skin electrical signal data, and the like.
Then, the obtained voice data used by the user to control the smart device and the physiological signal of the user captured while the voice data is uttered are input into a server (for example, S as illustrated in fig. 1) on which a voice control algorithm of the smart device is deployed. The server processes the voice data and the physiological signal with the voice control algorithm to generate a control instruction type for the smart device, and the smart device is then controlled based on that control instruction type. In this way, the manner information expressed in the voice, e.g., emotion information, can be exploited during voice control; for example, when the user wants to increase the volume of the smart speaker, the user's physiological signal is analyzed to obtain emotion information, which is then used to decide how much the volume should be increased.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
Fig. 2 illustrates a flow chart of a voice control method of a smart device. As shown in fig. 2, a voice control method of an intelligent device according to an embodiment of the present application includes: s110, acquiring voice information used by a user for controlling intelligent equipment and emotion information of the user when applying the voice information; s120, performing word segmentation processing on the voice information, enabling each word after word segmentation to pass through a word embedding model to obtain a sequence of input vectors, and enabling the emotion information to pass through the word embedding model to obtain emotion vectors; s130, respectively passing the sequence of the input vectors and the emotion vectors through a converter-based encoder model to obtain a sequence of voice feature vectors and emotion feature vectors; s140, calculating the similarity between the emotion feature vector and each voice feature vector in the sequence of the voice feature vectors to obtain an emotion context vector; s150, inputting each voice feature vector in the voice feature vector sequence into a decoder model to obtain an emotion label classification probability vector consisting of emotion label classification probability values corresponding to each voice feature vector; s160, calculating a mask converter value vector of the converter structure based on the encoder model between the emotion context vector and the emotion label classification probability vector; s170, taking the mask converter value of each position in the mask converter value vector as a weighting coefficient, and weighting each speech feature vector in the sequence of speech feature vectors to obtain a sequence of weighted speech feature vectors; s180, splicing each weighted voice feature vector in the sequence of the weighted voice feature vectors into a classified feature vector; s190, enabling the classified feature vectors to pass through a classifier to obtain emotion labels of the voice information; and S200, determining the type of the control instruction for the intelligent equipment based on the emotion label and the voice information.
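To make the flow of steps S110 to S200 easier to follow, the following is a minimal Python/PyTorch sketch of the pipeline. It is illustrative only: the function names, tensor shapes and the masked_transformer_values callable are assumptions, since the embodiment does not disclose a concrete implementation and the exact masked converter formula is not reproduced in this text.

```python
import torch

def classify_control_instruction(word_ids, emotion_id,
                                 embedding, encoder, decoder, classifier,
                                 masked_transformer_values):
    # S120: word embedding of the segmented utterance and of the emotion category
    input_vectors = embedding(word_ids)            # (seq_len, embed_dim)
    emotion_vector = embedding(emotion_id)         # (1, embed_dim)

    # S130: converter (transformer) based encoder applied to both inputs
    speech_features = encoder(input_vectors)       # (seq_len, dim)
    emotion_feature = encoder(emotion_vector).squeeze(0)   # (dim,)

    # S140: similarity between the emotion feature and every speech feature vector
    emotion_context = torch.stack(
        [torch.dist(emotion_feature, v) for v in speech_features])     # (seq_len,)

    # S150: per-position emotion-label classification probabilities from the decoder
    label_probs = decoder(speech_features)         # (seq_len, num_labels)

    # S160-S170: masked converter values used as weighting coefficients
    # (placeholder callable: the exact formula is not reproduced in this text)
    weights = masked_transformer_values(emotion_context, label_probs)  # (seq_len,)
    weighted = speech_features * weights.unsqueeze(-1)

    # S180-S190: splice the weighted vectors and classify into an emotion label
    classification_vector = weighted.reshape(-1)
    emotion_label = classifier(classification_vector).argmax(dim=-1)

    # S200: the control instruction type is then chosen from the emotion label
    # together with the recognized voice text (not shown here)
    return emotion_label
```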
Fig. 3 illustrates an architecture diagram of a voice control method of a smart device according to an embodiment of the present application. As shown in fig. 3, in the network architecture of the voice control method of the smart device, first, the obtained voice information (e.g., P1 as illustrated in fig. 3) is subjected to word segmentation processing and each word after word segmentation is passed through a word embedding model (e.g., WEM as illustrated in fig. 3) to obtain a sequence of input vectors (e.g., V1 as illustrated in fig. 3), and the obtained emotion information (e.g., P2 as illustrated in fig. 3) is passed through the word embedding model to obtain emotion vectors (e.g., V2 as illustrated in fig. 3); then, passing the sequence of input vectors and the emotion vector through a converter-based encoder model (e.g., E as illustrated in fig. 3) to obtain a sequence of speech feature vectors (e.g., VF1 as illustrated in fig. 3) and emotion feature vectors (e.g., VF2 as illustrated in fig. 3), respectively; then, calculating a similarity between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors to obtain an emotion context vector (e.g., as illustrated in fig. 3 as V3); then, inputting each voice feature vector in the sequence of voice feature vectors into a decoder model (e.g., D as illustrated in fig. 3) to obtain emotion label classification probability vectors (e.g., V4 as illustrated in fig. 3) composed of emotion label classification probability values corresponding to each of the voice feature vectors; then, a masked converter value vector based on the converter structure of the encoder model between the emotion context vector and the emotion label classification probability vector is calculated (e.g., as illustrated in fig. 3 as V5); then, weighting each speech feature vector in the sequence of speech feature vectors with the mask converter value of each position in the mask converter value vector as a weighting coefficient to obtain a sequence of weighted speech feature vectors (e.g., VF3 as illustrated in fig. 3); then, concatenating each weighted speech feature vector of the sequence of weighted speech feature vectors into a classification feature vector (e.g., VF as illustrated in fig. 3); then, passing the classified feature vector through a classifier (e.g., circle S as illustrated in fig. 3) to obtain an emotion label of the voice information; and finally, determining the type of the control instruction for the intelligent equipment based on the emotion label and the voice information.
In step S110, voice information used by a user to control a smart device and emotion information of the user when applying the voice information are acquired. As described above, existing voice control uses only the semantic information of the voice and ignores the manner information expressed in it, for example emotion information: when a user wants to increase the volume of a smart speaker and says so, the speaker increases the volume only by a default amount and cannot adjust how much to increase it based on the emotion in the user's voice. Therefore, in the technical scheme of the application, when the user performs voice control on the intelligent device, the control is expected to combine the emotion information with the understanding of the voice semantic information, so that controlling the intelligent device gives a better experience.
Accordingly, in a specific example, it is first required to obtain, by a voice receiving end of a smart device, voice data used by a user to control the smart device, and obtain, by a wearable electronic device worn by the user, a physiological signal of the user when the user sends out the voice data. The intelligent device includes, but is not limited to, intelligent electronic devices such as an intelligent sound box and an intelligent electric lamp, the wearable electronic device includes, but is not limited to, an intelligent bracelet, an intelligent neck ring, and the like, and the physiological signal includes, but is not limited to, electrocardiograph signal data, pulse wave signal data, skin electrical signal data, and the like. It should be understood that, because the expression habits of the respective users are different, the tone characteristics are different, and there is also interference from other external sounds, it is difficult to extract the emotion information in the voice control instruction, and therefore, in the technical solution of the present application, the physiological signal is used to perform auxiliary emotion recognition to control how much the volume of the voice is increased.
Specifically, in the embodiment of the present application, the process of acquiring the voice information used by the user to control the smart device and the emotion information of the user when applying the voice information includes: first, voice data used by the user to control the smart device is obtained, the smart device including but not limited to smart electronic devices such as smart speakers and smart lights. Then, voice recognition is carried out on the voice data to obtain the voice information. Then, the physiological signal of the user when uttering the voice data is obtained, where the physiological signal includes but is not limited to electrocardiogram signal data, pulse wave signal data, galvanic skin response data and the like. Then, Fourier-transform-based feature extraction is performed on the physiological signal to obtain a physiological vector; the Fourier transform maps the time-domain information of the physiological signal into frequency-domain information, allowing features of the physiological signal to be extracted more effectively. Then, the physiological vector is encoded by an encoder comprising a fully-connected layer and a one-dimensional convolutional layer, so as to extract high-dimensional implicit features of the feature values at each position of the physiological vector as well as high-dimensional implicit features among those feature values, thereby obtaining the physiological feature vector. Finally, the physiological feature vector is passed through a classifier to obtain emotion category information as the emotion information.
Fig. 4 illustrates a flowchart of acquiring voice information used by a user to control a smart device and emotion information of the user when applying the voice information in a voice control method of the smart device according to an embodiment of the present application. As shown in fig. 4, in the embodiment of the present application, acquiring voice information used by a user to control a smart device and emotion information of the user when applying the voice information includes: s210, acquiring a physiological signal of the user when the user sends the voice data; s220, performing Fourier transform-based feature extraction on the physiological signal to obtain a physiological vector; s230, encoding the physiological vector by using an encoder comprising a full connection layer and a one-dimensional convolution layer to obtain a physiological characteristic vector; s240, the physiological characteristic vector is processed through a classifier to obtain emotion category information as the emotion information.
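A minimal sketch of steps S210 to S240 is given below, assuming PyTorch; the signal length, layer sizes and number of emotion categories are illustrative assumptions and not values taken from the embodiment.

```python
import torch
import torch.nn as nn

class PhysiologicalEmotionClassifier(nn.Module):
    def __init__(self, signal_len=1024, hidden_dim=256, num_emotions=6):
        super().__init__()
        freq_bins = signal_len // 2 + 1            # one-sided FFT spectrum length
        self.fc = nn.Linear(freq_bins, hidden_dim)              # fully-connected layer
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)   # one-dimensional convolution
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, signal):                     # signal: (batch, signal_len)
        # S220: Fourier-transform-based feature extraction (time domain -> frequency domain)
        spectrum = torch.fft.rfft(signal).abs()    # (batch, freq_bins)
        # S230: encode the physiological vector into a physiological feature vector
        features = self.fc(spectrum)
        features = self.conv(features.unsqueeze(1)).squeeze(1)
        # S240: classify into emotion category probabilities
        return self.classifier(features).softmax(dim=-1)
```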
In steps S120 and S130, the speech information is segmented into words and each segmented word is passed through a word embedding model to obtain a sequence of input vectors, the emotion information is passed through the word embedding model to obtain an emotion vector, and the sequence of input vectors and the emotion vector are then passed through a converter-based encoder model to obtain a sequence of speech feature vectors and an emotion feature vector, respectively. That is, in the technical solution of the present application, after the voice information and the emotion information are obtained, word segmentation is performed on the voice information to prevent semantic confusion and thereby improve the accuracy of subsequent semantic understanding. Each segmented word and the emotion information are then processed by the word embedding model to obtain the sequence of input vectors and the emotion vector. The sequence of input vectors and the emotion vector are further passed, respectively, through a context-based semantic understanding model, such as a converter (transformer)-based Bert model, to extract context-based semantic features and obtain the sequence of speech feature vectors and the emotion feature vector.
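As a rough illustration of this word-embedding and encoder stage, the sketch below uses a generic PyTorch transformer encoder in place of the Bert model mentioned above; the vocabulary size, embedding dimension and token ids are hypothetical.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30000, 128                       # hypothetical sizes
embedding = nn.Embedding(vocab_size, embed_dim)
encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Hypothetical token ids for the segmented utterance and the emotion category
word_ids = torch.tensor([[12, 845, 97, 3021]])           # (batch=1, seq_len=4)
emotion_id = torch.tensor([[7]])                         # (batch=1, 1)

speech_feature_vectors = encoder(embedding(word_ids))    # (1, seq_len, embed_dim)
emotion_feature_vector = encoder(embedding(emotion_id))  # (1, 1, embed_dim)
```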
In step S140, a similarity between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors is calculated to obtain an emotion context vector. It should be understood that in the decoder structure that obtains the emotion tag through the sequence of the speech feature vectors, the sequence of the speech feature vectors is processed into the context feature vectors based on the emotion feature vectors by using the emotion feature vectors as reference vectors and applying a decoder attention mechanism. That is, in the technical solution of the present application, it is first required to calculate a similarity between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors, so as to obtain an emotion context vector.
Accordingly, in one particular example, an L2 distance between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors may be calculated as the similarity to obtain the emotion context vector. It should be understood that the L2 distance function, also known as the least squares error (LSE) and closely related to the Euclidean distance, is the sum of the squares of the differences between the target and estimated values, expressed by the formula D = Σi |xi − yi|², where xi denotes the feature value at each position of the emotion feature vector and yi denotes the feature value at the corresponding position of each speech feature vector in the sequence of speech feature vectors. Calculating the L2 distance between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors reflects, in a numerical dimension, the degree of feature difference between them.
In particular, in another specific example, a cosine distance between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors may also be calculated as the similarity to obtain the emotion context vector. It should be understood that the cosine similarity is the cosine of the angle between two vectors; that is, for two vectors A and B, the cosine similarity is defined as cos(θ) = (A·B) / (‖A‖‖B‖). The cosine distance is obtained by subtracting the cosine similarity from 1; since the cosine similarity takes values in [−1, 1], the cosine distance takes values in [0, 2].
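Both similarity options described above can be sketched as follows (a non-authoritative PyTorch example; the per-position values are collected into the emotion context vector):

```python
import torch
import torch.nn.functional as F

def emotion_context_vector(emotion_feature, speech_features, metric="l2"):
    # emotion_feature: (dim,), speech_features: (seq_len, dim)
    if metric == "l2":
        # D = sum_i |x_i - y_i|^2, computed against every speech feature vector
        return ((speech_features - emotion_feature) ** 2).sum(dim=-1)
    # cosine distance = 1 - cosine similarity, ranging over [0, 2]
    return 1.0 - F.cosine_similarity(
        speech_features, emotion_feature.unsqueeze(0), dim=-1)
```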
In steps S150 and S160, each voice feature vector in the sequence of voice feature vectors is input into a decoder model to obtain an emotion tag classification probability vector consisting of emotion tag classification probability values corresponding to each voice feature vector, and a masked converter value vector based on a converter structure of the encoder model between the emotion context vector and the emotion tag classification probability vector is calculated. It should be understood that, after obtaining the emotion context vector, each speech feature vector in the sequence of semantic feature vectors is input into a decoder model for processing, so as to obtain an emotion tag classification probability vector corresponding to each semantic feature vector, where the emotion tag classification probability vector is a vector composed of probability values belonging to each emotion tag and corresponding to each semantic feature vector. Then, calculating a mask converter value vector based on a converter structure of an encoder between the emotion context vector and the emotion label classification probability vector of each voice feature vector so as to perform weighting processing on each voice feature vector in the sequence of the voice feature vectors, thereby enabling the emotion label of the obtained voice information to be more accurate.
Specifically, in the embodiment of the present application, the process of calculating a mask converter value vector based on the converter structure of the encoder model between the emotion context vector and the emotion label classification probability vector includes: calculating a masked converter value vector based on the encoder model's converter structure between the emotion context vector and the emotion label classification probability vector in the following formula;
the formula is:
where Vc is the emotion context vector, Vi is the emotion label classification probability vector of each of said speech feature vectors, d is the distance between Vc and Vi, and M indicates whether a mask exists in the encoding process of each voice feature vector: if a mask exists, M takes the value α, otherwise it takes the value −α.
In steps S170 and S180, each speech feature vector in the sequence of speech feature vectors is weighted, using the mask converter value at each position in the mask converter value vector as the weighting coefficient, to obtain a sequence of weighted speech feature vectors, and each weighted speech feature vector in that sequence is spliced into a classification feature vector. That is, in the technical solution of the present application, after the mask converter value vector is obtained, the mask converter value at each position is taken as a weighting coefficient, and each speech feature vector in the sequence of speech feature vectors is weighted accordingly to obtain the sequence of weighted speech feature vectors. Each weighted speech feature vector in that sequence is then spliced into a classification feature vector to facilitate subsequent classification. It should be understood that, in this way, the subsequent classification result fuses the user's emotion information on top of the semantic understanding of the speech information, so that the control of the intelligent device is more accurate. A small sketch of these two steps follows below.
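The sketch below illustrates the weighting and splicing steps, under the assumption that the speech feature vectors are stacked into a (seq_len, dim) tensor and the mask converter values into a (seq_len,) tensor.

```python
import torch

def build_classification_vector(speech_features, mask_converter_values):
    # speech_features: (seq_len, dim); mask_converter_values: (seq_len,)
    weighted = speech_features * mask_converter_values.unsqueeze(-1)
    # splice (concatenate) the weighted vectors into one classification feature vector
    return weighted.reshape(-1)     # (seq_len * dim,)
```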
In steps S190 and S200, the classification feature vector is passed through a classifier to obtain the emotion label of the voice information, and the control instruction type for the smart device is determined based on the emotion label and the voice information. Accordingly, in one specific example, the classification feature vector is first input into the Softmax classification function of the classifier to obtain the probability values that it belongs to the respective emotion labels; the emotion label with the maximum probability value is then determined to be the emotion label of the voice information; and finally the type of the control instruction for the intelligent device is determined based on the emotion label and the voice information. In this way, the manner information, e.g., emotion information, expressed in the voice can be utilized during voice control: for example, when the user wants to increase the volume of the smart speaker, the user's physiological signal is analyzed to obtain emotion information, which is used to control how much the volume is increased, bringing a more comfortable use experience.
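The final classification and instruction-type decision might look like the following sketch; the emotion label set, the classifier input size and the volume-step rule are purely illustrative assumptions.

```python
import torch
import torch.nn as nn

emotion_labels = ["neutral", "happy", "impatient", "angry"]   # assumed label set
classifier = nn.Linear(512, len(emotion_labels))              # 512 = assumed feature size

def decide_instruction(classification_vector, voice_text):
    # Softmax over emotion labels, then take the label with the highest probability
    probs = classifier(classification_vector).softmax(dim=-1)
    label = emotion_labels[probs.argmax().item()]
    # Illustrative decision rule: raise the volume more for an agitated user
    step = 4 if label in ("impatient", "angry") else 2
    return {"command": voice_text, "emotion": label, "volume_step": step}
```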
In summary, the voice control method of the smart device according to the embodiment of the present application has been elucidated. It uses a physiological signal to assist emotion recognition: the voice information is subjected to word segmentation and word embedding and, together with the emotion information, is passed through a context-based semantic understanding model to obtain respective feature vectors; the emotion feature vector is then used as a reference vector and, by applying a decoder attention mechanism, the sequence of speech feature vectors is processed into context feature vectors based on the emotion feature vector, so that the subsequently classified control instructions are more accurate. In this way, intelligent household appliances can be controlled more intelligently, bringing a more comfortable use experience to users.
Exemplary System
FIG. 5 illustrates a block diagram of a voice control system of a smart device according to an embodiment of the application. As shown in fig. 5, a speech control system 500 of a smart device according to an embodiment of the present application includes: an information obtaining unit 510, configured to obtain voice information used by a user to control a smart device and emotion information of the user when applying the voice information; a word embedding model processing unit 520, configured to perform word segmentation on the speech information obtained by the information obtaining unit 510, pass each word after word segmentation through a word embedding model to obtain a sequence of input vectors, and pass the emotion information obtained by the information obtaining unit 510 through the word embedding model to obtain emotion vectors; an encoder processing unit 530 for passing the sequence of input vectors obtained by the word embedding model processing unit 520 and the emotion vectors obtained by the word embedding model processing unit 520 through a converter-based encoder model to obtain a sequence of speech feature vectors and emotion feature vectors, respectively; a similarity calculation unit 540, configured to calculate a similarity between the emotion feature vector obtained by the encoder processing unit 530 and each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit 530 to obtain an emotion context vector; a decoder processing unit 550, configured to input each voice feature vector in the sequence of voice feature vectors obtained by the encoder processing unit 530 into a decoder model to obtain an emotion tag classification probability vector composed of emotion tag classification probability values corresponding to each voice feature vector; a mask converter value vector generating unit 560 for calculating a mask converter value vector of the converter structure based on the encoder model between the emotion context vector obtained by the similarity calculating unit 540 and the emotion label classification probability vector obtained by the decoder processing unit 550; a weighting unit 570, configured to weight each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit 530 with a masking converter value at each position in the masking converter value vector obtained by the masking converter value vector generating unit 560 as a weighting coefficient to obtain a sequence of weighted speech feature vectors; a splicing unit 580, configured to splice each weighted speech feature vector in the sequence of weighted speech feature vectors obtained by the weighting unit 570 into a classification feature vector; a classification unit 590, configured to pass the classification feature vector obtained by the splicing unit 580 through a classifier to obtain an emotion label of the voice information; and an instruction type determining unit 600, configured to determine a control instruction type for the smart device based on the emotion label and the voice information obtained by the classifying unit 590.
In an example, in the voice control system 500 of the smart device, the information obtaining unit 510 is further configured to: acquiring voice data used by the user for controlling the intelligent equipment; and performing voice recognition on the voice data to obtain the voice information.
In an example, in the voice control system 500 of the intelligent device, as shown in fig. 6, the information obtaining unit 510 includes: a physiological signal acquiring subunit 511, configured to acquire a physiological signal of the user when the user utters the voice data; a physiological vector generation subunit 512, configured to perform feature extraction based on fourier transform on the physiological signal obtained by the physiological signal obtaining subunit 511 to obtain a physiological vector; a physiological feature vector generation subunit 513 configured to encode the physiological vector obtained by the physiological vector generation subunit 512 using an encoder including a fully-connected layer and a one-dimensional convolutional layer to obtain a physiological feature vector; and an emotion category information generation subunit 514 configured to pass the physiological feature vector obtained by the physiological feature vector generation subunit 513 through a classifier to obtain emotion category information as the emotion information.
In an example, in the voice control system 500 of the smart device, the similarity calculating unit 540 is further configured to: calculating an L2 distance between the emotional feature vector and each speech feature vector in the sequence of speech feature vectors as the similarity to obtain the emotional context vector.
In an example, in the voice control system 500 of the smart device, the similarity calculating unit 540 is further configured to: calculating a cosine distance between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors as the similarity to obtain the emotion context vector.
In an example, in the voice control system 500 of the intelligent device, the mask converter value vector generating unit 560 is further configured to: calculating a masked converter value vector based on the encoder model's converter structure between the emotion context vector and the emotion label classification probability vector in the following formula;
the formula is:
where Vc is the emotion context vector, Vi is the emotion label classification probability vector, d is the distance between Vc and Vi, and M indicates whether a mask exists in the encoding process of each voice feature vector: if a mask exists, M takes the value α, otherwise it takes the value −α.
In an example, in the voice control system 500 of the smart device, the classifying unit 590 is further configured to: inputting the classification feature vector into a Softmax classification function of the classifier to obtain probability values that the classification feature vector belongs to respective emotion labels; and determining the emotion label with the maximum probability value as the emotion label of the voice information.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the voice control system 500 of the smart device described above have been described in detail in the above description of the voice control method of the smart device with reference to fig. 1 to 4, and thus, a repetitive description thereof will be omitted.
As described above, the voice control system 500 of the smart device according to the embodiment of the present application may be implemented in various terminal devices, such as a server running a voice control algorithm for the smart device. In one example, the voice control system 500 of the smart device according to the embodiment of the present application may be integrated into the terminal device as a software module and/or a hardware module. For example, the voice control system 500 of the smart device may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the voice control system 500 of the smart device may also be one of the many hardware modules of the terminal device.
Alternatively, in another example, the voice control system 500 of the smart device and the terminal device may be separate devices, and the voice control system 500 of the smart device may be connected to the terminal device through a wired and/or wireless network and exchange interaction information in an agreed data format.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present application is described with reference to FIG. 7. As shown in FIG. 7, the electronic device 10 includes one or more processors 11 and a memory 12. The processor 11 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 10 to perform desired functions.
In one example, the electronic device 10 may further include: an input system 13 and an output system 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input system 13 may comprise, for example, a keyboard, a mouse, etc.
The output system 14 may output various kinds of information, including the determined control instruction type, to the outside. The output system 14 may include, for example, a display, a speaker, a printer, as well as remote output devices connected via a communication network, and so on.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in FIG. 7, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and devices, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the functions of the voice control method of a smart device according to various embodiments of the present application described in the "exemplary methods" section of this specification above.
The computer program product may include program code for performing the operations of embodiments of the present application, written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice control method of a smart device described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above with reference to specific embodiments. However, it should be noted that the advantages and effects mentioned in the present application are only examples, not limitations, and should not be regarded as required of every embodiment of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and ease of understanding only, and is not intended to be exhaustive or to limit the application to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as, but not limited to."
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (10)
1. A voice control method of a smart device, characterized by comprising the following steps:
acquiring voice information used by a user for controlling the smart device and emotion information of the user when applying the voice information;
performing word segmentation on the voice information, enabling each word after word segmentation to pass through a word embedding model to obtain a sequence of input vectors, and enabling the emotion information to pass through the word embedding model to obtain emotion vectors;
respectively passing the sequence of input vectors and the emotion vector through a converter-based encoder model to obtain a sequence of speech feature vectors and an emotion feature vector;
calculating a similarity between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors to obtain an emotion context vector;
inputting each speech feature vector in the sequence of speech feature vectors into a decoder model to obtain an emotion label classification probability vector composed of the emotion label classification probability values corresponding to each speech feature vector;
calculating a mask converter value vector based on the converter structure of the encoder model between the emotion context vector and the emotion label classification probability vector;
weighting each speech feature vector in the sequence of speech feature vectors with the mask converter value at each position in the mask converter value vector as a weighting coefficient to obtain a sequence of weighted speech feature vectors;
splicing each weighted speech feature vector in the sequence of weighted speech feature vectors into a classification feature vector;
passing the classification feature vector through a classifier to obtain an emotion label of the voice information; and
determining a control instruction type for the smart device based on the emotion label and the voice information.
2. The voice control method of the smart device according to claim 1, wherein the acquiring of the voice information of the user for controlling the smart device and the emotion information of the user when applying the voice information comprises:
acquiring voice data used by the user for controlling the smart device; and
performing voice recognition on the voice data to obtain the voice information.
3. The voice control method of the smart device according to claim 2, wherein the acquiring of the voice information of the user for controlling the smart device and the emotion information of the user when applying the voice information comprises:
obtaining a physiological signal of the user when the user utters the voice data;
performing Fourier transform-based feature extraction on the physiological signal to obtain a physiological vector;
encoding the physiological vector using an encoder comprising a fully-connected layer and a one-dimensional convolutional layer to obtain a physiological feature vector; and
passing the physiological feature vector through a classifier to obtain emotion category information as the emotion information.
4. The voice control method of a smart device according to claim 3, wherein calculating a similarity between the emotional feature vector and each voice feature vector in the sequence of voice feature vectors to obtain an emotional context vector comprises:
calculating an L2 distance between the emotional feature vector and each speech feature vector in the sequence of speech feature vectors as the similarity to obtain the emotional context vector.
5. The voice control method of a smart device according to claim 3, wherein calculating a similarity between the emotional feature vector and each voice feature vector in the sequence of voice feature vectors to obtain an emotional context vector comprises:
calculating a cosine distance between the emotion feature vector and each speech feature vector in the sequence of speech feature vectors as the similarity to obtain the emotion context vector.
6. The voice control method of the smart device according to claim 4 or 5, wherein calculating a mask converter value vector based on the converter structure of the encoder model between the emotion context vector and the emotion label classification probability vector comprises:
calculating the mask converter value vector based on the converter structure of the encoder model between the emotion context vector and the emotion label classification probability vector according to the following formula;
the formula is:
where Vc is the emotion context vector, Vi is the emotion label classification probability vector, d is the dimension of Vc and Vi, and M represents whether a mask exists in the encoding process of each speech feature vector, taking the value alpha if a mask exists and -alpha otherwise.
7. The voice control method of the smart device according to claim 6, wherein passing the classification feature vector through a classifier to obtain an emotion label of the voice information comprises:
inputting the classification feature vector into a Softmax classification function of the classifier to obtain probability values that the classification feature vector belongs to respective emotion labels; and
determining the emotion label with the maximum probability value as the emotion label of the voice information.
8. A voice control system for a smart device, comprising:
an information obtaining unit, configured to obtain voice information used by a user to control the smart device and emotion information of the user when applying the voice information;
a word embedding model processing unit, configured to perform word segmentation on the voice information obtained by the information obtaining unit, pass each word after word segmentation through a word embedding model to obtain a sequence of input vectors, and pass the emotion information obtained by the information obtaining unit through the word embedding model to obtain an emotion vector;
an encoder processing unit, configured to pass the sequence of input vectors obtained by the word embedding model processing unit and the emotion vector obtained by the word embedding model processing unit through a converter-based encoder model to obtain a sequence of speech feature vectors and an emotion feature vector, respectively;
a similarity calculation unit, configured to calculate a similarity between the emotion feature vector obtained by the encoder processing unit and each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit to obtain an emotion context vector;
a decoder processing unit, configured to input each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit into a decoder model to obtain an emotion label classification probability vector composed of the emotion label classification probability values corresponding to each speech feature vector;
a mask converter value vector generating unit, configured to calculate a mask converter value vector based on the converter structure of the encoder model between the emotion context vector obtained by the similarity calculation unit and the emotion label classification probability vector obtained by the decoder processing unit;
a weighting unit, configured to weight each speech feature vector in the sequence of speech feature vectors obtained by the encoder processing unit with the mask converter value at each position in the mask converter value vector obtained by the mask converter value vector generating unit as a weighting coefficient to obtain a sequence of weighted speech feature vectors;
a splicing unit, configured to splice each weighted speech feature vector in the sequence of weighted speech feature vectors obtained by the weighting unit into a classification feature vector;
a classification unit, configured to pass the classification feature vector obtained by the splicing unit through a classifier to obtain an emotion label of the voice information; and
an instruction type determining unit, configured to determine a control instruction type for the smart device based on the emotion label obtained by the classification unit and the voice information.
9. The voice control system of the smart device according to claim 8, wherein the information obtaining unit is further configured to:
acquire voice data used by the user for controlling the smart device; and perform voice recognition on the voice data to obtain the voice information.
10. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a method of voice control of a smart device as claimed in any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210235784.0A | 2022-03-11 | 2022-03-11 | Voice control method and system of intelligent device and electronic device
Publications (1)
Publication Number | Publication Date
---|---
CN114495935A | 2022-05-13
Family
ID=81486758
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210235784.0A (Withdrawn) | Voice control method and system of intelligent device and electronic device | 2022-03-11 | 2022-03-11
Country Status (1)
Country | Link
---|---
CN (1) | CN114495935A
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
WO2024027010A1 * | 2022-08-01 | 2024-02-08 | Smart Lighting Holding Limited | Lighting control method, control system and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20220513