CN105761720B - Interactive system and method based on voice attribute classification - Google Patents

Interactive system and method based on voice attribute classification Download PDF

Info

Publication number
CN105761720B
CN105761720B (application CN201610244968.8A)
Authority
CN
China
Prior art keywords
voice
attribute
classification
signal
acoustic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610244968.8A
Other languages
Chinese (zh)
Other versions
CN105761720A (en)
Inventor
潘复平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201610244968.8A priority Critical patent/CN105761720B/en
Publication of CN105761720A publication Critical patent/CN105761720A/en
Application granted granted Critical
Publication of CN105761720B publication Critical patent/CN105761720B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques for comparison or discrimination for estimating an emotional state
    • G10L25/66 Speech or voice analysis techniques for comparison or discrimination for extracting parameters related to health condition

Abstract

The application discloses an interactive system and method based on voice attribute classification. The system comprises: an acoustic feature extraction unit configured to extract acoustic features of an input voice signal and generate a first signal; a voice attribute classification unit configured to determine voice attribute values of the first signal through attribute recognition classifiers, output voice attribute results, and generate a second signal; and an interaction decision unit configured to output feedback information based on the second signal. The voice attribute classification unit detects multiple voice attributes simultaneously and outputs corresponding feedback information according to the voice attribute values, making the interaction process rich and varied.

Description

Interactive system and method based on voice attribute classification
Technical Field
The present disclosure relates generally to the field of interaction, more particularly to human-computer interaction technology, and in particular to an interaction system based on voice attribute classification.
Background
In the conventional human-computer voice interaction process, a machine recognizes a voice command issued by a person and then responds according to the recognition result. The content of the interaction is limited to the literal meaning of the voice command, its form is monotonous, the user experience is dull, and it is not suitable for interactive scenarios such as toys and smart-home applications that call for lively and varied forms.
At present, human-computer interaction often relies on voiceprint registration to determine the identity of the user and thereby personalize the interaction. During voiceprint registration, the user's voice is enrolled using voiceprint recognition technology and the user's identity is associated with the voiceprint. During use, the speaker's voiceprint is recognized, the speaker's identity is determined from it, and limited interactive variations are then made according to that identity. For example, some intelligent toys can judge from the voice whether the speaker is dad, mom or the baby, and change how they address the speaker accordingly.
The prior art has two disadvantages: on one hand, the traditional technology can only detect a single voice attribute, so the variation of interactive content based on that attribute is very limited; on the other hand, voiceprint registration is cumbersome and inflexible to use.
Disclosure of Invention
In view of the above-mentioned shortcomings or drawbacks of the prior art, it is desirable to provide an interactive system based on voice attribute classification and a method thereof.
In a first aspect, an interactive system based on voice attribute classification is provided, the system comprising:
an acoustic feature extraction unit configured to extract acoustic features of an input voice signal and generate a first signal;
a voice attribute classification unit configured to determine voice attribute values of the first signal through attribute recognition classifiers, output voice attribute results, and generate a second signal;
and an interaction decision unit configured to output feedback information based on the second signal.
In a second aspect, an interactive method based on voice attribute classification is provided, the method comprising:
extracting acoustic features of an input voice signal to generate a first signal;
determining voice attribute values of the first signal through attribute recognition and classification, outputting voice attribute results, and generating a second signal;
and outputting feedback information based on the second signal.
According to the technical solution provided by the embodiments of the application, the voice attribute classification unit can detect multiple voice attributes of the speech simultaneously and output corresponding feedback information according to each voice attribute value, making the interaction process rich and varied. In addition, the invention can automatically judge the identity of the speaker through voice attribute classification, so no registration process is needed, and the system is simple, convenient and flexible to use.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a block diagram of an interactive system based on voice attribute classification, according to an embodiment.
FIG. 2 is a flow chart of an interaction method based on voice attribute classification.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In the voice interaction process, besides the textual content of the voice instruction, other voice attributes of the speech can also be recognized, and these attributes can be used to enrich the form and content of the interaction. Such voice attributes include the age range, gender, emotion and health condition of the speaker. Age and gender are reflected in the fundamental frequency and timbre of the speech; emotion is reflected in stress, intonation, speaking rate and pauses; the health condition is reflected in whether the voice is hoarse, whether it is accompanied by coughing, whether it is accompanied by nasal sounds, and so on. The same voice attribute exhibits a consistent distribution pattern across the voice signals of different speakers. For example, male voices have a low fundamental frequency with spectral energy mostly concentrated in the low-frequency region, while female voices have a high fundamental frequency with spectral energy mostly concentrated in the high-frequency region. Based on these characteristics, a large amount of voice data sharing the same attribute can be collected, acoustic features reflecting that attribute can be extracted and labeled, and an attribute recognition classifier can be trained to classify the voice attribute. For multiple voice attributes, multiple attribute recognition classifiers can be trained to perform classification and judgment separately. After a series of voice attribute values of the speech is obtained, these values are used, according to a preset interaction decision, to output interactive feedback information in a specific voice interaction scenario.
The invention can be applied, for example, to a song-ordering scenario: if the speaker's emotion is recognized as sad, some cheerful songs can be recommended; if the speaker sounds impatient, some mild songs can be recommended.
The present application will be described in detail below with reference to the embodiments and the accompanying drawings.
Referring to fig. 1, a block diagram of an embodiment of an interactive system based on voice attribute classification is shown, the system comprising:
an acoustic feature extraction unit 10 configured to extract an acoustic feature of an input speech signal and generate a first signal;
a voice attribute classifying unit 20 configured to classify the first signal by voice attribute, output a voice attribute result, and generate a second signal;
and the interaction decision unit 30 is configured to determine the type of interaction based on the second signal and output feedback information.
The acoustic feature extraction unit 10 further includes a front-end processing unit configured to perform digital preprocessing and voice endpoint detection on the input voice signal. The front-end processing unit is mainly responsible for acquiring the effective voice signal, reducing the interference caused by silence and noise, and reducing the amount of computation.
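For illustration only, the following is a minimal Python sketch of the kind of energy-based voice endpoint detection the front-end processing unit might perform; the frame length, hop size, threshold ratio and function name are illustrative assumptions, not values specified in the patent.

import numpy as np

def detect_endpoints(signal, frame_len=400, hop=160, energy_ratio=0.1):
    """Mark frames whose short-time energy exceeds a fraction of the peak frame energy."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.array([np.sum(signal[i * hop: i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    threshold = energy_ratio * energy.max()
    # Boolean speech/non-speech mask per frame, plus the frame energies for later reuse.
    return energy > threshold, energy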
The acoustic feature extraction unit 10 extracts a series of acoustic features reflecting the attributes of speech. The extracted acoustic features mainly include:
fundamental frequency: fundamental tone refers to the periodicity caused by vocal cord vibration when voiced sound is emitted, and fundamental frequency is the frequency of vocal cord vibration. The pitch is one of the most important parameters of a speech signal, and can represent information such as emotion, age, and sex included in speech. Accurate detection of the fundamental frequency becomes difficult due to the non-stationarity and non-periodicity of the speech signal and the wide variation range of the pitch period. The present embodiment detects the fundamental frequency using a cepstrum method.
MFCC (mel-frequency cepstral coefficients): a short-term spectral feature. To exploit the characteristics of the human auditory system, the spectrum of the speech signal is usually passed through a bank of band-pass filters whose center frequencies follow a human perceptual scale, and the spectral features are then extracted from the filtered signal. This embodiment adopts mel-frequency cepstral coefficient (MFCC) features.
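For the MFCC feature, one common way to compute it in Python is with the librosa library; the choice of librosa, the 16 kHz sampling rate, 13 coefficients and 25 ms / 10 ms framing are illustrative assumptions rather than details given in the patent.

import librosa

def extract_mfcc(path, n_mfcc=13):
    """Load an audio file and compute per-frame MFCC features (shape: n_mfcc x n_frames)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))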
Formants: while speaking, the vocal tract continuously changes shape and adapts to make the speech clear, and its configuration is also influenced by the emotional state of the speaker. The vocal tract acts as a resonator during phonation: when a vowel excites the vocal tract, a set of resonance frequencies arises, the so-called formant frequencies, formants for short, which depend on the shape and physical characteristics of the vocal tract. Different vowels correspond to different formant parameters; using more formants describes speech better, and in practical applications the first three formants are generally collected.
The above three features are the basic acoustic features of the present invention, and voice attribute classification can be implemented based on them. To achieve better results, the following acoustic features of the speaker can be further extracted:
Short-time energy: the energy of the voice signal reflects the strength of the voice and is strongly correlated with emotional information. Short-time energy is computed in the time domain as the sum of the squared sample amplitudes of one frame of speech.
Pitch jitter and shimmer (flicker): jitter refers to the variation of the fundamental frequency between consecutive cycles, i.e. the amplitude of the pitch-frequency change between two adjacent frames of the speech signal. Shimmer (flicker) refers to the variation of energy between consecutive cycles, i.e. the short-time energy change between two adjacent frames of the speech signal.
Harmonic-to-noise ratio: the ratio of the harmonic components to the noise components in the speech signal, which can reflect emotional changes to a certain extent.
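A minimal sketch of the supplementary features just described (short-time energy, pitch jitter and shimmer), assuming framed audio and a per-frame F0 track are already available; the normalisation by the mean value is one common convention, not one mandated by the patent, and the harmonic-to-noise ratio is omitted here.

import numpy as np

def short_time_energy(frames):
    """Sum of squared sample amplitudes per frame (frames: n_frames x frame_len)."""
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)

def jitter(f0_track):
    """Mean absolute F0 change between adjacent voiced frames, relative to the mean F0."""
    f0 = np.asarray(f0_track, dtype=float)
    return np.mean(np.abs(np.diff(f0))) / (np.mean(f0) + 1e-10)

def shimmer(energy_track):
    """Mean absolute energy change between adjacent frames, relative to the mean energy."""
    e = np.asarray(energy_track, dtype=float)
    return np.mean(np.abs(np.diff(e))) / (np.mean(e) + 1e-10)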
The voice attribute classification unit 20 sets at least one attribute recognition classifier according to the selected voice attributes. Using pattern recognition techniques, the extracted acoustic features are input into each attribute recognition classifier, and the output of each classifier is an attribute detection result. In this embodiment, 8 voice attributes are selected as classification objects, covering gender, age, emotion and health attributes, as follows:
gender attribute:
first voice attribute: for detecting male or female voices;
age attribute:
second voice attribute: for detecting whether the speaker is a child or an adult;
emotion attributes:
third voice attribute: for detecting whether the speaker is angry;
fourth voice attribute: for detecting whether the speaker is worried;
fifth voice attribute: for detecting whether the speaker is cheerful;
health attributes:
sixth voice attribute: for detecting whether the speaker coughs;
seventh voice attribute: for detecting whether a nasal sound is present;
eighth voice attribute: for detecting whether the voice is hoarse.
the working modes of the attribute recognition classifier are divided into two types: the first is a training mode, and the second is a testing mode. In the training mode, the attribute recognition classifier learns potential characteristics and rules in the data samples, collects a large number of data samples, manually marks the voice attribute class to which each data sample belongs, inputs the data samples and the corresponding voice attribute class marks into the attribute recognition classifier, and adjusts model parameters in the attribute recognition classifier by adopting a training algorithm. After training is finished, the characteristics of different types of data are reflected on the model parameters of the attribute recognition classifier, and the method can be used for testing new data. In the test mode, the attribute recognition classifier directly classifies the newly acquired data according to the previously learned rule without the step of manual labeling and outputs the classification result.
In this embodiment, multiple voice attributes need to be detected, so a separate attribute recognition classifier is trained for each voice attribute. Each classifier outputs two classes, giving the probabilities that the attribute is "positive" and "negative". For example, male/female is output for the gender attribute, child/adult for the age attribute, yes/no for the health attributes, and so on.
The algorithm of the attribute recognition classifier can be chosen from several options, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN) and deep neural networks (DNN); in this embodiment, deep neural networks are selected to form the attribute classification unit.
In this embodiment, 8 independent attribute recognition classifiers are designed for the 8 voice attributes, each adopting the deep neural network algorithm.
Each attribute recognition DNN adopts the same structure: the input layer has 51 nodes, corresponding to the acoustic features; there are 4 hidden layers, each with 512 nodes; the output layer has two nodes, corresponding to the "positive" and "negative" classes of the voice attribute; the hidden-layer nodes use the sigmoid activation function; the output-layer nodes use the softmax activation function; adjacent layers are fully connected. The weights w on the connections are free parameters that need to be trained.
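The per-attribute classifier structure described above (51 inputs, four 512-node sigmoid hidden layers, two softmax outputs, adjacent layers fully connected) could be sketched in PyTorch as follows; PyTorch itself is an assumption, since the patent does not name an implementation framework.

import torch.nn as nn

def build_attribute_dnn(n_inputs=51, n_hidden=512, n_hidden_layers=4, n_classes=2):
    """One binary attribute recognition classifier: 51 inputs -> 4 x 512 sigmoid layers -> 2 softmax outputs."""
    layers, width = [], n_inputs
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(width, n_hidden), nn.Sigmoid()]
        width = n_hidden
    layers += [nn.Linear(width, n_classes), nn.Softmax(dim=1)]
    return nn.Sequential(*layers)

# Eight independent classifiers, one per voice attribute.
classifiers = [build_attribute_dnn() for _ in range(8)]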
The training adopts a two-stage method, as follows:
1) Pre-training: the weights of each DNN layer are initialized layer by layer using unsupervised restricted Boltzmann machines (RBM);
the constrained boltzmann machine (RBM) is a generative model. Inspired by an energy functional in statistical mechanics, the RBM introduces an energy function to describe the probability distribution of data. The energy function is a measure describing the state of the whole system. The more ordered or concentrated the probability distribution, the less energy the system. Conversely, the more disordered or evenly distributed the probability distribution, the more energy the system has. The minimum value of the energy function corresponds to the most stable state of the system. The RBM consists of two layers of nodes, a Visible-Layer (Visible-Layer) and a Hidden-Layer (Hidden-Layer). Typically, the visible layer inputs raw data and the hidden layer outputs learned features. By "constrained" is meant that nodes on the same level are not connected, while nodes on different levels are interconnected. Assuming that the visible layer variable is v and the hidden layer variable is h, when the input layer and the hidden layer are both bernoulli distributed, the joint probability distribution p (v, h) of the two can be defined by the energy function E (v, h):
E(v,h) = -Σ_{i∈visible} a_i v_i - Σ_{j∈hidden} b_j h_j - Σ_{i,j} v_i h_j w_{ij}    (1)
p(v,h) = (1/Z) e^{-E(v,h)}    (2)
where w_{ij} represents the connection weight between visible-layer node i and hidden-layer node j, and the vectors a and b represent the biases of the visible layer and the hidden layer respectively; in formula (2), Z is a normalization coefficient. Marginalizing the joint probability p(v,h) over the variable h gives the observation likelihood p(v) of the data:
p(v) = Σ_h p(v,h) = (1/Z) Σ_h e^{-E(v,h)}    (3)
Starting from maximum likelihood estimation, the criterion function of RBM training is the log-likelihood of the training data:
L(w) = Σ_n log p(v^{(n)}; w)    (4)
where w represents the weight parameters and n indexes the n-th training sample. Optimizing formula (4) with a gradient method yields the weight update:
ΔW_{ij} = <v_i h_j>_{data} - <v_i h_j>_{model}    (5)
where <·> denotes the expectation of the enclosed variables. The first term is the expectation under the given sample data; the second term is the expectation under the model itself, which is not directly available. A typical approach is Gibbs sampling, and a fast algorithm called Contrastive Divergence (CD) can efficiently approximate equation (5). The trained RBM weights are used to initialize the DNN: the RBMs are trained layer by layer, the hidden-layer output of the lower RBM serves as the visible layer of the next RBM, and they are stacked in turn until the set number of DNN layers is reached.
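A minimal NumPy sketch of one CD-1 update for a Bernoulli-Bernoulli RBM, following equations (1) to (5); the batch handling, learning rate and variable names are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step. v0: batch x n_visible; W: n_visible x n_hidden; a, b: visible/hidden biases."""
    # Positive phase: hidden probabilities given the data, and a binary sample of them.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visible layer, then recompute the hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # <v h>_data - <v h>_model, averaged over the batch (equation (5)).
    dW = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    W += lr * dW
    a += lr * np.mean(v0 - pv1, axis=0)
    b += lr * np.mean(ph0 - ph1, axis=0)
    return W, a, b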
2) Fine-tuning: the initialized network parameters are adjusted using the error back-propagation (EBP) algorithm. The fine-tuning stage is a training mode using error back-propagation.
The acoustic features of each frame of speech are fed independently into the 8 DNNs, producing 8 voice attribute value outputs that give the probability of each attribute. The average of the probability outputs over all speech frames of an utterance is calculated according to formula (6) below and taken as the final probability of the voice attribute, i.e. the classification result, which constitutes the second signal.
P_{k,pos} = (1/N) Σ_{n=1}^{N} P_{kn,pos}    (6)
where k is the voice attribute index (in this example k ranges from 1 to 8); N is the number of frames in the speech segment; P_{kn,pos} is the probability that voice attribute k of the n-th frame is positive; and P_{k,pos} is the average over the N frames of the probability that voice attribute k is positive, i.e. the "positive" output of the DNN.
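Formula (6) amounts to averaging each classifier's per-frame "positive" probability over the utterance. A minimal sketch, reusing the classifiers built in the earlier sketch and assuming that column 0 of each softmax output corresponds to the "positive" class:

import torch

def classify_utterance(frames, classifiers):
    """frames: N x 51 float tensor of per-frame acoustic features; classifiers: the 8 attribute DNNs.
    Returns a tensor of the 8 utterance-level probabilities P_{k,pos} of formula (6)."""
    with torch.no_grad():
        # 'positive' probability per frame, one tensor of length N per attribute.
        per_frame = [clf(frames)[:, 0] for clf in classifiers]
    return torch.stack(per_frame).mean(dim=1)   # average over the N frames

# Example: probs = classify_utterance(torch.randn(200, 51), classifiers)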
The interaction decision unit 30 takes the second signal as input, makes a decision on the interactive content, and outputs feedback information. This embodiment defines the decision rule with a binary tree: each node of the tree sets a threshold on the probability of a certain attribute; if the probability is above the threshold, the decision moves to the left child node, otherwise to the right child node, until a leaf node is reached and the decision result is obtained.
Different decision binary trees can be designed for different scenarios. For example, in a song-recommendation scenario, the following judgments can be made from the voice attributes: first judge whether the speaker is a child, and if so select a child voice as the response voice; then judge whether the voice is male or female, and if male select a female voice as the response voice; then judge whether the speaker is angry, and if not, continue to judge whether the speaker is worried; if worried, continue to judge whether the speaker coughs. If the speaker coughs, it can be inferred that the speaker is basically a low-spirited little boy with a cold, in which case some more cheerful children's songs, such as "Healthy Song", can be recommended. The feedback information may be audio, video or text, depending on the application scenario.
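The song-recommendation walk through the decision tree described above could be sketched as follows; the attribute keys, the single 0.5 threshold and the feedback strings are illustrative assumptions (the patent allows a different threshold at every node).

def song_decision(p, t=0.5):
    """p: dict of utterance-level 'positive' probabilities keyed by attribute name."""
    # Choose the response voice.
    if p["child"] > t:
        response_voice = "child voice"
    elif p["male"] > t:
        response_voice = "female voice"
    else:
        response_voice = "male voice"
    # Walk the emotion/health branch of the tree.
    if p["angry"] <= t and p["worried"] > t and p["cough"] > t:
        feedback = "recommend a cheerful children's song, e.g. 'Healthy Song'"
    else:
        feedback = "recommend a song matching the detected mood"
    return response_voice, feedback

# Example: song_decision({"child": 0.9, "male": 0.7, "angry": 0.1, "worried": 0.8, "cough": 0.6})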
Referring to fig. 2, a flow chart of an interactive method based on voice attribute classification is shown.
First, acoustic features of the input voice signal are extracted to generate a first signal (step 100). The main extracted acoustic features comprise the fundamental frequency, the MFCC and the formants of the speech. In addition, to improve classification accuracy, signals such as the short-time energy, pitch jitter and harmonic-to-noise ratio are further extracted in this step.
Next, the voice attribute values of the first signal are determined by attribute recognition classifiers trained beforehand on a large amount of labeled data, and a second signal is generated (step 200). In this step, attribute recognition classifiers trained on a large amount of acoustic feature data give the probability of each voice attribute. The classifier algorithm may be chosen from several options, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN) and deep neural networks (DNN). This embodiment selects deep neural networks for voice attribute classification and designs 8 independent DNNs for the 8 voice attributes.
Finally, feedback information is output based on the second signal (step 300). The decision rule is defined with a binary tree: each node sets a threshold on a certain voice attribute value; if the value is above the threshold, the decision moves to the left child node, otherwise to the right child node, until a leaf node is reached, at which point the decision result is obtained and the feedback information is output.
Step 100 further comprises performing digital preprocessing and voice endpoint detection on the voice signal; these processes extract the effective voice signal, reduce the interference caused by silence and noise, and reduce the amount of computation.
It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps. For example, the step 100 of extracting acoustic features and the step 200 of classifying voice attributes may be combined into a single step.
The method extracts a single set of acoustic features and classifies 8 voice attributes simultaneously, fully mining the characteristics and latent information of the speech signal. Speakers are classified in finer detail, so more targeted interactive responses can be made and a better user experience is obtained. Moreover, no user registration is needed; the richer set of voice attributes makes up for the missing registration information, so the system is more convenient and flexible to use.
In particular, the method described above with reference to fig. 2 may be implemented as a computer software program, according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method of fig. 2.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a unit, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. An interactive system based on voice attribute classification, the system comprising:
an acoustic feature extraction unit configured to extract at least one acoustic feature of each frame of voice of an input voice signal;
a voice attribute classification unit configured to input the at least one acoustic feature of each frame of voice into two or more attribute recognition classifiers, generate outputs of two or more voice attribute values, identify a probability of each attribute, calculate an average value of probability outputs of all voice frames of the voice signal by the two or more attribute recognition classifiers as a final probability of the voice signal, and determine the voice attribute value of the at least one acoustic feature, wherein each attribute recognition classifier outputs one voice attribute value to obtain two or more voice attribute results;
and the interactive decision unit is configured to make a decision of interactive contents based on the more than two voice attribute results and output feedback information.
2. The system of claim 1, wherein the acoustic feature extraction unit comprises a front-end processing unit configured to perform digital preprocessing and voice endpoint detection on the input voice signal.
3. The system according to claim 1, wherein the acoustic feature extraction unit comprises a component configured to extract a fundamental frequency, mel-frequency cepstral coefficients (MFCC), and formants of speech.
4. The system according to claim 3, wherein the acoustic feature extraction unit, the acoustic features configured for extraction, further comprises at least one of: short-time energy characteristics, pitch jitter and flicker, harmonic-to-noise ratio.
5. The system of claim 1, wherein the voice attribute classification unit comprises at least one of the following attribute recognition classifiers: a gender attribute identification classifier, an age attribute identification classifier, an emotion attribute identification classifier, and a health attribute identification classifier.
6. The system of claim 1, wherein the attribute recognition classifier employs a Deep Neural Network (DNN) algorithm.
7. The system of claim 6, wherein the operation mode of the attribute recognition classifier is divided into a training mode and a testing mode, wherein the training mode adopts two-stage training, including a pre-training stage and a fine-tuning stage, an unsupervised limited Boltzmann model is adopted in the pre-training stage, and an error back-propagation algorithm is adopted in the fine-tuning stage.
8. An interactive method based on voice attribute classification, the method comprising:
extracting at least one acoustic feature of each frame of voice of an input voice signal;
inputting the at least one acoustic feature of each frame of voice into more than two attribute recognition classifiers, generating output of more than two voice attribute values, identifying the probability of each attribute, calculating the average value of the probability output of all voice frames of the voice signal by the more than two attribute recognition classifiers as the final probability of the voice signal, and determining the voice attribute value of the at least one acoustic feature through attribute recognition classification, wherein each attribute recognition classifier outputs one voice attribute value to obtain more than two voice attribute results;
and making a decision of interactive content based on the more than two voice attribute results, and outputting feedback information.
9. The method of claim 8, wherein extracting the acoustic features of the input speech signal comprises a front-end processing for performing a digital pre-processing and a speech endpoint detection on the input speech signal.
10. The method of claim 8, wherein the acoustic features include a fundamental frequency of speech, mel-frequency cepstral coefficients (MFCCs), formants.
11. The method of claim 10, wherein the acoustic features further comprise at least one of: short-time energy characteristics, pitch jitter and flicker, harmonic-to-noise ratio.
12. The method of claim 8, wherein the voice attribute classification comprises at least one of the following attribute recognition classifications: gender attribute identification classification, age attribute identification classification, emotion attribute identification classification, and health attribute identification classification.
13. The method of claim 8, wherein the attribute recognition classification employs a Deep Neural Network (DNN) algorithm.
14. The method of claim 13, wherein the operation mode of the attribute recognition classification is divided into a training mode and a testing mode, wherein the training mode adopts two-stage training, including a pre-training stage and a fine-tuning stage, an unsupervised limited boltzmann model is adopted in the pre-training stage, and an error back propagation algorithm is adopted in the fine-tuning stage.
CN201610244968.8A 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification Active CN105761720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610244968.8A CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610244968.8A CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Publications (2)

Publication Number Publication Date
CN105761720A CN105761720A (en) 2016-07-13
CN105761720B (en) 2020-01-07

Family

ID=56324445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610244968.8A Active CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Country Status (1)

Country Link
CN (1) CN105761720B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106686267A (en) * 2015-11-10 2017-05-17 中国移动通信集团公司 Method and system for implementing personalized voice service
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
US10878831B2 (en) * 2017-01-12 2020-12-29 Qualcomm Incorporated Characteristic-based speech codebook selection
CN106898355B (en) * 2017-01-17 2020-04-14 北京华控智加科技有限公司 Speaker identification method based on secondary modeling
CN107316635B (en) * 2017-05-19 2020-09-11 科大讯飞股份有限公司 Voice recognition method and device, storage medium and electronic equipment
WO2019023879A1 (en) * 2017-07-31 2019-02-07 深圳和而泰智能家居科技有限公司 Cough sound recognition method and device, and storage medium
CN107680599A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 User property recognition methods, device and electronic equipment
CN108132995A (en) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 For handling the method and apparatus of audio-frequency information
CN107995370B (en) * 2017-12-21 2020-11-24 Oppo广东移动通信有限公司 Call control method, device, storage medium and mobile terminal
CN108109622A (en) * 2017-12-28 2018-06-01 武汉蛋玩科技有限公司 A kind of early education robot voice interactive education system and method
CN108186033B (en) * 2018-01-08 2021-06-25 杭州不亦乐乎健康管理有限公司 Artificial intelligence-based infant emotion monitoring method and system
US10811000B2 (en) * 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
CN109165284B (en) * 2018-08-22 2020-06-16 重庆邮电大学 Financial field man-machine conversation intention identification method based on big data
CN109102805A (en) * 2018-09-20 2018-12-28 北京长城华冠汽车技术开发有限公司 Voice interactive method, device and realization device
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN111599342A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Tone selecting method and system
CN110021308B (en) * 2019-05-16 2021-05-18 北京百度网讯科技有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN110379441B (en) * 2019-07-01 2020-07-17 特斯联(北京)科技有限公司 Voice service method and system based on countermeasure type artificial intelligence network
CN112530418A (en) * 2019-08-28 2021-03-19 北京声智科技有限公司 Voice wake-up method, device and related equipment
CN110600042B (en) * 2019-10-10 2020-10-23 公安部第三研究所 Method and system for recognizing gender of disguised voice speaker
CN111179915A (en) * 2019-12-30 2020-05-19 苏州思必驰信息科技有限公司 Age identification method and device based on voice
CN111772422A (en) * 2020-06-12 2020-10-16 广州城建职业学院 Intelligent crib
CN113143570B (en) * 2021-04-27 2023-08-11 福州大学 Snore relieving pillow with multiple sensors integrated with feedback adjustment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1107227A2 (en) * 1999-11-30 2001-06-13 Sony Corporation Voice processing
JP2003345385A (en) * 2002-05-30 2003-12-03 Matsushita Electric Ind Co Ltd Voice recognition and discrimination device
CN1564245A (en) * 2004-04-20 2005-01-12 上海上悦通讯技术有限公司 Stunt method and device for baby's crying
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN101201980A (en) * 2007-12-19 2008-06-18 北京交通大学 Remote Chinese language teaching system based on voice affection identification
CN103117060A (en) * 2013-01-18 2013-05-22 中国科学院声学研究所 Modeling approach and modeling system of acoustic model used in speech recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008126627A1 (en) * 2007-03-26 2008-10-23 Nec Corporation Voice analysis device, voice classification method, and voice classification program
US8239196B1 (en) * 2011-07-28 2012-08-07 Google Inc. System and method for multi-channel multi-feature speech/noise classification for noise suppression
CN103546503B (en) * 2012-07-10 2017-03-15 百度在线网络技术(北京)有限公司 Voice-based cloud social intercourse system, method and cloud analysis server

Also Published As

Publication number Publication date
CN105761720A (en) 2016-07-13

Similar Documents

Publication Publication Date Title
CN105761720B (en) Interactive system and method based on voice attribute classification
Shahin et al. Emotion recognition using hybrid Gaussian mixture model and deep neural network
Basu et al. A review on emotion recognition using speech
Venkataramanan et al. Emotion recognition from speech
Schuller et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture
Friedland et al. Prosodic and other long-term features for speaker diarization
Mannepalli et al. Emotion recognition in speech signals using optimization based multi-SVNN classifier
Tong et al. A comparative study of robustness of deep learning approaches for VAD
JP4220449B2 (en) Indexing device, indexing method, and indexing program
CN105023573A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
JP6908045B2 (en) Speech processing equipment, audio processing methods, and programs
CN110827857B (en) Speech emotion recognition method based on spectral features and ELM
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
Tsenov et al. Speech recognition using neural networks
Archana et al. Gender identification and performance analysis of speech signals
Khan et al. Quranic reciter recognition: a machine learning approach
WO2021171956A1 (en) Speaker identification device, speaker identification method, and program
Alshamsi et al. Automated speech emotion recognition on smart phones
Gomes et al. i-vector algorithm with Gaussian Mixture Model for efficient speech emotion recognition
Hadjadji et al. Emotion recognition in Arabic speech
Nalini et al. Emotion recognition in music signal using AANN and SVM
Palo et al. Emotion Analysis from Speech of Different Age Groups.
Patil et al. Emotion detection from speech using Mfcc & GMM
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant