CN105761720B - Interactive system and method based on voice attribute classification - Google Patents

Interactive system and method based on voice attribute classification Download PDF

Info

Publication number
CN105761720B
CN105761720B (application CN201610244968.8A)
Authority
CN
China
Prior art keywords
voice
attribute
classification
signal
acoustic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610244968.8A
Other languages
Chinese (zh)
Other versions
CN105761720A (en)
Inventor
潘复平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201610244968.8A priority Critical patent/CN105761720B/en
Publication of CN105761720A publication Critical patent/CN105761720A/en
Application granted granted Critical
Publication of CN105761720B publication Critical patent/CN105761720B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques for comparison or discrimination for estimating an emotional state
    • G10L25/66 Speech or voice analysis techniques for comparison or discrimination for extracting parameters related to health condition

Abstract

The application discloses an interactive system and method based on voice attribute classification. The system comprises: an acoustic feature extraction unit configured to extract acoustic features of an input voice signal and generate a first signal; a voice attribute classification unit configured to determine voice attribute values of the first signal through attribute recognition classifiers, output voice attribute results, and generate a second signal; and an interaction decision unit configured to output feedback information based on the second signal. The voice attribute classification unit detects multiple voice attributes simultaneously and outputs corresponding feedback information according to the voice attribute values, making the interaction process rich and varied.

Description

Interactive system and method based on voice attribute classification
Technical Field
The present disclosure relates generally to the field of interaction, more particularly to human-computer interaction technology, and in particular to an interaction system based on voice attribute classification.
Background
In the conventional human-computer voice interaction process, a machine recognizes a voice command issued by a person and then responds according to the recognition result. The content of the interaction is limited to the literal meaning of the voice command, its form is monotonous, the user experience is dull, and it is not suitable for interactive scenarios such as toys and smart-home applications that call for lively and varied forms.
At present, human-computer interaction often relies on voiceprint registration to determine the identity of the user and thereby personalize the interaction. During voiceprint registration, the user's voice is enrolled using voiceprint recognition technology and the user's identity is associated with the voiceprint. During use, the speaker's voiceprint is recognized, the speaker's identity is determined from it, and limited interactive variations are then made according to that identity. For example, some intelligent toys can judge from the voice whether the speaker is dad, mom or the baby, and change how they address the speaker accordingly.
The prior art has two disadvantages: on one hand, the traditional technology can only detect a single voice attribute, so the variation of interactive content based on that attribute is very limited; on the other hand, voiceprint registration is cumbersome and inflexible to use.
Disclosure of Invention
In view of the above-mentioned shortcomings or drawbacks of the prior art, it is desirable to provide an interactive system based on voice attribute classification and a method thereof.
In a first aspect, an interactive system based on voice attribute classification is provided, the system comprising:
an acoustic feature extraction unit configured to extract acoustic features of an input voice signal and generate a first signal;
a voice attribute classification unit configured to determine voice attribute values of the first signal through attribute recognition classifiers, output voice attribute results, and generate a second signal;
and an interaction decision unit configured to output feedback information based on the second signal.
In a second aspect, an interactive method based on voice attribute classification is provided, the method comprising:
extracting acoustic features of an input voice signal to generate a first signal;
determining voice attribute values of the first signal through attribute recognition and classification, outputting voice attribute results, and generating a second signal;
and outputting feedback information based on the second signal.
According to the technical solution provided by the embodiments of the application, the voice attribute classification unit can detect multiple voice attributes of the speech simultaneously and output corresponding feedback information according to each voice attribute value, making the interaction process rich and varied. In addition, the invention can automatically judge the identity of the speaker through voice attribute classification, so no registration process is needed, and the system is simple, convenient and flexible to use.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a block diagram of an interactive system based on voice attribute classification, according to an embodiment.
FIG. 2 is a flow chart of an interaction method based on voice attribute classification.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In the voice interaction process, besides the textual content of the voice instruction, other voice attributes of the speech can also be recognized, and these attributes can be used to enrich the form and content of the interaction. Such voice attributes include the age range, gender, emotion and health condition of the speaker. Age and gender are reflected in the fundamental frequency and timbre of the speech; emotion is reflected in stress, intonation, speaking rate and pauses; the health condition is reflected in whether the voice is hoarse, whether it is accompanied by coughing, whether it is accompanied by nasal sounds, and so on. The same voice attribute exhibits a consistent distribution pattern across the voice signals of different speakers. For example, male voices have a low fundamental frequency with spectral energy mostly concentrated in the low-frequency region, while female voices have a high fundamental frequency with spectral energy mostly concentrated in the high-frequency region. Based on these characteristics, a large amount of voice data sharing the same attribute can be collected, acoustic features reflecting that attribute can be extracted and labeled, and an attribute recognition classifier can be trained to classify the voice attribute. For multiple voice attributes, multiple attribute recognition classifiers can be trained to perform classification and judgment separately. After a series of voice attribute values of the speech is obtained, these values are used, according to a preset interaction decision, to output interactive feedback information in a specific voice interaction scenario.
The invention can be applied, for example, to a song-ordering scenario: if the speaker's emotion is recognized as sad, some cheerful songs can be recommended; if the speaker sounds impatient, some mild songs can be recommended.
The present application will be described in detail below with reference to the embodiments and the accompanying drawings.
Referring to fig. 1, a block diagram of an embodiment of an interactive system based on voice attribute classification is shown, the system comprising:
an acoustic feature extraction unit 10 configured to extract an acoustic feature of an input speech signal and generate a first signal;
a voice attribute classifying unit 20 configured to classify the first signal by voice attribute, output a voice attribute result, and generate a second signal;
and the interaction decision unit 30 is configured to determine the type of interaction based on the second signal and output feedback information.
The acoustic feature extraction unit 10 further includes a front-end processing unit configured to perform digital preprocessing and voice endpoint detection on the input voice signal. The front-end processing unit is mainly responsible for acquiring the effective voice signal, reducing the interference caused by silence and noise, and reducing the amount of computation.
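For illustration only, the following is a minimal Python sketch of the kind of energy-based voice endpoint detection the front-end processing unit might perform; the frame length, hop size, threshold ratio and function name are illustrative assumptions, not values specified in the patent.

import numpy as np

def detect_endpoints(signal, frame_len=400, hop=160, energy_ratio=0.1):
    """Mark frames whose short-time energy exceeds a fraction of the peak frame energy."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.array([np.sum(signal[i * hop: i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    threshold = energy_ratio * energy.max()
    # Boolean speech/non-speech mask per frame, plus the frame energies for later reuse.
    return energy > threshold, energy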
The acoustic feature extraction unit 10 extracts a series of acoustic features reflecting the attributes of speech. The extracted acoustic features mainly include:
fundamental frequency: fundamental tone refers to the periodicity caused by vocal cord vibration when voiced sound is emitted, and fundamental frequency is the frequency of vocal cord vibration. The pitch is one of the most important parameters of a speech signal, and can represent information such as emotion, age, and sex included in speech. Accurate detection of the fundamental frequency becomes difficult due to the non-stationarity and non-periodicity of the speech signal and the wide variation range of the pitch period. The present embodiment detects the fundamental frequency using a cepstrum method.
MFCC (mel-frequency cepstral coefficients): a short-term spectral feature. To exploit the characteristics of the human auditory system, the spectrum of the speech signal is usually passed through a bank of band-pass filters whose center frequencies follow a human perceptual scale, and the spectral features are then extracted from the filtered signal. This embodiment adopts mel-frequency cepstral coefficient (MFCC) features.
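For the MFCC feature, one common way to compute it in Python is with the librosa library; the choice of librosa, the 16 kHz sampling rate, 13 coefficients and 25 ms / 10 ms framing are illustrative assumptions rather than details given in the patent.

import librosa

def extract_mfcc(path, n_mfcc=13):
    """Load an audio file and compute per-frame MFCC features (shape: n_mfcc x n_frames)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))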
Formants: while speaking, the vocal tract continuously changes shape and adapts to make the speech clear, and its configuration is also influenced by the emotional state of the speaker. The vocal tract acts as a resonator during phonation: when a vowel excites the vocal tract, a set of resonance frequencies arises, the so-called formant frequencies, formants for short, which depend on the shape and physical characteristics of the vocal tract. Different vowels correspond to different formant parameters; using more formants describes speech better, and in practical applications the first three formants are generally collected.
The above three features are the basic acoustic features of the present invention, and voice attribute classification can be implemented based on them. To achieve better results, the following acoustic features of the speaker can be further extracted:
Short-time energy: the energy of the voice signal reflects the strength of the voice and is strongly correlated with emotional information. Short-time energy is computed in the time domain as the sum of the squared sample amplitudes of one frame of speech.
Pitch jitter and shimmer (flicker): jitter refers to the variation of the fundamental frequency between consecutive cycles, i.e. the amplitude of the pitch-frequency change between two adjacent frames of the speech signal. Shimmer (flicker) refers to the variation of energy between consecutive cycles, i.e. the short-time energy change between two adjacent frames of the speech signal.
Harmonic-to-noise ratio: the ratio of the harmonic components to the noise components in the speech signal, which can reflect emotional changes to a certain extent.
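A minimal sketch of the supplementary features just described (short-time energy, pitch jitter and shimmer), assuming framed audio and a per-frame F0 track are already available; the normalisation by the mean value is one common convention, not one mandated by the patent, and the harmonic-to-noise ratio is omitted here.

import numpy as np

def short_time_energy(frames):
    """Sum of squared sample amplitudes per frame (frames: n_frames x frame_len)."""
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)

def jitter(f0_track):
    """Mean absolute F0 change between adjacent voiced frames, relative to the mean F0."""
    f0 = np.asarray(f0_track, dtype=float)
    return np.mean(np.abs(np.diff(f0))) / (np.mean(f0) + 1e-10)

def shimmer(energy_track):
    """Mean absolute energy change between adjacent frames, relative to the mean energy."""
    e = np.asarray(energy_track, dtype=float)
    return np.mean(np.abs(np.diff(e))) / (np.mean(e) + 1e-10)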
The voice attribute classification unit 20 sets at least one attribute recognition classifier according to the selected voice attributes. Using pattern recognition techniques, the extracted acoustic features are input into each attribute recognition classifier, and the output of each classifier is an attribute detection result. In this embodiment, 8 voice attributes are selected as classification objects, covering gender, age, emotion and health attributes, as follows:
gender attribute:
first voice attribute: for detecting male or female voices;
age attribute:
second voice attribute: for detecting whether the speaker is a child or an adult;
emotion attributes:
third voice attribute: for detecting whether the speaker is angry;
fourth voice attribute: for detecting whether the speaker is worried;
fifth voice attribute: for detecting whether the speaker is cheerful;
health attributes:
sixth voice attribute: for detecting whether the speaker coughs;
seventh voice attribute: for detecting whether a nasal sound is present;
eighth voice attribute: for detecting whether the voice is hoarse.
the working modes of the attribute recognition classifier are divided into two types: the first is a training mode, and the second is a testing mode. In the training mode, the attribute recognition classifier learns potential characteristics and rules in the data samples, collects a large number of data samples, manually marks the voice attribute class to which each data sample belongs, inputs the data samples and the corresponding voice attribute class marks into the attribute recognition classifier, and adjusts model parameters in the attribute recognition classifier by adopting a training algorithm. After training is finished, the characteristics of different types of data are reflected on the model parameters of the attribute recognition classifier, and the method can be used for testing new data. In the test mode, the attribute recognition classifier directly classifies the newly acquired data according to the previously learned rule without the step of manual labeling and outputs the classification result.
In this embodiment, multiple voice attributes need to be detected, so a separate attribute recognition classifier is trained for each voice attribute. Each classifier outputs two classes, giving the probabilities that the attribute is "positive" and "negative". For example, male/female is output for the gender attribute, child/adult for the age attribute, yes/no for the health attributes, and so on.
The algorithm of the attribute recognition classifier can be chosen from several options, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN) and deep neural networks (DNN); in this embodiment, deep neural networks are selected to form the attribute classification unit.
In this embodiment, 8 independent attribute recognition classifiers are designed for the 8 voice attributes, each adopting the deep neural network algorithm.
Each attribute recognition DNN adopts the same structure: the input layer has 51 nodes, corresponding to the acoustic features; there are 4 hidden layers, each with 512 nodes; the output layer has two nodes, corresponding to the "positive" and "negative" classes of the voice attribute; the hidden-layer nodes use the sigmoid activation function; the output-layer nodes use the softmax activation function; adjacent layers are fully connected. The weights w on the connections are free parameters that need to be trained.
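The per-attribute classifier structure described above (51 inputs, four 512-node sigmoid hidden layers, two softmax outputs, adjacent layers fully connected) could be sketched in PyTorch as follows; PyTorch itself is an assumption, since the patent does not name an implementation framework.

import torch.nn as nn

def build_attribute_dnn(n_inputs=51, n_hidden=512, n_hidden_layers=4, n_classes=2):
    """One binary attribute recognition classifier: 51 inputs -> 4 x 512 sigmoid layers -> 2 softmax outputs."""
    layers, width = [], n_inputs
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(width, n_hidden), nn.Sigmoid()]
        width = n_hidden
    layers += [nn.Linear(width, n_classes), nn.Softmax(dim=1)]
    return nn.Sequential(*layers)

# Eight independent classifiers, one per voice attribute.
classifiers = [build_attribute_dnn() for _ in range(8)]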
The training adopts a two-stage method, as follows:
1) Pre-training: the weights of each DNN layer are initialized layer by layer using unsupervised restricted Boltzmann machines (RBM);
the constrained boltzmann machine (RBM) is a generative model. Inspired by an energy functional in statistical mechanics, the RBM introduces an energy function to describe the probability distribution of data. The energy function is a measure describing the state of the whole system. The more ordered or concentrated the probability distribution, the less energy the system. Conversely, the more disordered or evenly distributed the probability distribution, the more energy the system has. The minimum value of the energy function corresponds to the most stable state of the system. The RBM consists of two layers of nodes, a Visible-Layer (Visible-Layer) and a Hidden-Layer (Hidden-Layer). Typically, the visible layer inputs raw data and the hidden layer outputs learned features. By "constrained" is meant that nodes on the same level are not connected, while nodes on different levels are interconnected. Assuming that the visible layer variable is v and the hidden layer variable is h, when the input layer and the hidden layer are both bernoulli distributed, the joint probability distribution p (v, h) of the two can be defined by the energy function E (v, h):
E(v,h) = -Σ_{i∈visible} a_i v_i - Σ_{j∈hidden} b_j h_j - Σ_{i,j} v_i h_j w_{ij}    (1)
p(v,h) = (1/Z) e^{-E(v,h)}    (2)
where w_{ij} represents the connection weight between visible-layer node i and hidden-layer node j, and the vectors a and b represent the biases of the visible layer and the hidden layer respectively; in formula (2), Z is a normalization coefficient. Marginalizing the joint probability p(v,h) over the variable h gives the observation likelihood p(v) of the data:
p(v) = Σ_h p(v,h) = (1/Z) Σ_h e^{-E(v,h)}    (3)
Starting from maximum likelihood estimation, the criterion function of RBM training is the log-likelihood of the training data:
L(w) = Σ_n log p(v^{(n)}; w)    (4)
where w represents the weight parameters and n indexes the n-th training sample. Optimizing formula (4) with a gradient method yields the weight update:
ΔW_{ij} = <v_i h_j>_{data} - <v_i h_j>_{model}    (5)
where <·> denotes the expectation of the enclosed variables. The first term is the expectation under the given sample data; the second term is the expectation under the model itself, which is not directly available. A typical approach is Gibbs sampling, and a fast algorithm called Contrastive Divergence (CD) can efficiently approximate equation (5). The trained RBM weights are used to initialize the DNN: the RBMs are trained layer by layer, the hidden-layer output of the lower RBM serves as the visible layer of the next RBM, and they are stacked in turn until the set number of DNN layers is reached.
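A minimal NumPy sketch of one CD-1 update for a Bernoulli-Bernoulli RBM, following equations (1) to (5); the batch handling, learning rate and variable names are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step. v0: batch x n_visible; W: n_visible x n_hidden; a, b: visible/hidden biases."""
    # Positive phase: hidden probabilities given the data, and a binary sample of them.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visible layer, then recompute the hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # <v h>_data - <v h>_model, averaged over the batch (equation (5)).
    dW = (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    W += lr * dW
    a += lr * np.mean(v0 - pv1, axis=0)
    b += lr * np.mean(ph0 - ph1, axis=0)
    return W, a, b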
2) Fine-tuning: the initialized network parameters are adjusted using the error back-propagation (EBP) algorithm. The fine-tuning stage is a training mode using error back-propagation.
The acoustic features of each frame of speech are fed independently into the 8 DNNs, producing 8 voice attribute value outputs that give the probability of each attribute. The average of the probability outputs over all speech frames of an utterance is calculated according to formula (6) below and taken as the final probability of the voice attribute, i.e. the classification result, which constitutes the second signal.
P_{k,pos} = (1/N) Σ_{n=1}^{N} P_{kn,pos}    (6)
where k is the voice attribute index (in this example k ranges from 1 to 8); N is the number of frames in the speech segment; P_{kn,pos} is the probability that voice attribute k of the n-th frame is positive; and P_{k,pos} is the average over the N frames of the probability that voice attribute k is positive, i.e. the "positive" output of the DNN.
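Formula (6) amounts to averaging each classifier's per-frame "positive" probability over the utterance. A minimal sketch, reusing the classifiers built in the earlier sketch and assuming that column 0 of each softmax output corresponds to the "positive" class:

import torch

def classify_utterance(frames, classifiers):
    """frames: N x 51 float tensor of per-frame acoustic features; classifiers: the 8 attribute DNNs.
    Returns a tensor of the 8 utterance-level probabilities P_{k,pos} of formula (6)."""
    with torch.no_grad():
        # 'positive' probability per frame, one tensor of length N per attribute.
        per_frame = [clf(frames)[:, 0] for clf in classifiers]
    return torch.stack(per_frame).mean(dim=1)   # average over the N frames

# Example: probs = classify_utterance(torch.randn(200, 51), classifiers)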
The interaction decision unit 30 takes the second signal as input, makes a decision on the interactive content, and outputs feedback information. This embodiment defines the decision rule with a binary tree: each node of the tree sets a threshold on the probability of a certain attribute; if the probability is above the threshold, the decision moves to the left child node, otherwise to the right child node, until a leaf node is reached and the decision result is obtained.
Different decision binary trees can be designed for different scenarios. For example, in a song-recommendation scenario, the following judgments can be made from the voice attributes: first judge whether the speaker is a child, and if so select a child voice as the response voice; then judge whether the voice is male or female, and if male select a female voice as the response voice; then judge whether the speaker is angry, and if not, continue to judge whether the speaker is worried; if worried, continue to judge whether the speaker coughs. If the speaker coughs, it can be inferred that the speaker is basically a low-spirited little boy with a cold, in which case some more cheerful children's songs, such as "Healthy Song", can be recommended. The feedback information may be audio, video or text, depending on the application scenario.
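The song-recommendation walk through the decision tree described above could be sketched as follows; the attribute keys, the single 0.5 threshold and the feedback strings are illustrative assumptions (the patent allows a different threshold at every node).

def song_decision(p, t=0.5):
    """p: dict of utterance-level 'positive' probabilities keyed by attribute name."""
    # Choose the response voice.
    if p["child"] > t:
        response_voice = "child voice"
    elif p["male"] > t:
        response_voice = "female voice"
    else:
        response_voice = "male voice"
    # Walk the emotion/health branch of the tree.
    if p["angry"] <= t and p["worried"] > t and p["cough"] > t:
        feedback = "recommend a cheerful children's song, e.g. 'Healthy Song'"
    else:
        feedback = "recommend a song matching the detected mood"
    return response_voice, feedback

# Example: song_decision({"child": 0.9, "male": 0.7, "angry": 0.1, "worried": 0.8, "cough": 0.6})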
Referring to fig. 2, a flow chart of an interactive method based on voice attribute classification is shown.
First, acoustic features of the input voice signal are extracted to generate a first signal (step 100). The main extracted acoustic features comprise the fundamental frequency, the MFCC and the formants of the speech. In addition, to improve classification accuracy, signals such as the short-time energy, pitch jitter and harmonic-to-noise ratio are further extracted in this step.
Next, the voice attribute values of the first signal are determined by attribute recognition classifiers trained beforehand on a large amount of labeled data, and a second signal is generated (step 200). In this step, attribute recognition classifiers trained on a large amount of acoustic feature data give the probability of each voice attribute. The classifier algorithm may be chosen from several options, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN) and deep neural networks (DNN). This embodiment selects deep neural networks for voice attribute classification and designs 8 independent DNNs for the 8 voice attributes.
Finally, feedback information is output based on the second signal (step 300). The decision rule is defined with a binary tree: each node sets a threshold on a certain voice attribute value; if the value is above the threshold, the decision moves to the left child node, otherwise to the right child node, until a leaf node is reached, at which point the decision result is obtained and the feedback information is output.
Step 100 further comprises performing digital preprocessing and voice endpoint detection on the voice signal; these processes extract the effective voice signal, reduce the interference caused by silence and noise, and reduce the amount of computation.
It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps. For example, the step 100 of extracting acoustic features and the step 200 of classifying voice attributes may be combined into a single step.
The method extracts a single set of acoustic features and classifies 8 voice attributes simultaneously, fully mining the characteristics and latent information of the speech signal. Speakers are classified in finer detail, so more targeted interactive responses can be made and a better user experience is obtained. Moreover, no user registration is needed; the richer set of voice attributes makes up for the missing registration information, so the system is more convenient and flexible to use.
In particular, the method described above with reference to fig. 2 may be implemented as a computer software program, according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method of fig. 2.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a unit, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. An interactive system based on voice attribute classification, the system comprising:
an acoustic feature extraction unit configured to extract at least one acoustic feature of each frame of voice of an input voice signal;
a voice attribute classification unit configured to input the at least one acoustic feature of each frame of voice into two or more attribute recognition classifiers, generate outputs of two or more voice attribute values, identify a probability of each attribute, calculate an average value of probability outputs of all voice frames of the voice signal by the two or more attribute recognition classifiers as a final probability of the voice signal, and determine the voice attribute value of the at least one acoustic feature, wherein each attribute recognition classifier outputs one voice attribute value to obtain two or more voice attribute results;
and the interactive decision unit is configured to make a decision of interactive contents based on the more than two voice attribute results and output feedback information.
2. The system of claim 1, wherein the acoustic feature extraction unit comprises a front-end processing unit configured to perform digital preprocessing and voice endpoint detection on the input voice signal.
3. The system according to claim 1, wherein the acoustic feature extraction unit comprises a component configured to extract a fundamental frequency, mel-frequency cepstral coefficients (MFCC), and formants of speech.
4. The system according to claim 3, wherein the acoustic feature extraction unit, the acoustic features configured for extraction, further comprises at least one of: short-time energy characteristics, pitch jitter and flicker, harmonic-to-noise ratio.
5. The system of claim 1, wherein the voice attribute classification unit comprises at least one of the following attribute recognition classifiers: a gender attribute identification classifier, an age attribute identification classifier, an emotion attribute identification classifier, and a health attribute identification classifier.
6. The system of claim 1, wherein the attribute recognition classifier employs a Deep Neural Network (DNN) algorithm.
7. The system of claim 6, wherein the operation mode of the attribute recognition classifier is divided into a training mode and a testing mode, wherein the training mode adopts two-stage training, including a pre-training stage and a fine-tuning stage, an unsupervised limited Boltzmann model is adopted in the pre-training stage, and an error back-propagation algorithm is adopted in the fine-tuning stage.
8. An interactive method based on voice attribute classification, the method comprising:
extracting at least one acoustic feature of each frame of voice of an input voice signal;
inputting the at least one acoustic feature of each frame of voice into more than two attribute recognition classifiers, generating output of more than two voice attribute values, identifying the probability of each attribute, calculating the average value of the probability output of all voice frames of the voice signal by the more than two attribute recognition classifiers as the final probability of the voice signal, and determining the voice attribute value of the at least one acoustic feature through attribute recognition classification, wherein each attribute recognition classifier outputs one voice attribute value to obtain more than two voice attribute results;
and making a decision of interactive content based on the more than two voice attribute results, and outputting feedback information.
9. The method of claim 8, wherein extracting the acoustic features of the input speech signal comprises a front-end processing for performing a digital pre-processing and a speech endpoint detection on the input speech signal.
10. The method of claim 8, wherein the acoustic features include a fundamental frequency of speech, mel-frequency cepstral coefficients (MFCCs), formants.
11. The method of claim 10, wherein the acoustic features further comprise at least one of: short-time energy characteristics, pitch jitter and flicker, harmonic-to-noise ratio.
12. The method of claim 8, wherein the voice attribute classification comprises at least one of the following attribute recognition classifications: gender attribute identification classification, age attribute identification classification, emotion attribute identification classification, and health attribute identification classification.
13. The method of claim 8, wherein the attribute recognition classification employs a Deep Neural Network (DNN) algorithm.
14. The method of claim 13, wherein the operation mode of the attribute recognition classification is divided into a training mode and a testing mode, wherein the training mode adopts two-stage training, including a pre-training stage and a fine-tuning stage, an unsupervised limited boltzmann model is adopted in the pre-training stage, and an error back propagation algorithm is adopted in the fine-tuning stage.
CN201610244968.8A 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification Active CN105761720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610244968.8A CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610244968.8A CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Publications (2)

Publication Number Publication Date
CN105761720A CN105761720A (en) 2016-07-13
CN105761720B (en) 2020-01-07

Family

ID=56324445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610244968.8A Active CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Country Status (1)

Country Link
CN (1) CN105761720B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106686267A (en) * 2015-11-10 2017-05-17 中国移动通信集团公司 Method and system for implementing personalized voice service
CN107886955B (en) * 2016-09-29 2021-10-26 百度在线网络技术(北京)有限公司 Identity recognition method, device and equipment of voice conversation sample
US10878831B2 (en) * 2017-01-12 2020-12-29 Qualcomm Incorporated Characteristic-based speech codebook selection
CN106898355B (en) * 2017-01-17 2020-04-14 北京华控智加科技有限公司 Speaker identification method based on secondary modeling
CN107316635B (en) * 2017-05-19 2020-09-11 科大讯飞股份有限公司 Voice recognition method and device, storage medium and electronic equipment
WO2019023879A1 (en) * 2017-07-31 2019-02-07 深圳和而泰智能家居科技有限公司 Cough sound recognition method and device, and storage medium
CN107680599A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 User property recognition methods, device and electronic equipment
CN108132995A (en) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 For handling the method and apparatus of audio-frequency information
CN107995370B (en) * 2017-12-21 2020-11-24 Oppo广东移动通信有限公司 Call control method, device, storage medium and mobile terminal
CN108109622A (en) * 2017-12-28 2018-06-01 武汉蛋玩科技有限公司 A kind of early education robot voice interactive education system and method
CN108186033B (en) * 2018-01-08 2021-06-25 杭州不亦乐乎健康管理有限公司 Artificial intelligence-based infant emotion monitoring method and system
US10811000B2 (en) * 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
CN109165284B (en) * 2018-08-22 2020-06-16 重庆邮电大学 Financial field man-machine conversation intention identification method based on big data
CN109102805A (en) * 2018-09-20 2018-12-28 北京长城华冠汽车技术开发有限公司 Voice interactive method, device and realization device
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN111599342A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Tone selecting method and system
CN110021308B (en) * 2019-05-16 2021-05-18 北京百度网讯科技有限公司 Speech emotion recognition method and device, computer equipment and storage medium
CN110379441B (en) * 2019-07-01 2020-07-17 特斯联(北京)科技有限公司 Voice service method and system based on countermeasure type artificial intelligence network
CN112530418A (en) * 2019-08-28 2021-03-19 北京声智科技有限公司 Voice wake-up method, device and related equipment
CN110600042B (en) * 2019-10-10 2020-10-23 公安部第三研究所 Method and system for recognizing gender of disguised voice speaker
CN111179915A (en) * 2019-12-30 2020-05-19 苏州思必驰信息科技有限公司 Age identification method and device based on voice
CN111772422A (en) * 2020-06-12 2020-10-16 广州城建职业学院 Intelligent crib
CN113143570B (en) * 2021-04-27 2023-08-11 福州大学 Snore relieving pillow with multiple sensors integrated with feedback adjustment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1107227A2 (en) * 1999-11-30 2001-06-13 Sony Corporation Voice processing
JP2003345385A (en) * 2002-05-30 2003-12-03 Matsushita Electric Ind Co Ltd Voice recognition and discrimination device
CN1564245A (en) * 2004-04-20 2005-01-12 上海上悦通讯技术有限公司 Stunt method and device for baby's crying
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN101201980A (en) * 2007-12-19 2008-06-18 北京交通大学 Remote Chinese language teaching system based on voice affection identification
CN103117060A (en) * 2013-01-18 2013-05-22 中国科学院声学研究所 Modeling approach and modeling system of acoustic model used in speech recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008126627A1 (en) * 2007-03-26 2008-10-23 Nec Corporation Voice analysis device, voice classification method, and voice classification program
US8239196B1 (en) * 2011-07-28 2012-08-07 Google Inc. System and method for multi-channel multi-feature speech/noise classification for noise suppression
CN103546503B (en) * 2012-07-10 2017-03-15 百度在线网络技术(北京)有限公司 Voice-based cloud social intercourse system, method and cloud analysis server

Also Published As

Publication number Publication date
CN105761720A (en) 2016-07-13

Similar Documents

Publication Publication Date Title
CN105761720B (en) Interactive system and method based on voice attribute classification
Shahin et al. Emotion recognition using hybrid Gaussian mixture model and deep neural network
Basu et al. A review on emotion recognition using speech
Venkataramanan et al. Emotion recognition from speech
Schuller et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture
Friedland et al. Prosodic and other long-term features for speaker diarization
Mannepalli et al. Emotion recognition in speech signals using optimization based multi-SVNN classifier
Tong et al. A comparative study of robustness of deep learning approaches for VAD
JP4220449B2 (en) Indexing device, indexing method, and indexing program
CN105023573A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
JP6908045B2 (en) Speech processing equipment, audio processing methods, and programs
CN110827857B (en) Speech emotion recognition method based on spectral features and ELM
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
Ryant et al. Highly accurate mandarin tone classification in the absence of pitch information
Tsenov et al. Speech recognition using neural networks
Archana et al. Gender identification and performance analysis of speech signals
Khan et al. Quranic reciter recognition: a machine learning approach
WO2021171956A1 (en) Speaker identification device, speaker identification method, and program
Alshamsi et al. Automated speech emotion recognition on smart phones
Gomes et al. i-vector algorithm with Gaussian Mixture Model for efficient speech emotion recognition
Hadjadji et al. Emotion recognition in Arabic speech
Nalini et al. Emotion recognition in music signal using AANN and SVM
Palo et al. Emotion Analysis from Speech of Different Age Groups.
Patil et al. Emotion detection from speech using Mfcc & GMM
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant