CN105761720A - Interaction system based on voice attribute classification, and method thereof - Google Patents
- Publication number
- CN105761720A (publication); application CN201610244968.8A
- Authority
- CN
- China
- Prior art keywords
- voice
- signal
- classification
- acoustic features
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The invention discloses an interaction system based on voice attribute classification, and a method thereof. The interaction system includes an acoustic feature extraction unit, a voice attribute classification unit and an interaction decision making unit, wherein the acoustic feature extraction unit is configured for extracting the acoustic features of an input voice signal so as to generate a first signal; the voice attribute classification unit is configured for determining the voice attribute value of the first signal through an attribute identification and classification device, and outputting the voice attribute result so as to generate a second signal; the interaction decision making unit is configured for outputting the feedback information based on the second signal; and the voice attribute classification unit can detect various voice attributes at the same time, and can output the corresponding feedback information according to each voice attribute value to enable the interaction process to be rich and colorful.
Description
Technical field
The disclosure relates generally to the field of interaction, specifically to human-computer interaction technology, and particularly to interactive systems based on voice attributes.
Background art
Conventional man-machine voice interaction consists of a machine recognizing a spoken command from a person and then reacting according to the recognition result. The content of such interaction is limited to the literal meaning of the voice command; its form is monotonous, the user experience is dull, and it is unsuitable for toys, household devices, and other scenarios that call for varied and lively forms of interaction.
At present, human-machine interaction often uses voiceprint registration technology to determine user identity and personalize the interaction. In voiceprint registration, the user's voice is first enrolled with voiceprint recognition technology, associating the user's identity with a voiceprint; in use, the speaker's voiceprint is identified first, the speaker's identity is then inferred from the voiceprint, and a few limited interaction changes are made according to that identity. For example, some intelligent toys can judge from the voice whether the current speaker is the father, the mother, or the baby, and change how the speaker is addressed according to that identity.
The prior art has two drawbacks. On the one hand, conventional techniques usually detect only one voice attribute, so the variation in interaction content driven by differences in that attribute is extremely limited. On the other hand, voiceprint registration technology is cumbersome and inflexible to use.
Summary of the invention
In view of the above drawbacks and deficiencies of the prior art, it is desirable to provide an interactive system and method based on voice attribute classification.
In a first aspect, an interactive system based on voice attribute classification is proposed, the system comprising:
an acoustic feature extraction unit, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit, configured to determine the voice attribute values of the first signal through attribute recognition classifiers and output the voice attribute results, generating a second signal;
an interactive decision-making unit, configured to output feedback information based on the second signal.
In a second aspect, an interactive method based on voice attribute classification is provided, the method comprising:
extracting acoustic features from an input voice signal to generate a first signal;
determining the voice attribute values of the first signal through attribute recognition classification and outputting the voice attribute results, generating a second signal;
outputting feedback information based on the second signal.
According to the technical scheme provided by the embodiments of the present application, the voice attribute classification unit can detect multiple voice attributes simultaneously and output corresponding feedback information for each attribute value, making the interaction flow rich and varied. Moreover, because the invention classifies voice attributes, the identity of the speaker can be judged automatically, so no registration procedure is needed; the system is convenient, free, and flexible to use.
Description of the drawings
Other features, objects, and advantages will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 is a structural diagram of an interactive system based on voice attribute classification according to an embodiment.
Fig. 2 is a flow chart of an interactive method based on voice attribute classification.
Detailed description of the invention
The application is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. Note also that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments in this application and the features within the embodiments may be combined with one another.
In voice interaction, besides recognizing the textual content of a spoken command, other attributes of the voice can also be identified and used to enrich the form and content of the interaction. These voice attributes include the speaker's age range, gender, emotion, degree of health, and so on. Age and gender are reflected in the fundamental frequency and timbre of the voice; emotion is reflected in stress, intonation, speaking rate, and pauses; the degree of health is reflected in phenomena such as whether the voice is hoarse, whether it is accompanied by coughing, and whether it is nasal. The same voice attribute shows the same distribution pattern across different speakers' voice signals: for example, male voices have a lower fundamental frequency, with spectral energy concentrated mostly in the low-frequency region, while female voices have a higher fundamental frequency, with spectral energy concentrated mostly in the high-frequency region. Based on these characteristics of the voice, one can collect a large amount of speech data sharing the same attribute, extract labeled data that reflect the attribute, and train an attribute recognition classifier to classify it. For multiple voice attributes, multiple attribute recognition classifiers can be trained, each performing its own classification decision. After a series of attribute values has been obtained for an utterance, interactive feedback information is output according to the decision rules set for the specific interaction scenario.
The invention can be applied, for example, to a song-request scenario: if the speaker's emotion is identified as sad, some cheerful songs can be recommended; if the speaker's mood is irritable, some gentle songs can be recommended.
The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Referring to Fig. 1, a structural diagram of one embodiment of an interactive system based on voice attribute classification is provided. The system includes:
an acoustic feature extraction unit 10, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit 20, configured to classify the first signal by voice attributes and output the voice attribute results, generating a second signal;
an interactive decision-making unit 30, configured to judge the interaction type based on the second signal and output feedback information. The acoustic feature extraction unit 10 also includes a front-end processing unit, configured to digitize and pre-process the input voice signal and perform speech endpoint detection. The front-end processing unit is mainly responsible for obtaining an effective voice signal, reducing the interference and extra computation brought by silence and noise.
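As an illustration, the front-end unit's speech endpoint detection can be sketched with a simple short-time-energy threshold. The function name, frame sizes, and threshold ratio below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, hop=128, threshold_ratio=0.1):
    """Crude energy-based speech endpoint detection.

    Returns (start, end) sample indices of the active region, or None
    if no frame exceeds the energy threshold.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    threshold = threshold_ratio * energies.max()
    active = np.where(energies >= threshold)[0]
    if active.size == 0:
        return None
    start = active[0] * hop          # first active frame's start sample
    end = active[-1] * hop + frame_len  # last active frame's end sample
    return start, end
```

A real front end would typically add zero-crossing-rate checks and hangover smoothing; a fixed energy-ratio threshold is the simplest possible variant.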
The acoustic feature extraction unit 10 extracts a series of acoustic features that reflect voice attributes. The main extracted features are:
Fundamental frequency: pitch refers to the periodicity of voiced sounds caused by vocal cord vibration, and the fundamental frequency is the frequency of that vibration. Pitch is one of the most important parameters of a voice signal and embodies information contained in the voice such as emotion, age, and gender. Because the voice signal is non-stationary and aperiodic, and the pitch period varies over a very wide range, accurate detection of the fundamental frequency is difficult. This embodiment uses the cepstrum method to detect the fundamental frequency.
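The cepstrum method mentioned above can be sketched as follows: the log spectrum of a voiced frame is periodic in frequency at the harmonic spacing, so its inverse FFT (the cepstrum) peaks at the quefrency equal to the pitch period. The search range and windowing below are conventional assumptions:

```python
import numpy as np

def cepstral_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one voiced frame via the real cepstrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    # restrict the peak search to quefrencies of plausible pitch periods
    qmin = int(sr / fmax)
    qmax = int(sr / fmin)
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return sr / peak
```

In practice a voicing decision (e.g. a threshold on the cepstral peak height) would precede this, since unvoiced frames have no meaningful pitch.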
MFCC (mel-frequency cepstral coefficients): spectral features are short-time features. To exploit the characteristics of the human auditory system when extracting them, the spectrum of the voice signal is usually passed through a bank of band filters whose center frequencies follow a perceptual scale, and spectral features are then extracted from the filtered signals. This embodiment adopts mel-frequency cepstral coefficient (MFCC) features.
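A minimal from-scratch MFCC sketch of the pipeline just described (power spectrum, triangular mel filterbank, log, DCT); the filter count, coefficient count, and window are conventional defaults, not values given in the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCC of one frame: power spectrum -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    log_energy = np.log(fbank @ power + 1e-10)
    # type-II DCT decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return dct @ log_energy
```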
Formants: while speaking, the vocal tract changes continuously to keep the speech clear, and its configuration is also affected by the speaker's emotional state. During phonation the vocal tract acts as a resonator: when the voiced excitation enters the vocal tract, resonances arise and produce a set of resonant frequencies, the so-called formant frequencies, or formants for short, which depend on the shape and physical characteristics of the vocal tract. Different vowels correspond to different formant parameters; using more formants describes the voice better, and in practical applications the first three are generally collected.
The above three are the basic acoustic features used in the invention, and voice attribute classification according to the invention can be realized on the basis of these features. To achieve better results, the following acoustic features of the speaker can additionally be extracted:
Short-time energy: the energy of the voice signal reflects its intensity and correlates strongly and directly with emotional information. Short-time energy is calculated in the time domain as the sum of squared signal amplitudes over one frame of speech.
Pitch jitter and shimmer: jitter refers to the fluctuation of the fundamental frequency between consecutive periods, i.e. the change in fundamental frequency between two successive frames of the voice signal. Shimmer refers to the fluctuation of energy between consecutive periods, i.e. the change in short-time energy between two adjacent frames of the voice signal.
Harmonic-to-noise ratio: as the name suggests, the ratio of the harmonic component to the noise component in the voice signal; it reflects emotional change to a certain extent.
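The supplementary features above can be computed directly from frame-level quantities. A minimal sketch, with jitter and shimmer expressed as mean relative frame-to-frame variation (one common convention; the patent does not fix the exact formula):

```python
import numpy as np

def short_time_energy(frames):
    """Per-frame energy: sum of squared amplitudes of each frame."""
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)

def jitter(f0_track):
    """Mean absolute F0 change between consecutive frames, relative to mean F0."""
    f0 = np.asarray(f0_track, dtype=float)
    return np.mean(np.abs(np.diff(f0))) / np.mean(f0)

def shimmer(energy_track):
    """Mean absolute energy change between consecutive frames, relative to mean energy."""
    e = np.asarray(energy_track, dtype=float)
    return np.mean(np.abs(np.diff(e))) / np.mean(e)
```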
The voice attribute classification unit 20 sets up at least one attribute recognition classifier according to the selected voice attributes. Each attribute recognition classifier adopts pattern recognition technology: the acoustic features extracted above are input into the classifier, and the classifier outputs the attribute detection result. This embodiment selects eight voice attributes as classification targets, detecting gender, age, emotion, and health attributes respectively, as follows:
Gender attribute:
First voice attribute: detects male voice or female voice;
Age attribute:
Second voice attribute: detects child or adult;
Emotion attribute:
Third voice attribute: detects whether the speaker is angry;
Fourth voice attribute: detects whether the speaker is sad;
Fifth voice attribute: detects whether the speaker is cheerful;
Health attribute:
Sixth voice attribute: detects whether the speaker is coughing;
Seventh voice attribute: detects whether the voice is nasal;
Eighth voice attribute: detects whether the voice is hoarse;
An attribute recognition classifier has two modes of operation: training mode and test mode. In training mode, the classifier learns the latent features and regularities in the data samples: a large number of data samples are collected, each sample is manually labeled with the voice attribute class it belongs to, the samples and their corresponding labels are input into the classifier, and a training algorithm adjusts the model parameters. After training is complete, the characteristics of the different classes are all reflected in the classifier's model parameters, which can then be used to test new data. In test mode, the classifier directly classifies newly collected data according to the rules learned earlier and outputs the classification result; no manual labeling step is needed.
In this embodiment, multiple voice attributes must be detected, so a separate attribute recognition classifier is trained for each voice attribute. Each classifier outputs two classes, identifying the probabilities that the attribute is "positive" and "negative". For example, male/female is output for the gender attribute, child/adult for the age attribute, yes/no for the health attributes, and so on.
Several algorithms can serve as the attribute recognition classifier, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN), and deep neural networks (DNN); this embodiment selects deep neural networks to form the attribute classification unit.
In this embodiment, eight independent attribute recognition classifiers are designed for the eight voice attributes, each classifier adopting a deep neural network algorithm.
Each attribute recognition DNN adopts the same structure: the input layer contains 51 nodes, corresponding to the acoustic features described above; there are 4 hidden layers, each with 512 nodes; the output layer contains two nodes, corresponding to the "positive" and "negative" classes of the voice attribute. Hidden-layer nodes use the sigmoid activation function, output-layer nodes use the softmax function, and adjacent layers are fully connected. The connection weights w are free parameters that must be obtained by training.
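The described 51-512-512-512-512-2 network with sigmoid hidden units and a softmax output can be sketched as a plain forward pass. The initialization below is an arbitrary placeholder for trained weights, not part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dnn_forward(features, weights, biases):
    """Forward pass of one attribute classifier: 51 -> 512 x 4 -> 2.

    Hidden layers use sigmoid; the output layer is a softmax over the
    positive/negative classes of the attribute.
    """
    h = np.asarray(features, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)
    logits = weights[-1] @ h + biases[-1]
    return softmax(logits)

def init_dnn(rng, sizes=(51, 512, 512, 512, 512, 2)):
    """Random (untrained) fully connected parameters matching the text's layer sizes."""
    weights = [rng.standard_normal((m, n)) * 0.01
               for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(m) for m in sizes[1:]]
    return weights, biases
```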
Training adopts a two-stage method, as follows:
1) Pre-training: unsupervised restricted Boltzmann machines (RBM) are used to initialize the weights of each DNN layer, layer by layer.
A restricted Boltzmann machine (RBM) is a generative model. Inspired by the energy functionals of statistical mechanics, the RBM introduces an energy function to describe the probability distribution of the data. The energy function is a measure of the state of the whole system: the more ordered the system, or the more concentrated its probability distribution, the lower the system's energy; conversely, the more disordered the system, or the closer its distribution to uniform, the higher the energy. The minimum of the energy function corresponds to the most stable state of the system. An RBM comprises two layers of nodes: a visible layer and a hidden layer. Typically the visible layer receives the raw data and the hidden layer outputs the learned features. Here "restricted" means that nodes within the same layer are not connected to each other, while nodes in different layers are fully interconnected. Suppose the visible-layer variables are v and the hidden-layer variables are h. When both layers follow Bernoulli distributions, their joint probability distribution p(v, h) can be defined through the energy function E(v, h):
E(v,h) = -\sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (1)
where w_{ij} denotes the connection weight between visible node i and hidden node j, and the vectors a and b denote the biases of the visible and hidden layers respectively. The joint distribution follows from the energy function, with Z the normalization coefficient in formula (2):

p(v,h) = \frac{1}{Z} e^{-E(v,h)}, \quad Z = \sum_{v,h} e^{-E(v,h)}    (2)

Taking the marginal of the joint probability p(v, h) over the hidden variable h yields the observation likelihood p(v) of the data, as shown in formula (3):

p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}    (3)
By maximum likelihood estimation, the criterion function of RBM training is:

w^* = \arg\max_w \sum_n \log p(v^{(n)})    (4)

where w denotes the weight parameters and n indexes the n-th training sample. Applying gradient descent to optimize formula (4) yields the weight update:
\Delta w_{ij} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}    (5)
Here \langle \cdot \rangle denotes the expectation of the enclosed variable. The first term is an expectation over the given sample data, while the second term is an expectation under the model itself, which is not directly available. The typical approach obtains it by Gibbs sampling, and a fast algorithm called contrastive divergence (CD) can solve formula (5) efficiently. The trained RBM weights are used to initialize the DNN: RBMs are trained layer by layer, the hidden-layer output of a lower RBM serving as the visible layer of the next RBM, stacking upward until the set number of DNN layers is reached.
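A minimal CD-1 update for a Bernoulli-Bernoulli RBM, implementing the gradient of formula (5) with one Gibbs step. The learning rate and the use of probabilities (rather than binary samples) in the negative-phase statistics are common conventions, not details specified in the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, rng, lr=0.1):
    """One contrastive-divergence (CD-1) step for a Bernoulli RBM.

    v0: batch of binary visible vectors, shape (batch, n_visible).
    Returns updated (W, a, b), following
    Delta w_ij = <v_i h_j>_data - <v_i h_j>_model.
    """
    # positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one step of Gibbs sampling
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - pv1.T @ ph1) / batch
    da = (v0 - pv1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db
```

Stacking works as the text describes: after training one RBM, its hidden probabilities `sigmoid(v @ W + b)` become the visible data for the next RBM.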
2) Fine-tuning: the error back-propagation (EBP) algorithm is used to adjust the initialized network parameters; that is, the fine-tuning stage adopts the training method of error back-propagation.
The acoustic features of each speech frame are independently input into the eight DNNs, producing eight voice attribute outputs that identify the probability of each attribute. The mean of the probability outputs over all speech frames of an utterance, computed according to formula (6) below, serves as the final probability of that attribute for the utterance, i.e. the classification result and the second signal:

P_{k,pos} = \frac{1}{N} \sum_{n=1}^{N} P_{kn,pos}    (6)
where k is the voice attribute index (in this example, k ranges from 1 to 8); N is the number of frames in the speech segment; P_{kn,pos} is the probability that attribute k is positive at frame n; and P_{k,pos} is the mean probability over the N frames that attribute k is positive, i.e. the "positive" output of the DNN.
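Formula (6) is simply a per-attribute average of the frame-level "positive" probabilities; a direct sketch:

```python
import numpy as np

def segment_probability(frame_probs):
    """Average per-frame positive-class probabilities over all N frames
    of a segment, per formula (6).

    frame_probs: array of shape (N, K), where element [n, k] is the
    positive probability of attribute k at frame n.
    Returns a length-K vector of segment-level scores.
    """
    return np.asarray(frame_probs, dtype=float).mean(axis=0)
```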
The interactive decision-making unit 30 takes the second signal as input, makes the decision about the interaction content, and outputs feedback information. This embodiment uses a binary tree to define the decision rules. At each node of the binary tree, a threshold is set for the probability of a certain attribute: if the probability exceeds the threshold, the left child node is taken, otherwise the right child node, until a leaf node is reached and the decision result obtained.
Different decision binary trees can be designed for different scenarios. For example, under the song-request recommendation scenario, the following judgments can be made from the voice attributes: first judge whether the speaker is a child, and if so, select a child's voice as the response voice; then judge whether the voice is male or female, and for a male voice select a young girl's voice as the response voice; then judge whether the speaker is angry, and if not, continue to judge whether the speaker is sad; if sad, further determine whether there is coughing. If there is coughing, the speaker's basic condition can be judged to be "a little boy with a cold whose spirits are low", in which case some more cheerful nursery rhymes, such as the "Health Song", can be recommended. Depending on the application scenario, the feedback information can be audio, video, or text.
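The threshold-per-node binary decision tree can be sketched as follows. The tree below is a simplified, hypothetical version of the song-request example (it omits the gender branch); the thresholds and leaf labels are illustrative assumptions:

```python
class DecisionNode:
    """Internal node: compare one attribute's probability to a threshold.

    If the probability exceeds the threshold, follow `left`; otherwise
    `right` (as described in the text). Leaves are plain strings naming
    the feedback to output.
    """
    def __init__(self, attribute, threshold, left, right):
        self.attribute = attribute
        self.threshold = threshold
        self.left = left
        self.right = right

def decide(node, probs):
    """Walk the tree with a dict of attribute -> positive probability."""
    while isinstance(node, DecisionNode):
        if probs[node.attribute] > node.threshold:
            node = node.left
        else:
            node = node.right
    return node

# hypothetical song-recommendation tree loosely following the example above
tree = DecisionNode("child", 0.5,
                    DecisionNode("sad", 0.5,
                                 DecisionNode("cough", 0.5,
                                              "cheerful nursery rhyme",
                                              "gentle children's song"),
                                 "child voice response"),
                    "adult response")
```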
Referring to Fig. 2, a flow chart of an interactive method based on voice attribute classification is provided.
First, the acoustic features of the input voice signal are extracted to generate the first signal (step 100). The main acoustic feature information extracted comprises the fundamental frequency, MFCC, and formant signals of the voice. In addition, to increase classification accuracy, this step further extracts signals such as the short-time energy, pitch jitter, and harmonic-to-noise ratio on the above basis.
Next, the voice attribute values of the first signal are determined by pattern recognition classifiers trained in advance on a large amount of labeled data, generating the second signal (step 200). This step adopts attribute recognition classifiers trained on a large amount of acoustic feature data to identify the probability that a given voice attribute is present. The classifier can be chosen from several algorithms, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN), and deep neural networks (DNN). This embodiment selects deep neural networks for voice attribute classification, designing eight independent DNNs for the eight attributes of the voice.
Finally, feedback information is output based on the second signal (step 300). The invention uses a binary tree to define the decision rules: at each node of the binary tree a threshold is set for a certain voice attribute value; if the value exceeds the threshold, the left child node is taken, otherwise the right child node, until a leaf node is reached, the decision result is obtained, and the feedback information is output.
Step 100 also includes digitization pre-processing and speech endpoint detection of the voice signal; this process extracts an effective voice signal and reduces the interference and extra computation brought by silence and noise.
It should be noted that although the operations of the method of the invention are depicted in the drawings in a particular order, this neither requires nor implies that the operations must be performed in that particular order, or that all the operations shown must be performed, to achieve the desired result. On the contrary, some steps may additionally or alternatively be omitted, multiple steps may be merged into one step, and/or one step may be decomposed into multiple steps. For example, the acoustic feature extraction step and the voice attribute classification step can be merged into a single step and performed together.
Having the beneficial effects that of the present embodiment, extract one group of acoustic features to classify 8 kinds of voice attributes simultaneously, the feature of voice signal and potential information are excavated ground more abundant, the classification of speaker is more careful, therefore can make and have more cross reaction targetedly, obtain better Consumer's Experience.And, it is not necessary to user registers, but makes up the disappearance of log-on message with more phonetic feature so that with convenient flexibly.
Especially, according to embodiment of the disclosure, may be implemented as computer software programs above with reference to Fig. 2 method described.Such as, embodiment of the disclosure and include a kind of computer program, it includes the computer program being tangibly embodied on machine readable media, and described computer program comprises the program code of the method for performing Fig. 2.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The above description is merely a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art will appreciate that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features having similar functions disclosed herein.
Claims (14)
1. An interactive system based on voice attribute classification, the system comprising:
an acoustic feature extraction unit, configured to extract acoustic features of an input voice signal and generate a first signal;
a voice attribute classification unit, configured to pass the first signal through an attribute classifier to determine its voice attribute values, output the voice attribute results, and generate a second signal;
an interactive decision unit, configured to output feedback information based on the second signal.
2. The system according to claim 1, wherein the acoustic feature extraction unit includes a front-end processing unit configured to perform digital pre-processing and speech endpoint detection on the input voice signal.
3. The system according to claim 1, wherein the acoustic feature extraction unit is configured to extract the fundamental frequency of the voice, mel-frequency cepstral coefficients (MFCC), and formants.
4. The system according to claim 3, wherein the acoustic features extracted by the acoustic feature extraction unit further include at least one of: short-time energy features, pitch jitter and shimmer, and harmonic-to-noise ratio.
5. The system according to claim 1, wherein the voice attribute classification unit includes at least one of the following attribute recognition classifiers: a gender attribute recognition classifier, an age attribute recognition classifier, an emotion attribute recognition classifier, and a health attribute recognition classifier.
6. The system according to claim 1, wherein the attribute recognition classifier adopts a deep neural network (DNN) algorithm.
7. The system according to claim 6, wherein the operating modes of the attribute recognition classifier are divided into a training mode and a test mode, wherein the training mode adopts two-stage training comprising a pre-training stage and a fine-tuning stage, the pre-training stage using an unsupervised restricted Boltzmann machine model and the fine-tuning stage using an error back-propagation algorithm.
8. An interactive method based on voice attribute classification, the method comprising:
extracting acoustic features of an input voice signal and generating a first signal;
passing the first signal through attribute recognition classification to determine its voice attribute values, outputting the voice attribute results, and generating a second signal;
outputting feedback information based on the second signal.
9. The method according to claim 8, wherein extracting the acoustic features of the input voice signal includes front-end processing, the front-end processing performing digital pre-processing and speech endpoint detection on the input voice signal.
10. The method according to claim 8, wherein the acoustic features include the fundamental frequency of the voice, mel-frequency cepstral coefficients (MFCC), and formants.
11. The method according to claim 10, wherein the acoustic features further include at least one of: short-time energy features, pitch jitter and shimmer, and harmonic-to-noise ratio.
12. The method according to claim 8, wherein the second signal results from voice attribute classification including at least one of the following attribute recognition classifications: gender attribute recognition classification, age attribute recognition classification, emotion attribute recognition classification, and health attribute recognition classification.
13. The method according to claim 8, wherein the attribute recognition classification adopts a deep neural network (DNN) algorithm.
14. The method according to claim 13, wherein the operating modes of the attribute recognition classification are divided into a training mode and a test mode, wherein the training mode adopts two-stage training comprising a pre-training stage and a fine-tuning stage, the pre-training stage using an unsupervised restricted Boltzmann machine model and the fine-tuning stage using an error back-propagation algorithm.
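Claims 7 and 14 describe two-stage training: unsupervised restricted-Boltzmann-machine pre-training followed by back-propagation fine-tuning. The numpy sketch below illustrates that scheme for a single hidden layer; the toy data, layer sizes, and all hyper-parameters are purely illustrative and do not come from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- Stage 1: unsupervised RBM pre-training (contrastive divergence, CD-1) ---
def pretrain_rbm(data, n_hidden, epochs=50, lr=0.1):
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        h_prob = sigmoid(data @ W)                         # positive phase
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T)                  # one Gibbs step
        h_recon = sigmoid(v_recon @ W)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

# --- Stage 2: supervised fine-tuning with error back-propagation ---
def finetune(data, labels, W, epochs=200, lr=0.5):
    V = 0.01 * rng.standard_normal((W.shape[1], labels.shape[1]))
    for _ in range(epochs):
        h = sigmoid(data @ W)
        out = sigmoid(h @ V)
        err = out - labels                  # output-layer error
        dh = (err @ V.T) * h * (1 - h)      # error back-propagated to hidden layer
        V -= lr * h.T @ err / len(data)
        W -= lr * data.T @ dh / len(data)
    return W, V

# Toy binary "feature" data; the target attribute mirrors the first feature.
X = (rng.random((64, 8)) < 0.5).astype(float)
y = X[:, :1]
W = pretrain_rbm(X, n_hidden=4)             # pre-training stage
W, V = finetune(X, y, W)                    # fine-tuning stage
pred = sigmoid(sigmoid(X @ W) @ V) > 0.5
print("training accuracy:", (pred == y).mean())
```

A production system would of course stack several RBM layers before fine-tuning the whole network, as is standard for DNN pre-training.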
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610244968.8A CN105761720B (en) | 2016-04-19 | 2016-04-19 | Interactive system and method based on voice attribute classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610244968.8A CN105761720B (en) | 2016-04-19 | 2016-04-19 | Interactive system and method based on voice attribute classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105761720A true CN105761720A (en) | 2016-07-13 |
CN105761720B CN105761720B (en) | 2020-01-07 |
Family
ID=56324445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610244968.8A Active CN105761720B (en) | 2016-04-19 | 2016-04-19 | Interactive system and method based on voice attribute classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105761720B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1107227A2 (en) * | 1999-11-30 | 2001-06-13 | Sony Corporation | Voice processing |
JP2003345385A (en) * | 2002-05-30 | 2003-12-03 | Matsushita Electric Ind Co Ltd | Voice recognition and discrimination device |
CN1564245A (en) * | 2004-04-20 | 2005-01-12 | 上海上悦通讯技术有限公司 | Stunt method and device for baby's crying |
CN1975856A (en) * | 2006-10-30 | 2007-06-06 | 邹采荣 | Speech emotion identifying method based on supporting vector machine |
CN101201980A (en) * | 2007-12-19 | 2008-06-18 | 北京交通大学 | Remote Chinese language teaching system based on voice affection identification |
US20100138223A1 (en) * | 2007-03-26 | 2010-06-03 | Takafumi Koshinaka | Speech classification apparatus, speech classification method, and speech classification program |
US8239194B1 (en) * | 2011-07-28 | 2012-08-07 | Google Inc. | System and method for multi-channel multi-feature speech/noise classification for noise suppression |
CN103117060A (en) * | 2013-01-18 | 2013-05-22 | 中国科学院声学研究所 | Modeling approach and modeling system of acoustic model used in speech recognition |
CN103546503A (en) * | 2012-07-10 | 2014-01-29 | 百度在线网络技术(北京)有限公司 | Voice-based cloud social system, voice-based cloud social method and cloud analysis server |
- 2016-04-19: Application CN201610244968.8A (CN) granted as patent CN105761720B (status: Active)
Non-Patent Citations (1)
Title |
---|
怀进鹏 (Huai Jinpeng): "Advances in Intelligent Computer Research: Proceedings of the 863 Program Intelligent Computer Theme Academic Conference", 31 March 2001 *
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106686267A (en) * | 2015-11-10 | 2017-05-17 | 中国移动通信集团公司 | Method and system for implementing personalized voice service |
CN107886955A (en) * | 2016-09-29 | 2018-04-06 | 百度在线网络技术(北京)有限公司 | A kind of personal identification method, device and the equipment of voice conversation sample |
WO2018132187A1 (en) * | 2017-01-12 | 2018-07-19 | Qualcomm Incorporated | Characteristic-based speech codebook selection |
US10878831B2 (en) | 2017-01-12 | 2020-12-29 | Qualcomm Incorporated | Characteristic-based speech codebook selection |
CN106898355A (en) * | 2017-01-17 | 2017-06-27 | 清华大学 | A kind of method for distinguishing speek person based on two modelings |
CN106898355B (en) * | 2017-01-17 | 2020-04-14 | 北京华控智加科技有限公司 | Speaker identification method based on secondary modeling |
CN107316635A (en) * | 2017-05-19 | 2017-11-03 | 科大讯飞股份有限公司 | Audio recognition method and device, storage medium, electronic equipment |
CN108701469B (en) * | 2017-07-31 | 2023-06-20 | 深圳和而泰智能控制股份有限公司 | Cough sound recognition method, device, and storage medium |
CN108701469A (en) * | 2017-07-31 | 2018-10-23 | 深圳和而泰智能家居科技有限公司 | Cough sound recognition methods, equipment and storage medium |
CN107680599A (en) * | 2017-09-28 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | User property recognition methods, device and electronic equipment |
CN108132995A (en) * | 2017-12-20 | 2018-06-08 | 北京百度网讯科技有限公司 | For handling the method and apparatus of audio-frequency information |
CN107995370A (en) * | 2017-12-21 | 2018-05-04 | 广东欧珀移动通信有限公司 | Call control method, device and storage medium and mobile terminal |
CN108109622A (en) * | 2017-12-28 | 2018-06-01 | 武汉蛋玩科技有限公司 | A kind of early education robot voice interactive education system and method |
CN108186033A (en) * | 2018-01-08 | 2018-06-22 | 杭州草莽科技有限公司 | A kind of child's mood monitoring method and its system based on artificial intelligence |
CN111989742A (en) * | 2018-04-13 | 2020-11-24 | 三菱电机株式会社 | Speech recognition system and method for using speech recognition system |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109102805A (en) * | 2018-09-20 | 2018-12-28 | 北京长城华冠汽车技术开发有限公司 | Voice interactive method, device and realization device |
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN111599342A (en) * | 2019-02-21 | 2020-08-28 | 北京京东尚科信息技术有限公司 | Tone selecting method and system |
CN110021308A (en) * | 2019-05-16 | 2019-07-16 | 北京百度网讯科技有限公司 | Voice mood recognition methods, device, computer equipment and storage medium |
CN110379441A (en) * | 2019-07-01 | 2019-10-25 | 特斯联(北京)科技有限公司 | A kind of voice service method and system based on countering type smart network |
CN112530418A (en) * | 2019-08-28 | 2021-03-19 | 北京声智科技有限公司 | Voice wake-up method, device and related equipment |
CN110600042B (en) * | 2019-10-10 | 2020-10-23 | 公安部第三研究所 | Method and system for recognizing gender of disguised voice speaker |
CN110600042A (en) * | 2019-10-10 | 2019-12-20 | 公安部第三研究所 | Method and system for recognizing gender of disguised voice speaker |
CN111179915A (en) * | 2019-12-30 | 2020-05-19 | 苏州思必驰信息科技有限公司 | Age identification method and device based on voice |
CN111772422A (en) * | 2020-06-12 | 2020-10-16 | 广州城建职业学院 | Intelligent crib |
CN113143570A (en) * | 2021-04-27 | 2021-07-23 | 福州大学 | Multi-sensor fusion feedback adjustment snore stopping pillow |
CN113143570B (en) * | 2021-04-27 | 2023-08-11 | 福州大学 | Snore relieving pillow with multiple sensors integrated with feedback adjustment |
Also Published As
Publication number | Publication date |
---|---|
CN105761720B (en) | 2020-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105761720A (en) | Interaction system based on voice attribute classification, and method thereof | |
CN109243494B (en) | Children emotion recognition method based on multi-attention mechanism long-time memory network | |
Schuller et al. | Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture | |
Schuller et al. | Speaker independent speech emotion recognition by ensemble classification | |
Tong et al. | A comparative study of robustness of deep learning approaches for VAD | |
CN112581979B (en) | Speech emotion recognition method based on spectrogram | |
Joshy et al. | Automated dysarthria severity classification: A study on acoustic features and deep learning techniques | |
Ghai et al. | Emotion recognition on speech signals using machine learning | |
Samantaray et al. | A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages | |
Cnn | Speech emotion recognition using convolutional neural network (CNN) | |
CN111899766B (en) | Speech emotion recognition method based on optimization fusion of depth features and acoustic features | |
CN110085216A (en) | A kind of vagitus detection method and device | |
CN111916066A (en) | Random forest based voice tone recognition method and system | |
Caihua | Research on multi-modal mandarin speech emotion recognition based on SVM | |
Přibil et al. | GMM-based speaker age and gender classification in Czech and Slovak | |
Khan et al. | Quranic reciter recognition: a machine learning approach | |
Cao et al. | Speaker-independent speech emotion recognition based on random forest feature selection algorithm | |
Praksah et al. | Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier | |
CN111081273A (en) | Voice emotion recognition method based on glottal wave signal feature extraction | |
Watrous | Phoneme discrimination using connectionist networks | |
Ling | An acoustic model for English speech recognition based on deep learning | |
CN108899046A (en) | A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification | |
Gomes et al. | i-vector algorithm with Gaussian Mixture Model for efficient speech emotion recognition | |
Alshamsi et al. | Automated speech emotion recognition on smart phones | |
CN113571095A (en) | Speech emotion recognition method and system based on nested deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||