CN105761720A - Interaction system based on voice attribute classification, and method thereof - Google Patents
- Publication number
- CN105761720A (publication); application CN201610244968.8A
- Authority
- CN
- China
- Prior art keywords
- voice
- signal
- classification
- acoustic features
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/66—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225—Feedback of the input speech
Abstract
The invention discloses an interaction system based on voice attribute classification, and a method thereof. The interaction system includes an acoustic feature extraction unit, a voice attribute classification unit and an interaction decision making unit, wherein the acoustic feature extraction unit is configured for extracting the acoustic features of an input voice signal so as to generate a first signal; the voice attribute classification unit is configured for determining the voice attribute value of the first signal through an attribute identification and classification device, and outputting the voice attribute result so as to generate a second signal; the interaction decision making unit is configured for outputting the feedback information based on the second signal; and the voice attribute classification unit can detect various voice attributes at the same time, and can output the corresponding feedback information according to each voice attribute value to enable the interaction process to be rich and colorful.
Description
Technical field
The disclosure relates generally to the field of interaction, specifically to human-computer interaction technology, and particularly to interactive systems based on voice attributes.
Background art
Conventional man-machine voice interaction consists of a machine recognizing a spoken command from a person and then reacting according to the recognition result. The content of such interaction is limited to the literal meaning of the voice command; its form is monotonous, the user experience is dull, and it is unsuitable for toys, household devices, and other scenarios that call for varied and lively forms of interaction.
At present, human-machine interaction often uses voiceprint registration technology to determine user identity and personalize the interaction. In voiceprint registration, the user's voice is first enrolled with voiceprint recognition technology, associating the user's identity with a voiceprint; in use, the speaker's voiceprint is identified first, the speaker's identity is then inferred from the voiceprint, and a few limited interaction changes are made according to that identity. For example, some intelligent toys can judge from the voice whether the current speaker is the father, the mother, or the baby, and change how the speaker is addressed according to that identity.
The prior art has two drawbacks. On the one hand, conventional techniques usually detect only one voice attribute, so the variation in interaction content driven by differences in that attribute is extremely limited. On the other hand, voiceprint registration technology is cumbersome and inflexible to use.
Summary of the invention
In view of the above drawbacks and deficiencies of the prior art, it is desirable to provide an interactive system and method based on voice attribute classification.
In a first aspect, an interactive system based on voice attribute classification is proposed, the system comprising:
an acoustic feature extraction unit, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit, configured to determine the voice attribute values of the first signal through attribute recognition classifiers and output the voice attribute results, generating a second signal;
an interactive decision-making unit, configured to output feedback information based on the second signal.
In a second aspect, an interactive method based on voice attribute classification is provided, the method comprising:
extracting acoustic features from an input voice signal to generate a first signal;
determining the voice attribute values of the first signal through attribute recognition classification and outputting the voice attribute results, generating a second signal;
outputting feedback information based on the second signal.
According to the technical scheme provided by the embodiments of the present application, the voice attribute classification unit can detect multiple voice attributes simultaneously and output corresponding feedback information for each attribute value, making the interaction flow rich and varied. Moreover, because the invention classifies voice attributes, the identity of the speaker can be judged automatically, so no registration procedure is needed; the system is convenient, free, and flexible to use.
Description of the drawings
Other features, objects, and advantages will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 is a structural diagram of an interactive system based on voice attribute classification according to an embodiment.
Fig. 2 is a flow chart of an interactive method based on voice attribute classification.
Detailed description of the invention
The application is described in further detail below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. Note also that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments in this application and the features within the embodiments may be combined with one another.
In voice interaction, besides recognizing the textual content of a spoken command, other attributes of the voice can also be identified and used to enrich the form and content of the interaction. These voice attributes include the speaker's age range, gender, emotion, degree of health, and so on. Age and gender are reflected in the fundamental frequency and timbre of the voice; emotion is reflected in stress, intonation, speaking rate, and pauses; the degree of health is reflected in phenomena such as whether the voice is hoarse, whether it is accompanied by coughing, and whether it is nasal. The same voice attribute shows the same distribution pattern across different speakers' voice signals: for example, male voices have a lower fundamental frequency, with spectral energy concentrated mostly in the low-frequency region, while female voices have a higher fundamental frequency, with spectral energy concentrated mostly in the high-frequency region. Based on these characteristics of the voice, one can collect a large amount of speech data sharing the same attribute, extract labeled data that reflect the attribute, and train an attribute recognition classifier to classify it. For multiple voice attributes, multiple attribute recognition classifiers can be trained, each performing its own classification decision. After a series of attribute values has been obtained for an utterance, interactive feedback information is output according to the decision rules set for the specific interaction scenario.
The invention can be applied, for example, to a song-request scenario: if the speaker's emotion is identified as sad, some cheerful songs can be recommended; if the speaker's mood is irritable, some gentle songs can be recommended.
The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Referring to Fig. 1, a structural diagram of one embodiment of an interactive system based on voice attribute classification is provided. The system includes:
an acoustic feature extraction unit 10, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit 20, configured to classify the first signal by voice attributes and output the voice attribute results, generating a second signal;
an interactive decision-making unit 30, configured to judge the interaction type based on the second signal and output feedback information. The acoustic feature extraction unit 10 also includes a front-end processing unit, configured to digitize and pre-process the input voice signal and perform speech endpoint detection. The front-end processing unit is mainly responsible for obtaining an effective voice signal, reducing the interference and extra computation brought by silence and noise.
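As an illustration, the front-end unit's speech endpoint detection can be sketched with a simple short-time-energy threshold. The function name, frame sizes, and threshold ratio below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, hop=128, threshold_ratio=0.1):
    """Crude energy-based speech endpoint detection.

    Returns (start, end) sample indices of the active region, or None
    if no frame exceeds the energy threshold.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    threshold = threshold_ratio * energies.max()
    active = np.where(energies >= threshold)[0]
    if active.size == 0:
        return None
    start = active[0] * hop          # first active frame's start sample
    end = active[-1] * hop + frame_len  # last active frame's end sample
    return start, end
```

A real front end would typically add zero-crossing-rate checks and hangover smoothing; a fixed energy-ratio threshold is the simplest possible variant.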
The acoustic feature extraction unit 10 extracts a series of acoustic features that reflect voice attributes. The main extracted features are:
Fundamental frequency: pitch refers to the periodicity of voiced sounds caused by vocal cord vibration, and the fundamental frequency is the frequency of that vibration. Pitch is one of the most important parameters of a voice signal and embodies information contained in the voice such as emotion, age, and gender. Because the voice signal is non-stationary and aperiodic, and the pitch period varies over a very wide range, accurate detection of the fundamental frequency is difficult. This embodiment uses the cepstrum method to detect the fundamental frequency.
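The cepstrum method mentioned above can be sketched as follows: the log spectrum of a voiced frame is periodic in frequency at the harmonic spacing, so its inverse FFT (the cepstrum) peaks at the quefrency equal to the pitch period. The search range and windowing below are conventional assumptions:

```python
import numpy as np

def cepstral_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one voiced frame via the real cepstrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    # restrict the peak search to quefrencies of plausible pitch periods
    qmin = int(sr / fmax)
    qmax = int(sr / fmin)
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return sr / peak
```

In practice a voicing decision (e.g. a threshold on the cepstral peak height) would precede this, since unvoiced frames have no meaningful pitch.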
MFCC (mel-frequency cepstral coefficients): spectral features are short-time features. To exploit the characteristics of the human auditory system when extracting them, the spectrum of the voice signal is usually passed through a bank of band filters whose center frequencies follow a perceptual scale, and spectral features are then extracted from the filtered signals. This embodiment adopts mel-frequency cepstral coefficient (MFCC) features.
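A minimal from-scratch MFCC sketch of the pipeline just described (power spectrum, triangular mel filterbank, log, DCT); the filter count, coefficient count, and window are conventional defaults, not values given in the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCC of one frame: power spectrum -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    log_energy = np.log(fbank @ power + 1e-10)
    # type-II DCT decorrelates the log filterbank energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), n + 0.5) / n_filters)
    return dct @ log_energy
```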
Formants: while speaking, the vocal tract changes continuously to keep the speech clear, and its configuration is also affected by the speaker's emotional state. During phonation the vocal tract acts as a resonator: when the voiced excitation enters the vocal tract, resonances arise and produce a set of resonant frequencies, the so-called formant frequencies, or formants for short, which depend on the shape and physical characteristics of the vocal tract. Different vowels correspond to different formant parameters; using more formants describes the voice better, and in practical applications the first three are generally collected.
The above three are the basic acoustic features used in the invention, and voice attribute classification according to the invention can be realized on the basis of these features. To achieve better results, the following acoustic features of the speaker can additionally be extracted:
Short-time energy: the energy of the voice signal reflects its intensity and correlates strongly and directly with emotional information. Short-time energy is calculated in the time domain as the sum of squared signal amplitudes over one frame of speech.
Pitch jitter and shimmer: jitter refers to the fluctuation of the fundamental frequency between consecutive periods, i.e. the change in fundamental frequency between two successive frames of the voice signal. Shimmer refers to the fluctuation of energy between consecutive periods, i.e. the change in short-time energy between two adjacent frames of the voice signal.
Harmonic-to-noise ratio: as the name suggests, the ratio of the harmonic component to the noise component in the voice signal; it reflects emotional change to a certain extent.
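The supplementary features above can be computed directly from frame-level quantities. A minimal sketch, with jitter and shimmer expressed as mean relative frame-to-frame variation (one common convention; the patent does not fix the exact formula):

```python
import numpy as np

def short_time_energy(frames):
    """Per-frame energy: sum of squared amplitudes of each frame."""
    return np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)

def jitter(f0_track):
    """Mean absolute F0 change between consecutive frames, relative to mean F0."""
    f0 = np.asarray(f0_track, dtype=float)
    return np.mean(np.abs(np.diff(f0))) / np.mean(f0)

def shimmer(energy_track):
    """Mean absolute energy change between consecutive frames, relative to mean energy."""
    e = np.asarray(energy_track, dtype=float)
    return np.mean(np.abs(np.diff(e))) / np.mean(e)
```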
The voice attribute classification unit 20 sets up at least one attribute recognition classifier according to the selected voice attributes. Each attribute recognition classifier adopts pattern recognition technology: the acoustic features extracted above are input into the classifier, and the classifier outputs the attribute detection result. This embodiment selects eight voice attributes as classification targets, detecting gender, age, emotion, and health attributes respectively, as follows:
Gender attribute:
First voice attribute: detects male voice or female voice;
Age attribute:
Second voice attribute: detects child or adult;
Emotion attribute:
Third voice attribute: detects whether the speaker is angry;
Fourth voice attribute: detects whether the speaker is sad;
Fifth voice attribute: detects whether the speaker is cheerful;
Health attribute:
Sixth voice attribute: detects whether the speaker is coughing;
Seventh voice attribute: detects whether the voice is nasal;
Eighth voice attribute: detects whether the voice is hoarse;
An attribute recognition classifier has two modes of operation: training mode and test mode. In training mode, the classifier learns the latent features and regularities in the data samples: a large number of data samples are collected, each sample is manually labeled with the voice attribute class it belongs to, the samples and their corresponding labels are input into the classifier, and a training algorithm adjusts the model parameters. After training is complete, the characteristics of the different classes are all reflected in the classifier's model parameters, which can then be used to test new data. In test mode, the classifier directly classifies newly collected data according to the rules learned earlier and outputs the classification result; no manual labeling step is needed.
In this embodiment, multiple voice attributes must be detected, so a separate attribute recognition classifier is trained for each voice attribute. Each classifier outputs two classes, identifying the probabilities that the attribute is "positive" and "negative". For example, male/female is output for the gender attribute, child/adult for the age attribute, yes/no for the health attributes, and so on.
Several algorithms can serve as the attribute recognition classifier, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN), and deep neural networks (DNN); this embodiment selects deep neural networks to form the attribute classification unit.
In this embodiment, eight independent attribute recognition classifiers are designed for the eight voice attributes, each classifier adopting a deep neural network algorithm.
Each attribute recognition DNN adopts the same structure: the input layer contains 51 nodes, corresponding to the acoustic features described above; there are 4 hidden layers, each with 512 nodes; the output layer contains two nodes, corresponding to the "positive" and "negative" classes of the voice attribute. Hidden-layer nodes use the sigmoid activation function, output-layer nodes use the softmax function, and adjacent layers are fully connected. The connection weights w are free parameters that must be obtained by training.
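The described 51-512-512-512-512-2 network with sigmoid hidden units and a softmax output can be sketched as a plain forward pass. The initialization below is an arbitrary placeholder for trained weights, not part of the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dnn_forward(features, weights, biases):
    """Forward pass of one attribute classifier: 51 -> 512 x 4 -> 2.

    Hidden layers use sigmoid; the output layer is a softmax over the
    positive/negative classes of the attribute.
    """
    h = np.asarray(features, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)
    logits = weights[-1] @ h + biases[-1]
    return softmax(logits)

def init_dnn(rng, sizes=(51, 512, 512, 512, 512, 2)):
    """Random (untrained) fully connected parameters matching the text's layer sizes."""
    weights = [rng.standard_normal((m, n)) * 0.01
               for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(m) for m in sizes[1:]]
    return weights, biases
```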
Training adopts a two-stage method, as follows:
1) Pre-training: unsupervised restricted Boltzmann machines (RBM) are used to initialize the weights of each DNN layer, layer by layer.
A restricted Boltzmann machine (RBM) is a generative model. Inspired by the energy functionals of statistical mechanics, the RBM introduces an energy function to describe the probability distribution of the data. The energy function is a measure of the state of the whole system: the more ordered the system, or the more concentrated its probability distribution, the lower the system's energy; conversely, the more disordered the system, or the closer its distribution to uniform, the higher the energy. The minimum of the energy function corresponds to the most stable state of the system. An RBM comprises two layers of nodes: a visible layer and a hidden layer. Typically the visible layer receives the raw data and the hidden layer outputs the learned features. Here "restricted" means that nodes within the same layer are not connected to each other, while nodes in different layers are fully interconnected. Suppose the visible-layer variables are v and the hidden-layer variables are h. When both layers follow Bernoulli distributions, their joint probability distribution p(v, h) can be defined through the energy function E(v, h):
E(v,h) = -\sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (1)
where w_{ij} denotes the connection weight between visible node i and hidden node j, and the vectors a and b denote the biases of the visible and hidden layers respectively. The joint distribution follows from the energy function, with Z the normalization coefficient in formula (2):

p(v,h) = \frac{1}{Z} e^{-E(v,h)}, \quad Z = \sum_{v,h} e^{-E(v,h)}    (2)

Taking the marginal of the joint probability p(v, h) over the hidden variable h yields the observation likelihood p(v) of the data, as shown in formula (3):

p(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}    (3)
By maximum likelihood estimation, the criterion function of RBM training is:

w^* = \arg\max_w \sum_n \log p(v^{(n)})    (4)

where w denotes the weight parameters and n indexes the n-th training sample. Applying gradient descent to optimize formula (4) yields the weight update:
\Delta w_{ij} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}    (5)
Here \langle \cdot \rangle denotes the expectation of the enclosed variable. The first term is an expectation over the given sample data, while the second term is an expectation under the model itself, which is not directly available. The typical approach obtains it by Gibbs sampling, and a fast algorithm called contrastive divergence (CD) can solve formula (5) efficiently. The trained RBM weights are used to initialize the DNN: RBMs are trained layer by layer, the hidden-layer output of a lower RBM serving as the visible layer of the next RBM, stacking upward until the set number of DNN layers is reached.
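A minimal CD-1 update for a Bernoulli-Bernoulli RBM, implementing the gradient of formula (5) with one Gibbs step. The learning rate and the use of probabilities (rather than binary samples) in the negative-phase statistics are common conventions, not details specified in the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, rng, lr=0.1):
    """One contrastive-divergence (CD-1) step for a Bernoulli RBM.

    v0: batch of binary visible vectors, shape (batch, n_visible).
    Returns updated (W, a, b), following
    Delta w_ij = <v_i h_j>_data - <v_i h_j>_model.
    """
    # positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one step of Gibbs sampling
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - pv1.T @ ph1) / batch
    da = (v0 - pv1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db
```

Stacking works as the text describes: after training one RBM, its hidden probabilities `sigmoid(v @ W + b)` become the visible data for the next RBM.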
2) Fine-tuning: the error back-propagation (EBP) algorithm is used to adjust the initialized network parameters; that is, the fine-tuning stage adopts the training method of error back-propagation.
The acoustic features of each speech frame are independently input into the eight DNNs, producing eight voice attribute outputs that identify the probability of each attribute. The mean of the probability outputs over all speech frames of an utterance, computed according to formula (6) below, serves as the final probability of that attribute for the utterance, i.e. the classification result and the second signal:

P_{k,pos} = \frac{1}{N} \sum_{n=1}^{N} P_{kn,pos}    (6)
where k is the voice attribute index (in this example, k ranges from 1 to 8); N is the number of frames in the speech segment; P_{kn,pos} is the probability that attribute k is positive at frame n; and P_{k,pos} is the mean probability over the N frames that attribute k is positive, i.e. the "positive" output of the DNN.
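Formula (6) is simply a per-attribute average of the frame-level "positive" probabilities; a direct sketch:

```python
import numpy as np

def segment_probability(frame_probs):
    """Average per-frame positive-class probabilities over all N frames
    of a segment, per formula (6).

    frame_probs: array of shape (N, K), where element [n, k] is the
    positive probability of attribute k at frame n.
    Returns a length-K vector of segment-level scores.
    """
    return np.asarray(frame_probs, dtype=float).mean(axis=0)
```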
The interactive decision-making unit 30 takes the second signal as input, makes the decision about the interaction content, and outputs feedback information. This embodiment uses a binary tree to define the decision rules. At each node of the binary tree, a threshold is set for the probability of a certain attribute: if the probability exceeds the threshold, the left child node is taken, otherwise the right child node, until a leaf node is reached and the decision result obtained.
Different decision binary trees can be designed for different scenarios. For example, under the song-request recommendation scenario, the following judgments can be made from the voice attributes: first judge whether the speaker is a child, and if so, select a child's voice as the response voice; then judge whether the voice is male or female, and for a male voice select a young girl's voice as the response voice; then judge whether the speaker is angry, and if not, continue to judge whether the speaker is sad; if sad, further determine whether there is coughing. If there is coughing, the speaker's basic condition can be judged to be "a little boy with a cold whose spirits are low", in which case some more cheerful nursery rhymes, such as the "Health Song", can be recommended. Depending on the application scenario, the feedback information can be audio, video, or text.
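The threshold-per-node binary decision tree can be sketched as follows. The tree below is a simplified, hypothetical version of the song-request example (it omits the gender branch); the thresholds and leaf labels are illustrative assumptions:

```python
class DecisionNode:
    """Internal node: compare one attribute's probability to a threshold.

    If the probability exceeds the threshold, follow `left`; otherwise
    `right` (as described in the text). Leaves are plain strings naming
    the feedback to output.
    """
    def __init__(self, attribute, threshold, left, right):
        self.attribute = attribute
        self.threshold = threshold
        self.left = left
        self.right = right

def decide(node, probs):
    """Walk the tree with a dict of attribute -> positive probability."""
    while isinstance(node, DecisionNode):
        if probs[node.attribute] > node.threshold:
            node = node.left
        else:
            node = node.right
    return node

# hypothetical song-recommendation tree loosely following the example above
tree = DecisionNode("child", 0.5,
                    DecisionNode("sad", 0.5,
                                 DecisionNode("cough", 0.5,
                                              "cheerful nursery rhyme",
                                              "gentle children's song"),
                                 "child voice response"),
                    "adult response")
```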
Referring to Fig. 2, a flow chart of an interactive method based on voice attribute classification is provided.
First, the acoustic features of the input voice signal are extracted to generate the first signal (step 100). The main acoustic feature information extracted comprises the fundamental frequency, MFCC, and formant signals of the voice. In addition, to increase classification accuracy, this step further extracts signals such as the short-time energy, pitch jitter, and harmonic-to-noise ratio on the above basis.
Next, the voice attribute values of the first signal are determined by pattern recognition classifiers trained in advance on a large amount of labeled data, generating the second signal (step 200). This step adopts attribute recognition classifiers trained on a large amount of acoustic feature data to identify the probability that a given voice attribute is present. The classifier can be chosen from several algorithms, including support vector machines (SVM), Gaussian mixture models (GMM), artificial neural networks (ANN), and deep neural networks (DNN). This embodiment selects deep neural networks for voice attribute classification, designing eight independent DNNs for the eight attributes of the voice.
Finally, feedback information is output based on the second signal (step 300). The invention uses a binary tree to define the decision rules: at each node of the binary tree a threshold is set for a certain voice attribute value; if the value exceeds the threshold, the left child node is taken, otherwise the right child node, until a leaf node is reached, the decision result is obtained, and the feedback information is output.
Step 100 also includes digitization pre-processing and speech endpoint detection of the voice signal; this process extracts an effective voice signal and reduces the interference and extra computation brought by silence and noise.
It should be noted that although the operations of the method of the invention are depicted in the drawings in a particular order, this neither requires nor implies that the operations must be performed in that particular order, or that all the operations shown must be performed, to achieve the desired result. On the contrary, some steps may additionally or alternatively be omitted, multiple steps may be merged into one step, and/or one step may be decomposed into multiple steps. For example, the acoustic feature extraction step and the voice attribute classification step can be merged into a single step and performed together.
Having the beneficial effects that of the present embodiment, extract one group of acoustic features to classify 8 kinds of voice attributes simultaneously, the feature of voice signal and potential information are excavated ground more abundant, the classification of speaker is more careful, therefore can make and have more cross reaction targetedly, obtain better Consumer's Experience.And, it is not necessary to user registers, but makes up the disappearance of log-on message with more phonetic feature so that with convenient flexibly.
Especially, according to embodiment of the disclosure, may be implemented as computer software programs above with reference to Fig. 2 method described.Such as, embodiment of the disclosure and include a kind of computer program, it includes the computer program being tangibly embodied on machine readable media, and described computer program comprises the program code of the method for performing Fig. 2.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The above description is merely a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art will appreciate that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features having similar functions disclosed herein.
Claims (14)
1. An interactive system based on voice attribute classification, the system comprising:
an acoustic feature extraction unit, configured to extract acoustic features of an input voice signal and generate a first signal;
a voice attribute classification unit, configured to pass the first signal through an attribute classifier to determine its voice attribute values, output the voice attribute results, and generate a second signal;
an interactive decision unit, configured to output feedback information based on the second signal.
2. The system according to claim 1, wherein the acoustic feature extraction unit includes a front-end processing unit configured to perform digital pre-processing and speech endpoint detection on the input voice signal.
3. The system according to claim 1, wherein the acoustic feature extraction unit is configured to extract the fundamental frequency of the voice, mel-frequency cepstral coefficients (MFCC), and formants.
4. The system according to claim 3, wherein the acoustic features extracted by the acoustic feature extraction unit further include at least one of: short-time energy features, pitch jitter and shimmer, and harmonic-to-noise ratio.
5. The system according to claim 1, wherein the voice attribute classification unit includes at least one of the following attribute recognition classifiers: a gender attribute recognition classifier, an age attribute recognition classifier, an emotion attribute recognition classifier, and a health attribute recognition classifier.
6. The system according to claim 1, wherein the attribute recognition classifier adopts a deep neural network (DNN) algorithm.
7. The system according to claim 6, wherein the operating modes of the attribute recognition classifier are divided into a training mode and a test mode, wherein the training mode adopts two-stage training comprising a pre-training stage and a fine-tuning stage, the pre-training stage using an unsupervised restricted Boltzmann machine model and the fine-tuning stage using an error back-propagation algorithm.
8. An interactive method based on voice attribute classification, the method comprising:
extracting acoustic features of an input voice signal and generating a first signal;
passing the first signal through attribute recognition classification to determine its voice attribute values, outputting the voice attribute results, and generating a second signal;
outputting feedback information based on the second signal.
9. The method according to claim 8, wherein extracting the acoustic features of the input voice signal includes front-end processing, the front-end processing performing digital pre-processing and speech endpoint detection on the input voice signal.
10. The method according to claim 8, wherein the acoustic features include the fundamental frequency of the voice, mel-frequency cepstral coefficients (MFCC), and formants.
11. The method according to claim 10, wherein the acoustic features further include at least one of: short-time energy features, pitch jitter and shimmer, and harmonic-to-noise ratio.
12. The method according to claim 8, wherein the second signal results from voice attribute classification including at least one of the following attribute recognition classifications: gender attribute recognition classification, age attribute recognition classification, emotion attribute recognition classification, and health attribute recognition classification.
13. The method according to claim 8, wherein the attribute recognition classification adopts a deep neural network (DNN) algorithm.
14. The method according to claim 13, wherein the operating modes of the attribute recognition classification are divided into a training mode and a test mode, wherein the training mode adopts two-stage training comprising a pre-training stage and a fine-tuning stage, the pre-training stage using an unsupervised restricted Boltzmann machine model and the fine-tuning stage using an error back-propagation algorithm.
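Claims 7 and 14 describe two-stage training: unsupervised restricted-Boltzmann-machine pre-training followed by back-propagation fine-tuning. The numpy sketch below illustrates that scheme for a single hidden layer; the toy data, layer sizes, and all hyper-parameters are purely illustrative and do not come from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- Stage 1: unsupervised RBM pre-training (contrastive divergence, CD-1) ---
def pretrain_rbm(data, n_hidden, epochs=50, lr=0.1):
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        h_prob = sigmoid(data @ W)                         # positive phase
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T)                  # one Gibbs step
        h_recon = sigmoid(v_recon @ W)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

# --- Stage 2: supervised fine-tuning with error back-propagation ---
def finetune(data, labels, W, epochs=200, lr=0.5):
    V = 0.01 * rng.standard_normal((W.shape[1], labels.shape[1]))
    for _ in range(epochs):
        h = sigmoid(data @ W)
        out = sigmoid(h @ V)
        err = out - labels                  # output-layer error
        dh = (err @ V.T) * h * (1 - h)      # error back-propagated to hidden layer
        V -= lr * h.T @ err / len(data)
        W -= lr * data.T @ dh / len(data)
    return W, V

# Toy binary "feature" data; the target attribute mirrors the first feature.
X = (rng.random((64, 8)) < 0.5).astype(float)
y = X[:, :1]
W = pretrain_rbm(X, n_hidden=4)             # pre-training stage
W, V = finetune(X, y, W)                    # fine-tuning stage
pred = sigmoid(sigmoid(X @ W) @ V) > 0.5
print("training accuracy:", (pred == y).mean())
```

A production system would of course stack several RBM layers before fine-tuning the whole network, as is standard for DNN pre-training.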
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610244968.8A CN105761720B (en) | 2016-04-19 | 2016-04-19 | Interactive system and method based on voice attribute classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610244968.8A CN105761720B (en) | 2016-04-19 | 2016-04-19 | Interactive system and method based on voice attribute classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105761720A true CN105761720A (en) | 2016-07-13 |
CN105761720B CN105761720B (en) | 2020-01-07 |
Family
ID=56324445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610244968.8A Active CN105761720B (en) | 2016-04-19 | 2016-04-19 | Interactive system and method based on voice attribute classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105761720B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1107227A2 (en) * | 1999-11-30 | 2001-06-13 | Sony Corporation | Voice processing |
JP2003345385A (en) * | 2002-05-30 | 2003-12-03 | Matsushita Electric Ind Co Ltd | Voice recognition and discrimination device |
CN1564245A (en) * | 2004-04-20 | 2005-01-12 | 上海上悦通讯技术有限公司 | Stunt method and device for baby's crying |
CN1975856A (en) * | 2006-10-30 | 2007-06-06 | 邹采荣 | Speech emotion identifying method based on supporting vector machine |
CN101201980A (en) * | 2007-12-19 | 2008-06-18 | 北京交通大学 | Remote Chinese language teaching system based on voice affection identification |
US20100138223A1 (en) * | 2007-03-26 | 2010-06-03 | Takafumi Koshinaka | Speech classification apparatus, speech classification method, and speech classification program |
US8239194B1 (en) * | 2011-07-28 | 2012-08-07 | Google Inc. | System and method for multi-channel multi-feature speech/noise classification for noise suppression |
CN103117060A (en) * | 2013-01-18 | 2013-05-22 | 中国科学院声学研究所 | Modeling approach and modeling system of acoustic model used in speech recognition |
CN103546503A (en) * | 2012-07-10 | 2014-01-29 | 百度在线网络技术(北京)有限公司 | Voice-based cloud social system, voice-based cloud social method and cloud analysis server |
- 2016-04-19: Application CN201610244968.8A (CN) granted as patent CN105761720B (status: Active)
Non-Patent Citations (1)
Title |
---|
怀进鹏 (Huai Jinpeng): "Advances in Intelligent Computer Research: Proceedings of the 863 Program Intelligent Computer Theme Academic Conference", 31 March 2001 *
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106686267A (en) * | 2015-11-10 | 2017-05-17 | 中国移动通信集团公司 | Method and system for implementing personalized voice service |
CN107886955A (en) * | 2016-09-29 | 2018-04-06 | 百度在线网络技术(北京)有限公司 | A kind of personal identification method, device and the equipment of voice conversation sample |
WO2018132187A1 (en) * | 2017-01-12 | 2018-07-19 | Qualcomm Incorporated | Characteristic-based speech codebook selection |
US10878831B2 (en) | 2017-01-12 | 2020-12-29 | Qualcomm Incorporated | Characteristic-based speech codebook selection |
CN106898355A (en) * | 2017-01-17 | 2017-06-27 | 清华大学 | A kind of method for distinguishing speek person based on two modelings |
CN106898355B (en) * | 2017-01-17 | 2020-04-14 | 北京华控智加科技有限公司 | Speaker identification method based on secondary modeling |
CN107316635A (en) * | 2017-05-19 | 2017-11-03 | 科大讯飞股份有限公司 | Audio recognition method and device, storage medium, electronic equipment |
CN108701469B (en) * | 2017-07-31 | 2023-06-20 | 深圳和而泰智能控制股份有限公司 | Cough sound recognition method, device, and storage medium |
CN108701469A (en) * | 2017-07-31 | 2018-10-23 | 深圳和而泰智能家居科技有限公司 | Cough sound recognition methods, equipment and storage medium |
CN107680599A (en) * | 2017-09-28 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | User property recognition methods, device and electronic equipment |
CN108132995A (en) * | 2017-12-20 | 2018-06-08 | 北京百度网讯科技有限公司 | For handling the method and apparatus of audio-frequency information |
CN107995370A (en) * | 2017-12-21 | 2018-05-04 | 广东欧珀移动通信有限公司 | Call control method, device and storage medium and mobile terminal |
CN108109622A (en) * | 2017-12-28 | 2018-06-01 | 武汉蛋玩科技有限公司 | A kind of early education robot voice interactive education system and method |
CN108186033A (en) * | 2018-01-08 | 2018-06-22 | 杭州草莽科技有限公司 | A kind of child's mood monitoring method and its system based on artificial intelligence |
CN111989742A (en) * | 2018-04-13 | 2020-11-24 | 三菱电机株式会社 | Speech recognition system and method for using speech recognition system |
CN109165284A (en) * | 2018-08-22 | 2019-01-08 | 重庆邮电大学 | A kind of financial field human-computer dialogue intension recognizing method based on big data |
CN109102805A (en) * | 2018-09-20 | 2018-12-28 | 北京长城华冠汽车技术开发有限公司 | Voice interactive method, device and realization device |
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN111599342A (en) * | 2019-02-21 | 2020-08-28 | 北京京东尚科信息技术有限公司 | Tone selecting method and system |
CN110021308A (en) * | 2019-05-16 | 2019-07-16 | 北京百度网讯科技有限公司 | Voice mood recognition methods, device, computer equipment and storage medium |
CN110379441A (en) * | 2019-07-01 | 2019-10-25 | 特斯联(北京)科技有限公司 | A kind of voice service method and system based on countering type smart network |
CN112530418A (en) * | 2019-08-28 | 2021-03-19 | 北京声智科技有限公司 | Voice wake-up method, device and related equipment |
CN110600042B (en) * | 2019-10-10 | 2020-10-23 | 公安部第三研究所 | Method and system for recognizing gender of disguised voice speaker |
CN110600042A (en) * | 2019-10-10 | 2019-12-20 | 公安部第三研究所 | Method and system for recognizing gender of disguised voice speaker |
CN111179915A (en) * | 2019-12-30 | 2020-05-19 | 苏州思必驰信息科技有限公司 | Age identification method and device based on voice |
CN111772422A (en) * | 2020-06-12 | 2020-10-16 | 广州城建职业学院 | Intelligent crib |
CN113143570A (en) * | 2021-04-27 | 2021-07-23 | 福州大学 | Multi-sensor fusion feedback adjustment snore stopping pillow |
CN113143570B (en) * | 2021-04-27 | 2023-08-11 | 福州大学 | Snore relieving pillow with multiple sensors integrated with feedback adjustment |
Also Published As
Publication number | Publication date |
---|---|
CN105761720B (en) | 2020-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105761720A (en) | Interaction system based on voice attribute classification, and method thereof | |
CN109243494B (en) | Children emotion recognition method based on multi-attention mechanism long-time memory network | |
Schuller et al. | Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture | |
Schuller et al. | Speaker independent speech emotion recognition by ensemble classification | |
Tong et al. | A comparative study of robustness of deep learning approaches for VAD | |
CN112581979B (en) | Speech emotion recognition method based on spectrogram | |
Joshy et al. | Automated dysarthria severity classification: A study on acoustic features and deep learning techniques | |
Ghai et al. | Emotion recognition on speech signals using machine learning | |
Samantaray et al. | A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages | |
Cnn | Speech emotion recognition using convolutional neural network (CNN) | |
CN111899766B (en) | Speech emotion recognition method based on optimization fusion of depth features and acoustic features | |
CN110085216A (en) | A kind of vagitus detection method and device | |
CN111916066A (en) | Random forest based voice tone recognition method and system | |
Caihua | Research on multi-modal mandarin speech emotion recognition based on SVM | |
Přibil et al. | GMM-based speaker age and gender classification in Czech and Slovak | |
Khan et al. | Quranic reciter recognition: a machine learning approach | |
Cao et al. | Speaker-independent speech emotion recognition based on random forest feature selection algorithm | |
Praksah et al. | Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier | |
CN111081273A (en) | Voice emotion recognition method based on glottal wave signal feature extraction | |
Watrous | Phoneme discrimination using connectionist networks | |
Ling | An acoustic model for English speech recognition based on deep learning | |
CN108899046A (en) | A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification | |
Gomes et al. | i-vector algorithm with Gaussian Mixture Model for efficient speech emotion recognition | |
Alshamsi et al. | Automated speech emotion recognition on smart phones | |
CN113571095A (en) | Speech emotion recognition method and system based on nested deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||