CN105761720A - Interaction system based on voice attribute classification, and method thereof - Google Patents

Interaction system based on voice attribute classification, and method thereof

Info

Publication number: CN105761720A
Application number: CN201610244968.8A
Authority: CN (China)
Prior art keywords: voice, signal, classification, acoustic features, attribute
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN105761720B (en)
Inventor: 潘复平
Current Assignee: Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee: Beijing Horizon Robotics Technology Research and Development Co Ltd
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date / filing date: 2016-04-19
Publication of CN105761720A: 2016-07-13
Application granted; publication of CN105761720B: 2020-01-07
Current legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L 25/66 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an interaction system based on voice attribute classification, and a method thereof. The system includes an acoustic feature extraction unit, a voice attribute classification unit, and an interaction decision-making unit. The acoustic feature extraction unit is configured to extract acoustic features from an input voice signal and generate a first signal; the voice attribute classification unit is configured to determine voice attribute values of the first signal by means of attribute recognition classifiers and to output the voice attribute results as a second signal; and the interaction decision-making unit is configured to output feedback information based on the second signal. The voice attribute classification unit can detect multiple voice attributes simultaneously and output corresponding feedback information according to each voice attribute value, making the interaction process rich and varied.

Description

Interaction system based on voice attribute classification, and method thereof
Technical field
The present disclosure relates generally to the field of interaction, in particular to human-computer interaction technology, and more particularly to interaction systems based on voice attributes.
Background art
In conventional human-machine voice interaction, the machine recognizes a spoken instruction issued by a person and then reacts according to the recognition result. The content of such interaction is limited to the literal meaning of the voice instruction; its form is monotonous, the user experience is dull, and it is unsuitable for toys, household appliances, and other applications that require lively and varied interaction scenarios.
At present, human-machine interaction often uses voiceprint registration to determine user identity and achieve a degree of personalized interaction. In voiceprint registration, the user's voice is first enrolled using voiceprint recognition technology, associating the user's identity with the voiceprint. In use, the voiceprint of the speaker is identified first, the speaker's identity is then inferred from the voiceprint, and some limited interactive variation is made according to that identity. For example, some intelligent toys can judge from the voice whether the current speaker is the father, the mother, or the baby, and change how the speaker is addressed accordingly.
The prior art has two drawbacks. On the one hand, conventional techniques can usually detect only one kind of voice attribute, so the variation in interaction content that different voice attributes can produce is extremely limited. On the other hand, voiceprint registration is cumbersome and inflexible to use.
Summary of the invention
In view of the above defects or deficiencies of the prior art, it is desirable to provide an interaction system based on voice attribute classification, and a method thereof.
In a first aspect, an interaction system based on voice attribute classification is proposed. The system includes:
an acoustic feature extraction unit, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit, configured to determine voice attribute values of the first signal by means of attribute recognition classifiers and to output the voice attribute results as a second signal;
an interaction decision-making unit, configured to output feedback information based on the second signal.
In a second aspect, an interaction method based on voice attribute classification is provided. The method includes:
extracting acoustic features from an input voice signal to generate a first signal;
determining voice attribute values of the first signal through attribute recognition classification, and outputting the voice attribute results as a second signal;
outputting feedback information based on the second signal.
According to the technical solution provided by the embodiments of the present application, the voice attribute classification unit can detect multiple voice attributes of the same utterance simultaneously and output corresponding feedback information according to each voice attribute value, making the interaction flow rich and varied. In addition, because the present invention classifies voice attributes, the identity of the speaker can be judged automatically, so no registration process is needed and the system is convenient, free, and flexible to use.
Brief description of the drawings
Other features, objects, and advantages will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the following drawings:
Fig. 1 is a structural diagram of an interaction system based on voice attribute classification according to an embodiment.
Fig. 2 is a flow chart of an interaction method based on voice attribute classification.
Detailed description of the invention
The present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of the application and the features of the embodiments may be combined with one another.
In a voice interaction, besides recognizing the textual content of a voice instruction, other voice attributes of the speech can also be recognized and used to enrich the form and content of the interaction. These voice attributes include the speaker's age range, gender, emotion, and state of health. Age and gender are reflected in the fundamental frequency and timbre of the voice; emotion is reflected in its stress, intonation, speaking rate, and pauses; the state of health is reflected in phenomena such as whether the voice is hoarse, whether it is accompanied by coughing, and whether it carries a nasal quality. The same voice attribute of different speakers exhibits the same distributional regularity in the voice signal: for example, the fundamental frequency of a male voice is relatively low and its spectral energy is concentrated mostly in the low-frequency region, whereas the fundamental frequency of a female voice is higher and its spectral energy is concentrated mostly in the high-frequency region. Based on these characteristics of speech, a large amount of speech data sharing a given attribute can be collected, labeled data reflecting that attribute can be extracted, and an attribute recognition classifier can be trained so that the voice attribute can be classified. For multiple voice attributes, multiple attribute recognition classifiers can be trained, each performing its own classification and decision. After a series of voice attribute values has been obtained for an utterance, interactive feedback information is output according to the decision rules set for the specific voice interaction scenario.
The present invention can be applied, for example, to a song-request scenario: if the speaker's emotion is recognized as sad, some cheerful songs can be recommended; if the speaker's emotion is irritable, some gentle songs can be recommended.
The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Referring to Fig. 1, a structural diagram of one embodiment of an interaction system based on voice attribute classification is provided. The system includes:
an acoustic feature extraction unit 10, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit 20, configured to classify the first signal according to voice attributes and output the voice attribute results as a second signal;
an interaction decision-making unit 30, configured to judge the type of interaction based on the second signal and output feedback information. The acoustic feature extraction unit 10 further includes a front-end processing unit, configured to perform digital pre-processing and voice endpoint detection on the input voice signal; the front-end processing unit is mainly responsible for obtaining the effective voice signal, reducing both the interference introduced by silence and noise and the amount of computation.
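By way of illustration only, a minimal sketch of the kind of energy-based voice endpoint detection such a front-end processing unit might perform is given below, assuming 16 kHz mono PCM input; the frame length, hop size, and threshold ratio are illustrative assumptions, not values specified by this application.

```python
import numpy as np

def detect_endpoints(signal, sample_rate=16000, frame_ms=25, hop_ms=10, threshold_ratio=0.1):
    """Crude energy-based voice endpoint detection on a mono PCM signal."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    # Short-time energy of each frame
    energies = np.array([
        np.sum(signal[i:i + frame_len].astype(np.float64) ** 2)
        for i in range(0, len(signal) - frame_len, hop_len)
    ])
    if energies.size == 0:
        return signal                      # too short to frame
    threshold = threshold_ratio * energies.max()
    voiced = np.where(energies > threshold)[0]
    if voiced.size == 0:
        return signal[:0]                  # no speech detected
    start = voiced[0] * hop_len
    end = voiced[-1] * hop_len + frame_len
    # Effective voice segment passed on to feature extraction
    return signal[start:end]
```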
The acoustic feature extraction unit 10 extracts a series of acoustic features that reflect the voice attributes. The extracted acoustic features mainly include:
Fundamental frequency: pitch refers to the periodicity caused by vocal cord vibration when voiced sounds are produced, and the fundamental frequency is the frequency of that vibration. Pitch is one of the most important parameters of the voice signal and can convey information such as the emotion, age, and gender carried in the voice. Because the voice signal is non-stationary and aperiodic and the pitch period varies over a wide range, accurate detection of the fundamental frequency is difficult. This embodiment uses the cepstrum method to detect the fundamental frequency.
MFCC (mel-frequency cepstral coefficients): spectral features are short-time features. When extracting spectral features, in order to exploit the characteristics of the human auditory system, the spectrum of the voice signal is usually passed through a bank of filters whose center frequencies follow a perceptual scale, and the spectral features are then extracted from the filtered signals. This embodiment adopts mel-frequency cepstral coefficient (MFCC) features.
Formants: when a person speaks, the vocal tract continuously changes shape to keep the speech clear, and its length is also affected by the speaker's emotional state. During phonation the vocal tract acts as a resonator: when the vowel excitation enters the vocal tract, resonances are produced, yielding a set of resonant frequencies known as formant frequencies, or formants for short, which depend on the shape and physical characteristics of the vocal tract. Different vowels correspond to different formant parameters; using more formants describes the voice better, and in practice the first three are generally collected.
The above three are the basic acoustic features used by the present invention, and voice attribute classification according to the present invention can be realized on the basis of these acoustic features. To achieve a better effect, the following acoustic features of the speaker can additionally be extracted:
Short-time energy: the energy of the voice signal reflects the intensity of the voice and is strongly correlated with emotional information. Short-time energy is computed in the time domain as the sum of squared sample amplitudes within one frame of voice.
Pitch jitter and shimmer: jitter refers to the variation of the fundamental frequency between consecutive pitch periods, i.e. the change in fundamental frequency between two adjacent frames of the voice signal. Shimmer refers to the variation of energy between consecutive periods, i.e. the change in short-time energy between two adjacent frames of the voice signal.
Harmonic-to-noise ratio: as the name suggests, the ratio of the harmonic component to the noise component in the voice signal; it reflects emotional change to a certain extent.
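Purely as an illustration, a rough sketch of how per-frame features of this kind might be computed is given below (cepstral fundamental-frequency estimation and short-time energy only; MFCC, formant, jitter/shimmer, and harmonic-to-noise-ratio extraction are omitted). The Hamming window and the 50-400 Hz pitch search range are assumptions of this sketch, not values specified by the application.

```python
import numpy as np

def frame_features(frame, sample_rate=16000):
    """Return (fundamental frequency in Hz, short-time energy) for one voice frame."""
    frame = frame.astype(np.float64)
    # Short-time energy: sum of squared sample amplitudes in the frame
    energy = np.sum(frame ** 2)

    # Cepstrum-based pitch detection: peak of the real cepstrum inside
    # the quefrency range corresponding to 50-400 Hz
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    q_min = int(sample_rate / 400)              # highest pitch considered: 400 Hz
    q_max = min(int(sample_rate / 50), len(cepstrum) - 1)   # lowest pitch: 50 Hz
    peak_q = q_min + np.argmax(cepstrum[q_min:q_max])
    f0 = sample_rate / peak_q
    return f0, energy
```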
The voice attribute classification unit 20 provides at least one attribute recognition classifier according to the selected voice attributes. Each attribute recognition classifier adopts pattern recognition techniques: the acoustic features extracted above are fed into the attribute recognition classifier, and the classifier outputs the attribute detection result. This embodiment selects 8 voice attributes as classification targets, detecting the gender attribute, age attribute, emotion attributes, and health attributes respectively, as follows:
Gender attribute:
First voice attribute: detects whether the voice is male or female;
Age attribute:
Second voice attribute: detects whether the speaker is a child or an adult;
Emotion attributes:
Third voice attribute: detects whether the speaker is angry;
Fourth voice attribute: detects whether the speaker is sad;
Fifth voice attribute: detects whether the speaker is cheerful;
Health attributes:
Sixth voice attribute: detects whether there is coughing;
Seventh voice attribute: detects whether there is a nasal quality;
Eighth voice attribute: detects whether the voice is hoarse.
An attribute recognition classifier has two modes of operation: a training mode and a test mode. In training mode, the attribute recognition classifier learns the latent features and regularities in data samples: a large number of data samples is collected, the voice attribute class of each sample is labeled manually, the samples and their corresponding voice attribute class labels are input into the attribute recognition classifier, and a training algorithm is used to adjust the classifier's model parameters. After training, the characteristics of the different classes of data are all reflected in the model parameters of the attribute recognition classifier, which can then be used to test new data. In test mode, the classifier directly classifies newly acquired data according to the rules it learned earlier and outputs the classification result; no manual labeling step is needed.
In this embodiment, multiple voice attributes need to be detected, so a separate attribute recognition classifier is trained for each voice attribute. Each classifier outputs two classes, giving the probability that the attribute is "positive" or "negative". For example, the gender attribute outputs male/female, the age attribute outputs child/adult, and the health attributes output yes/no.
Many algorithms can be chosen for the attribute recognition classifier, including support vector machines (SVM), Gaussian mixture models (GMM), neural networks (ANN), and deep neural networks (DNN). This embodiment uses deep neural networks to build the attribute classification unit.
This embodiment designs 8 independent attribute recognition classifiers for the 8 voice attributes, each adopting a deep neural network algorithm.
Each attribute recognition DNN adopts the same structure: the input layer contains 51 nodes, corresponding to the acoustic features described above; there are 4 hidden layers, each with 512 nodes; the output layer contains two nodes, corresponding to the "positive" and "negative" classes of the voice attribute; the activation function of the hidden-layer nodes is the sigmoid function; the activation function of the output-layer nodes is the softmax function; and adjacent layers are fully connected. The connection weights w are free parameters that must be obtained by training.
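By way of illustration only, one possible reading of this architecture is sketched below in PyTorch (the framework is an arbitrary choice for illustration; the application does not name one). The 51-dimensional input, four 512-node sigmoid hidden layers, and 2-node softmax output follow the description above; everything else is an assumption.

```python
import torch
import torch.nn as nn

class AttributeDNN(nn.Module):
    """One binary attribute recognition classifier: 51 -> 512 x 4 -> 2."""
    def __init__(self, n_features=51, n_hidden=512, n_layers=4):
        super().__init__()
        layers = []
        in_dim = n_features
        for _ in range(n_layers):                  # four fully connected sigmoid hidden layers
            layers += [nn.Linear(in_dim, n_hidden), nn.Sigmoid()]
            in_dim = n_hidden
        layers += [nn.Linear(in_dim, 2)]           # "positive" / "negative" output nodes
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)  # softmax output layer

# Eight independent classifiers, one per voice attribute (gender, age, three
# emotion attributes, three health attributes), as described in the embodiment.
classifiers = [AttributeDNN() for _ in range(8)]
```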
Training adopts a two-stage procedure, as follows:
1) Pre-training: unsupervised restricted Boltzmann machines (RBM) are used to initialize the DNN weights layer by layer.
A restricted Boltzmann machine (RBM) is a generative model. Inspired by the energy functional of statistical mechanics, the RBM introduces an energy function to describe the probability distribution of the data. The energy function is a measure of the state of the whole system: the more ordered the system, or the more concentrated its probability distribution, the smaller the energy; conversely, the more disordered the system, or the closer its distribution is to uniform, the larger the energy. The minimum of the energy function corresponds to the most stable state of the system. An RBM consists of two layers of nodes, a visible layer and a hidden layer. In general, the visible layer receives the raw data and the hidden layer outputs the learned features. "Restricted" means that nodes within the same layer have no connections, while nodes in different layers are fully interconnected. Let the visible-layer variables be v and the hidden-layer variables be h. When both the visible and hidden layers follow Bernoulli distributions, their joint probability distribution p(v, h) can be defined through the energy function E(v, h):
E(v, h) = -\sum_{i \in \mathrm{visible}} a_i v_i - \sum_{j \in \mathrm{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (1)
P(v, h) = \frac{1}{Z} e^{-E(v, h)}    (2)
where w_{ij} denotes the weight connecting visible node i and hidden node j, the vectors a and b denote the biases of the visible and hidden layers respectively, and Z in formula (2) is the normalization coefficient. Marginalizing the joint probability p(v, h) over the variable h gives the observation likelihood p(v) of the data, as shown in formula (3):
P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}    (3)
Following maximum likelihood estimation, the criterion function for RBM training is:
W^{*} = \arg\max_w L(w) = \arg\max_w \sum_{n=1}^{N} \log p(v_n \mid w)    (4)
where w denotes the weight parameters and n indexes the n-th training sample. Optimizing formula (4) by gradient descent gives the weight update:
\Delta W_{ij} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}    (5)
Here \langle \cdot \rangle denotes the expectation of the enclosed variables. The first term is the expectation over the given sample data, while the second term is the expectation under the model itself, which is not directly available. A typical way to obtain it is Gibbs sampling, and formula (5) can also be solved efficiently by the fast algorithm known as contrastive divergence (CD). The trained RBM weights can then be used to initialize the DNN: the RBMs are trained layer by layer, the hidden-layer output of the lower RBM serving as the visible layer of the next RBM, and are stacked up in this way until the set number of DNN layers is reached.
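As a non-limiting sketch of the pre-training step just described, a compact CD-1 update for a single Bernoulli-Bernoulli RBM layer might look as follows in NumPy; the learning rate and the single Gibbs step are assumptions of this sketch, not parameters given in the application.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One contrastive-divergence (CD-1) step for a Bernoulli-Bernoulli RBM.

    v0: batch of visible vectors, shape (batch, n_visible)
    W:  weights, shape (n_visible, n_hidden); a, b: visible / hidden biases.
    """
    # Positive phase: hidden activations given the data
    h0_prob = sigmoid(v0 @ W + b)
    h0 = (np.random.rand(*h0_prob.shape) < h0_prob).astype(v0.dtype)
    # One Gibbs step: reconstruct visibles, then re-infer hiddens
    v1_prob = sigmoid(h0 @ W.T + a)
    h1_prob = sigmoid(v1_prob @ W + b)
    # Approximate gradient <v h>_data - <v h>_model, as in formula (5)
    batch = v0.shape[0]
    dW = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    W += lr * dW
    a += lr * np.mean(v0 - v1_prob, axis=0)
    b += lr * np.mean(h0_prob - h1_prob, axis=0)
    return W, a, b
```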
2) Fine-tuning: the error back-propagation (EBP) algorithm is used to adjust the initialized network parameters; the fine-tuning stage is therefore trained by error back-propagation.
The acoustic features of each frame of voice are fed independently into these 8 DNNs, producing 8 voice attribute value outputs that give the probability of each attribute. The mean of the probability outputs over all speech frames of an utterance, computed according to formula (6) below, is taken as the final probability for that utterance, i.e. the classification result and the second signal.
P_{k,\mathrm{pos}} = \frac{1}{N} \sum_{n=1}^{N} P_{kn,\mathrm{pos}}    (6)
where k is the voice attribute index (in this example k ranges from 1 to 8), N is the number of frames in the voice segment, P_{kn,pos} is the probability that voice attribute k is positive for the n-th frame, and P_{k,pos} is the averaged probability over the N frames that voice attribute k is positive, i.e. the "positive" output of the DNN.
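As an illustration continuing the PyTorch sketch above, the utterance-level probability of formula (6) is simply the mean of the per-frame softmax outputs of each attribute DNN. The helper below is hypothetical and assumes output index 0 is the "positive" node.

```python
import torch

def utterance_attribute_scores(frames, classifiers):
    """frames: (N, 51) per-frame features; returns the 8 averaged 'positive' probabilities."""
    x = torch.as_tensor(frames, dtype=torch.float32)
    scores = []
    with torch.no_grad():
        for dnn in classifiers:
            frame_probs = dnn(x)[:, 0]                 # per-frame P(attribute k = positive)
            scores.append(float(frame_probs.mean()))   # formula (6): mean over the N frames
    return scores
```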
The interaction decision-making unit 30 takes the second signal as input, makes a decision on the interaction content, and outputs the feedback information. This embodiment uses a binary tree to define the decision rules. At each node of the binary tree, a threshold is set on the probability of a particular attribute: if the probability exceeds the threshold, the decision moves to the left child node, otherwise to the right child node, until a leaf node is reached and the decision result is obtained.
Different decision binary trees can be designed for different scenarios. For example, in a song-request recommendation scenario, the following judgments can be made from the voice attributes. First judge whether the speaker is a child; if so, select a child's voice as the response voice. Then judge whether the voice is male or female; if male, select a young girl's voice as the response voice. Then judge whether the speaker is angry; if not, continue to judge whether the speaker is sad; if sad, continue to judge whether there is coughing. If there is coughing, the speaker's basic situation can be judged to be "a little boy with a cold whose spirits are not high", in which case some more cheerful nursery rhymes, such as "Health Song", can be recommended. Depending on the application scenario, the feedback information can be audio, video, or text.
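By way of illustration only, a threshold-based decision binary tree of the kind described might be sketched as follows; the node names, thresholds, and leaf feedback strings are invented for the song-request example and are not specified by the application.

```python
class DecisionNode:
    """Internal node: compare one attribute probability against a threshold.
    If the probability exceeds the threshold, descend to the left child, otherwise to the right."""
    def __init__(self, attribute, threshold, left, right):
        self.attribute, self.threshold = attribute, threshold
        self.left, self.right = left, right

    def decide(self, probs):
        branch = self.left if probs[self.attribute] > self.threshold else self.right
        return branch.decide(probs) if isinstance(branch, DecisionNode) else branch

# Illustrative tree fragment for the song-request scenario (thresholds are made up).
tree = DecisionNode("child", 0.5,
           DecisionNode("sad", 0.5,
               DecisionNode("cough", 0.5,
                   "recommend a cheerful nursery rhyme, e.g. 'Health Song'",
                   "recommend a comforting nursery rhyme"),
               "recommend an upbeat children's song"),
           "recommend an adult playlist")

# probs maps attribute names to the averaged probabilities from the classifiers.
feedback = tree.decide({"child": 0.9, "sad": 0.7, "cough": 0.8})
print(feedback)   # -> recommend a cheerful nursery rhyme, e.g. 'Health Song'
```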
Referring to Fig. 2, a flow chart of an interaction method based on voice attribute classification is provided.
First, the acoustic features of the input voice signal are extracted to generate a first signal (step 100). The main acoustic feature information extracted consists of the fundamental frequency, MFCC, and formant signals of the voice. In addition, to increase the accuracy of classification, this step further extracts signals such as short-time energy, pitch jitter, and harmonic-to-noise ratio on top of the above.
Next, the voice attribute values of the first signal are determined by pattern recognition classifiers trained in advance on a large amount of labeled data, generating a second signal (step 200). This step uses attribute recognition classifiers trained on a large amount of acoustic feature data to identify the probability that a given voice attribute is present. Many classifier types can be chosen, including support vector machines (SVM), Gaussian mixture models (GMM), neural networks (ANN), and deep neural networks (DNN). This embodiment selects deep neural networks for voice attribute classification and designs 8 independent DNNs for the 8 voice attributes.
Finally, feedback information is output based on the second signal (step 300). The present invention uses a binary tree to define the decision rules. At each node of the binary tree, a threshold is set on a particular voice attribute value: if the value exceeds the threshold, the decision moves to the left child node, otherwise to the right child node, until a leaf node is reached, the decision result is obtained, and the feedback information is output.
Step 100 also includes performing digital pre-processing and voice endpoint detection on the voice signal; this processing extracts the effective voice signal and reduces both the interference introduced by silence and noise and the amount of computation.
It should be noted that although the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed to achieve the desired result. On the contrary, some steps may additionally or alternatively be omitted, several steps may be merged and performed as one step, and/or one step may be decomposed into several steps for execution. For example, the acoustic feature extraction step and the voice attribute classification step can be merged and performed as a single step.
The beneficial effect of this embodiment is that a single set of acoustic features is extracted to classify 8 kinds of voice attributes simultaneously, so the characteristics and latent information of the voice signal are mined more fully and speakers are classified in finer detail. More targeted interactive responses can therefore be produced, giving a better user experience. Moreover, no user registration is required; richer voice attributes make up for the absence of registration information, so the system is convenient and flexible to use.
In particular, according to the embodiments of the present disclosure, the method described above with reference to Fig. 2 may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the method of Fig. 2.
The flow charts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the invention. In this regard, each block in a flow chart or block diagram may represent a unit, a program segment, or a part of code, which contains one or more executable instructions for realizing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block of the block diagrams and/or flow charts, and combinations of blocks in the block diagrams and/or flow charts, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features; it should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (14)

1. An interaction system based on voice attribute classification, the system comprising:
an acoustic feature extraction unit, configured to extract acoustic features from an input voice signal and generate a first signal;
a voice attribute classification unit, configured to determine voice attribute values of the first signal by means of attribute classifiers and to output the voice attribute results as a second signal;
an interaction decision-making unit, configured to output feedback information based on the second signal.
2. The system according to claim 1, wherein the acoustic feature extraction unit includes a front-end processing unit, the front-end processing unit being configured to perform digital pre-processing and voice endpoint detection on the input voice signal.
3. The system according to claim 1, wherein the acoustic feature extraction unit is configured to extract the fundamental frequency of the voice, mel-frequency cepstral coefficients (MFCC), and formants.
4. The system according to claim 3, wherein the acoustic feature extraction unit is further configured such that the extracted acoustic features include at least one of the following: a short-time energy feature, pitch jitter and shimmer, and harmonic-to-noise ratio.
5. The system according to claim 1, wherein the voice attribute classification unit includes at least one of the following attribute recognition classifiers: a gender attribute recognition classifier, an age attribute recognition classifier, an emotion attribute recognition classifier, and a health attribute recognition classifier.
6. The system according to claim 1, wherein the attribute recognition classifier adopts a deep neural network (DNN) algorithm.
7. The system according to claim 6, wherein the mode of operation of the attribute recognition classifier is divided into a training mode and a test mode, the training mode adopting two-stage training comprising a pre-training stage and a fine-tuning stage, wherein the pre-training stage adopts an unsupervised restricted Boltzmann machine model and the fine-tuning stage adopts an error back-propagation algorithm.
8. An interaction method based on voice attribute classification, the method comprising:
extracting acoustic features from an input voice signal to generate a first signal;
determining voice attribute values of the first signal through attribute recognition classification, and outputting the voice attribute results as a second signal;
outputting feedback information based on the second signal.
9. The method according to claim 8, wherein extracting acoustic features from the input voice signal includes front-end processing, the front-end processing performing digital pre-processing and voice endpoint detection on the input voice signal.
10. The method according to claim 8, wherein the acoustic features include the fundamental frequency of the voice, mel-frequency cepstral coefficients (MFCC), and formants.
11. The method according to claim 10, wherein the acoustic features further include at least one of the following: a short-time energy feature, pitch jitter and shimmer, and harmonic-to-noise ratio.
12. The method according to claim 8, wherein the second signal is obtained through voice attribute classification including at least one of the following attribute recognition classifications: gender attribute recognition classification, age attribute recognition classification, emotion attribute recognition classification, and health attribute recognition classification.
13. The method according to claim 8, wherein the attribute recognition classification adopts a deep neural network (DNN) algorithm.
14. The method according to claim 13, wherein the mode of operation of the attribute recognition classification is divided into a training mode and a test mode, the training mode adopting two-stage training comprising a pre-training stage and a fine-tuning stage, wherein the pre-training stage adopts an unsupervised restricted Boltzmann machine model and the fine-tuning stage adopts an error back-propagation algorithm.
CN201610244968.8A 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification Active CN105761720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610244968.8A CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610244968.8A CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Publications (2)

Publication Number Publication Date
CN105761720A true CN105761720A (en) 2016-07-13
CN105761720B CN105761720B (en) 2020-01-07

Family

ID=56324445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610244968.8A Active CN105761720B (en) 2016-04-19 2016-04-19 Interactive system and method based on voice attribute classification

Country Status (1)

Country Link
CN (1) CN105761720B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106686267A (en) * 2015-11-10 2017-05-17 中国移动通信集团公司 Method and system for implementing personalized voice service
CN106898355A (en) * 2017-01-17 2017-06-27 清华大学 A kind of method for distinguishing speek person based on two modelings
CN107316635A (en) * 2017-05-19 2017-11-03 科大讯飞股份有限公司 Audio recognition method and device, storage medium, electronic equipment
CN107680599A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 User property recognition methods, device and electronic equipment
CN107886955A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of personal identification method, device and the equipment of voice conversation sample
CN107995370A (en) * 2017-12-21 2018-05-04 广东欧珀移动通信有限公司 Call control method, device and storage medium and mobile terminal
CN108109622A (en) * 2017-12-28 2018-06-01 武汉蛋玩科技有限公司 A kind of early education robot voice interactive education system and method
CN108132995A (en) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 For handling the method and apparatus of audio-frequency information
CN108186033A (en) * 2018-01-08 2018-06-22 杭州草莽科技有限公司 A kind of child's mood monitoring method and its system based on artificial intelligence
WO2018132187A1 (en) * 2017-01-12 2018-07-19 Qualcomm Incorporated Characteristic-based speech codebook selection
CN108701469A (en) * 2017-07-31 2018-10-23 深圳和而泰智能家居科技有限公司 Cough sound recognition methods, equipment and storage medium
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109102805A (en) * 2018-09-20 2018-12-28 北京长城华冠汽车技术开发有限公司 Voice interactive method, device and realization device
CN109165284A (en) * 2018-08-22 2019-01-08 重庆邮电大学 A kind of financial field human-computer dialogue intension recognizing method based on big data
CN110021308A (en) * 2019-05-16 2019-07-16 北京百度网讯科技有限公司 Voice mood recognition methods, device, computer equipment and storage medium
CN110379441A (en) * 2019-07-01 2019-10-25 特斯联(北京)科技有限公司 A kind of voice service method and system based on countering type smart network
CN110600042A (en) * 2019-10-10 2019-12-20 公安部第三研究所 Method and system for recognizing gender of disguised voice speaker
CN111179915A (en) * 2019-12-30 2020-05-19 苏州思必驰信息科技有限公司 Age identification method and device based on voice
CN111599342A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Tone selecting method and system
CN111772422A (en) * 2020-06-12 2020-10-16 广州城建职业学院 Intelligent crib
CN111989742A (en) * 2018-04-13 2020-11-24 三菱电机株式会社 Speech recognition system and method for using speech recognition system
CN112530418A (en) * 2019-08-28 2021-03-19 北京声智科技有限公司 Voice wake-up method, device and related equipment
CN113143570A (en) * 2021-04-27 2021-07-23 福州大学 Multi-sensor fusion feedback adjustment snore stopping pillow

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1107227A2 (en) * 1999-11-30 2001-06-13 Sony Corporation Voice processing
JP2003345385A (en) * 2002-05-30 2003-12-03 Matsushita Electric Ind Co Ltd Voice recognition and discrimination device
CN1564245A (en) * 2004-04-20 2005-01-12 上海上悦通讯技术有限公司 Stunt method and device for baby's crying
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
CN101201980A (en) * 2007-12-19 2008-06-18 北京交通大学 Remote Chinese language teaching system based on voice affection identification
US20100138223A1 (en) * 2007-03-26 2010-06-03 Takafumi Koshinaka Speech classification apparatus, speech classification method, and speech classification program
US8239194B1 (en) * 2011-07-28 2012-08-07 Google Inc. System and method for multi-channel multi-feature speech/noise classification for noise suppression
CN103117060A (en) * 2013-01-18 2013-05-22 中国科学院声学研究所 Modeling approach and modeling system of acoustic model used in speech recognition
CN103546503A (en) * 2012-07-10 2014-01-29 百度在线网络技术(北京)有限公司 Voice-based cloud social system, voice-based cloud social method and cloud analysis server

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1107227A2 (en) * 1999-11-30 2001-06-13 Sony Corporation Voice processing
JP2003345385A (en) * 2002-05-30 2003-12-03 Matsushita Electric Ind Co Ltd Voice recognition and discrimination device
CN1564245A (en) * 2004-04-20 2005-01-12 上海上悦通讯技术有限公司 Stunt method and device for baby's crying
CN1975856A (en) * 2006-10-30 2007-06-06 邹采荣 Speech emotion identifying method based on supporting vector machine
US20100138223A1 (en) * 2007-03-26 2010-06-03 Takafumi Koshinaka Speech classification apparatus, speech classification method, and speech classification program
CN101201980A (en) * 2007-12-19 2008-06-18 北京交通大学 Remote Chinese language teaching system based on voice affection identification
US8239194B1 (en) * 2011-07-28 2012-08-07 Google Inc. System and method for multi-channel multi-feature speech/noise classification for noise suppression
CN103546503A (en) * 2012-07-10 2014-01-29 百度在线网络技术(北京)有限公司 Voice-based cloud social system, voice-based cloud social method and cloud analysis server
CN103117060A (en) * 2013-01-18 2013-05-22 中国科学院声学研究所 Modeling approach and modeling system of acoustic model used in speech recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
怀进鹏: 《智能计算机研究进展 863计划智能计算机主题学术会议论文集》 (Advances in Intelligent Computer Research: Proceedings of the 863 Program Intelligent Computer Theme Academic Conference), 31 March 2001 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106686267A (en) * 2015-11-10 2017-05-17 中国移动通信集团公司 Method and system for implementing personalized voice service
CN107886955A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of personal identification method, device and the equipment of voice conversation sample
WO2018132187A1 (en) * 2017-01-12 2018-07-19 Qualcomm Incorporated Characteristic-based speech codebook selection
US10878831B2 (en) 2017-01-12 2020-12-29 Qualcomm Incorporated Characteristic-based speech codebook selection
CN106898355A (en) * 2017-01-17 2017-06-27 清华大学 A kind of method for distinguishing speek person based on two modelings
CN106898355B (en) * 2017-01-17 2020-04-14 北京华控智加科技有限公司 Speaker identification method based on secondary modeling
CN107316635A (en) * 2017-05-19 2017-11-03 科大讯飞股份有限公司 Audio recognition method and device, storage medium, electronic equipment
CN108701469B (en) * 2017-07-31 2023-06-20 深圳和而泰智能控制股份有限公司 Cough sound recognition method, device, and storage medium
CN108701469A (en) * 2017-07-31 2018-10-23 深圳和而泰智能家居科技有限公司 Cough sound recognition methods, equipment and storage medium
CN107680599A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 User property recognition methods, device and electronic equipment
CN108132995A (en) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 For handling the method and apparatus of audio-frequency information
CN107995370A (en) * 2017-12-21 2018-05-04 广东欧珀移动通信有限公司 Call control method, device and storage medium and mobile terminal
CN108109622A (en) * 2017-12-28 2018-06-01 武汉蛋玩科技有限公司 A kind of early education robot voice interactive education system and method
CN108186033A (en) * 2018-01-08 2018-06-22 杭州草莽科技有限公司 A kind of child's mood monitoring method and its system based on artificial intelligence
CN111989742A (en) * 2018-04-13 2020-11-24 三菱电机株式会社 Speech recognition system and method for using speech recognition system
CN109165284A (en) * 2018-08-22 2019-01-08 重庆邮电大学 A kind of financial field human-computer dialogue intension recognizing method based on big data
CN109102805A (en) * 2018-09-20 2018-12-28 北京长城华冠汽车技术开发有限公司 Voice interactive method, device and realization device
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN111599342A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Tone selecting method and system
CN110021308A (en) * 2019-05-16 2019-07-16 北京百度网讯科技有限公司 Voice mood recognition methods, device, computer equipment and storage medium
CN110379441A (en) * 2019-07-01 2019-10-25 特斯联(北京)科技有限公司 A kind of voice service method and system based on countering type smart network
CN112530418A (en) * 2019-08-28 2021-03-19 北京声智科技有限公司 Voice wake-up method, device and related equipment
CN110600042B (en) * 2019-10-10 2020-10-23 公安部第三研究所 Method and system for recognizing gender of disguised voice speaker
CN110600042A (en) * 2019-10-10 2019-12-20 公安部第三研究所 Method and system for recognizing gender of disguised voice speaker
CN111179915A (en) * 2019-12-30 2020-05-19 苏州思必驰信息科技有限公司 Age identification method and device based on voice
CN111772422A (en) * 2020-06-12 2020-10-16 广州城建职业学院 Intelligent crib
CN113143570A (en) * 2021-04-27 2021-07-23 福州大学 Multi-sensor fusion feedback adjustment snore stopping pillow
CN113143570B (en) * 2021-04-27 2023-08-11 福州大学 Snore relieving pillow with multiple sensors integrated with feedback adjustment

Also Published As

Publication number Publication date
CN105761720B (en) 2020-01-07

Similar Documents

Publication Publication Date Title
CN105761720A (en) Interaction system based on voice attribute classification, and method thereof
CN109243494B (en) Children emotion recognition method based on multi-attention mechanism long-time memory network
Schuller et al. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture
Schuller et al. Speaker independent speech emotion recognition by ensemble classification
Tong et al. A comparative study of robustness of deep learning approaches for VAD
CN112581979B (en) Speech emotion recognition method based on spectrogram
Joshy et al. Automated dysarthria severity classification: A study on acoustic features and deep learning techniques
Ghai et al. Emotion recognition on speech signals using machine learning
Samantaray et al. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages
Cnn Speech emotion recognition using convolutional neural network (CNN)
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN110085216A (en) A kind of vagitus detection method and device
CN111916066A (en) Random forest based voice tone recognition method and system
Caihua Research on multi-modal mandarin speech emotion recognition based on SVM
Přibil et al. GMM-based speaker age and gender classification in Czech and Slovak
Khan et al. Quranic reciter recognition: a machine learning approach
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
CN111081273A (en) Voice emotion recognition method based on glottal wave signal feature extraction
Watrous Phoneme discrimination using connectionist networks
Ling An acoustic model for English speech recognition based on deep learning
CN108899046A (en) A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
Gomes et al. i-vector algorithm with Gaussian Mixture Model for efficient speech emotion recognition
Alshamsi et al. Automated speech emotion recognition on smart phones
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant