CN104036776A - Speech emotion identification method applied to mobile terminal - Google Patents


Info

Publication number
CN104036776A
Authority
CN
China
Prior art keywords
emotion
speech
speaker
database
recognizer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410218988.9A
Other languages
Chinese (zh)
Inventor
毛峡
阿尔伯托·罗贝塔
陈立江
安东尼奥·托特拉
彭一平
马修·都德斯卡都
保罗·马尔切利尼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ALBERTO ROVETTA
Original Assignee
ALBERTO ROVETTA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ALBERTO ROVETTA filed Critical ALBERTO ROVETTA
Priority to CN201410218988.9A
Publication of CN104036776A
Legal status: Pending


Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention provides a method for extracting speech emotion information. Speech data are acquired or transmitted by a mobile phone, computer, or recording pen during data acquisition or communication, and the speaker's emotions are identified using both speaker-independent and speaker-dependent methods. The speaker-independent emotion information extraction method consists of two parts: speech database recording and speech emotion modeling. The speech database recording part, comprising at least one emotional speech database, serves as the reference for training the emotion recognizer; the speech emotion modeling part establishes the speech emotion model that acts as the emotion recognizer. The speaker-dependent emotion information extraction method, whose accuracy can reach 80%, identifies the emotions in a speech signal by statistically adjusting internal parameters. The method can further identify other, more complex emotions from a group of special parameters describing the basic emotions.

Description

A speech emotion recognition method applied to mobile terminals
Technical field
The present invention relates to a method for recognizing human emotion, and in particular to the fields of signal processing, pattern recognition, and affective computing.
Background technology
With the steady growth of computing power and the development of artificial intelligence and pattern recognition algorithms, giving computers the ability to communicate with people is no longer out of reach. In everyday human communication, speech is a primary information carrier and conveys a great deal of what the speaker intends to express. Traditional speech recognition algorithms focus only on the word content of speech and its literal meaning, and ignore the emotion contained in the speech.
Affective computing is the research field that studies how computers perceive and express human emotion. Emotion plays a substantial role in interpersonal communication: through emotional exchange, the two parties to a conversation deepen mutual understanding and trust and create a more harmonious communication environment. Affective computing gives computers the ability to perceive human emotion. Because a speaker's vocal features vary with emotional state and therefore carry a large amount of emotional information, speech emotion recognition technology, which analyzes the speech signal with pattern recognition techniques to extract emotion-related information and judge the speaker's affective state, is significant for both affective computing and human-computer interaction.
At present, research on speech emotion recognition largely remains at the theoretical stage, and practical applications are still rare. Moreover, because recognition accuracy and generality are often hard to balance, existing theoretical studies lean one way or the other: speaker-dependent recognition methods are adopted to improve the recognition rate, while speaker-independent methods improve universality. An effective solution to this problem in practical applications is still lacking.
Applying emotion recognition to mobile terminals such as smartphones and tablet computers can make the interaction between users and devices more natural and harmonious, allowing the users at both ends of the screen to express and perceive each other's emotions in an intuitive way. It can also provide a user emotion monitoring platform based on mobile devices that uses the user's negative emotions as trigger information to monitor hazardous events in real time and safeguard personal safety.
Summary of the invention
The present invention is a speech emotion recognition method for mobile computing platforms that processes speech data to identify the dominant emotion the user is currently expressing through speech. By combining speaker-independent and speaker-dependent speech emotion recognition methods, it greatly improves both the recognition accuracy and the generality of the method in practical use.
The speech emotion recognition method comprises the following steps, as shown in Fig. 1:
A) inputting a predetermined recording script, recording environmental information, acquiring speech data, performing usability screening and emotion category labeling on the speech data, and recording an emotional speech database as the reference for the emotion recognizer;
B) extracting feature information from the speech data in the emotional speech database, choosing specific feature combinations, forming a training set through feature dimensionality reduction, building a multi-layer emotion recognizer, and training the emotion recognizer with the training set, thereby establishing and training the speaker-independent recognizer;
C) obtaining the personal information corresponding to the speech data in the database, building a personalized emotion model, establishing an emotion recognizer, then calibrating the internal parameters of the emotion recognizer against the speech data in the database and training it, thereby obtaining the speaker-dependent recognizer;
D) acquiring the user's speech data with a voice capture device;
E) analyzing the user's speech data to judge whether it can be processed by the speaker-dependent recognizer; if so, performing emotion recognition with the speaker-dependent recognizer and proceeding to step G); if not, proceeding to step F);
F) performing emotion recognition with the speaker-independent recognizer;
G) obtaining the emotion recognition result.
A feature of the present invention is that the emotion recognition method it uses has a high recognition accuracy, generally no lower than 80%. The model established by the method identifies emotions by using a set of special parameters (features) that associate entries in the database with specific emotions. To achieve higher confidence and precision, the features used are expressed in statistical form. The system is composed of a recording device, a processor, a transmission system, and an electronic receiving device with emotion recognition capability. The system can transmit recognition results as images, text, or other data forms. Before first use, the system needs to collect the speaker's emotional speech for a brief calibration.
Brief description of the drawings
Fig. 1 is the flow chart of this speech emotion recognition method;
Fig. 2 shows one of the features extracted by this method, the frame-by-frame autocorrelation density;
Fig. 3 is a schematic diagram of the multi-layer recognizer structure;
Fig. 4 illustrates the separability, in a high-dimensional space, of the features obtained by the support vector machine;
Fig. 5 is the topology of the multi-layer artificial neural network;
Fig. 6 shows the recognition rate of each layer of the multi-layer recognizer;
Fig. 7 shows a typical application scenario;
Fig. 8 is an image sequence expressing positive emotion;
Fig. 9 is the petal figure expressing the speaker's affective state, in which the length of each "petal" represents the intensity of one emotion; from left to right, the panels represent a "strong emotion-balance state", a "weak emotion-balance state", and an "emotion-imbalance state";
Fig. 10 shows the positions of various emotions in the AV emotional space;
Fig. 11 is a schematic diagram of the core classification decision model of the emotion recognition algorithm.
Embodiment
The emotion recognition method adopted by the present invention is divided into three stages: emotion identification, emotion interpretation, and emotion transmission. The rules defined by the Arousal-Valence (AV) emotional space are adopted as the recognition criterion. The AV emotional space is represented by a plane formed by two coordinate axes, where A is arousal (positive or negative) and V is valence (positive or negative), as shown in Fig. 10. The emotion recognition method adopted in the present invention is novel in that it does not perform statistical analysis over a large population of speakers, and it does not identify emotions by determining mean values of emotion-related parameters. In traditional statistics-based emotion recognition, the same emotion may be expressed in multiple ways, influenced by factors such as the tone of the speech, the speaker's personality, vocal-tract characteristics, and the duration of the utterance, and it is difficult to give a definite objective description. The present invention has two different emotion recognition methods: a speaker-independent method, which has the advantage of not relying on personal information and being easy to use, and a speaker-dependent method, which is closely tied to the speaker's identity and can be applied to complex scenarios. The two methods complement each other, which significantly improves the accuracy and generality of the emotion recognition result.
The recording of the emotional speech database comprised in the present invention must take into account features of the speech data such as naturalness, emotional polarity, and emotion intensity. To improve the naturalness of the speech data as far as possible and reduce the influence of controlled conditions on the subjects, the experimental environment should be briefly prepared before data acquisition, for example by adjusting the ambient temperature and humidity, lighting brightness, and noise level so that it approaches an everyday living environment. In addition, when collecting high-intensity emotions such as anger and surprise, the subjects should be reasonably induced into the appropriate affective state. During acquisition, the database collection specification should be strictly followed, and the acquisition system should be operated by designated personnel. Each acquired speech datum should be screened for usability and labeled with an emotion category, and the results should be entered into the database together with the speech data.
The two emotion recognition methods comprised in the present invention are described below:
Method one is the speaker-independent emotion recognition method, which comprises the steps of database construction, feature extraction, feature dimensionality reduction, construction of the multi-layer recognizer, and classifier selection. The database (also called a corpus) consists of speech entries recorded by multiple speakers of different ages and backgrounds expressing different emotions; these entries serve as the raw material for feature extraction. Feature extraction obtains from the speech signal a variety of emotion-related parameters, i.e. features. To achieve better robustness and precision, this method uses a variety of features, one of which (the autocorrelation density) is shown in Fig. 2. The extracted features are then combined into different feature sets by a cyclic feature-selection algorithm guided by the Fisher ratio. The Fisher ratio is calculated by the following formula:
F = \operatorname{diag}(S_a ./ S_b)

where

S_b = \frac{1}{n}\sum_{j=1}^{c}\sum_{i=1}^{n_j}\lVert x_{ij} - m_j\rVert^2, \qquad S_a = \frac{1}{n}\sum_{j=1}^{c} n_j\lVert m_j - m\rVert^2.

Given n d-dimensional sample feature vectors belonging to c classes in total, X_j denotes the set of samples of class j, x_{ij} denotes the i-th sample in class j, n_j is the number of samples in X_j, m_j is defined as the mean of the class-j samples, and m is the mean of the m_j.
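For illustration, the Fisher ratio above can be computed per feature dimension as the ratio of between-class to within-class scatter. The following NumPy sketch follows the notation just defined; the function name, array layout, and the small epsilon guard are illustrative assumptions rather than part of the patented method.

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-dimension Fisher ratio F = diag(S_a ./ S_b).

    X: (n, d) array of n d-dimensional feature vectors; y: (n,) class labels.
    """
    classes = np.unique(y)
    n, d = X.shape
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])  # m_j
    counts = np.array([(y == c).sum() for c in classes])               # n_j
    m = class_means.mean(axis=0)                                       # mean of the m_j
    # Within-class scatter S_b and between-class scatter S_a, per dimension.
    S_b = sum(((X[y == c] - class_means[i]) ** 2).sum(axis=0)
              for i, c in enumerate(classes)) / n
    S_a = (counts[:, None] * (class_means - m) ** 2).sum(axis=0) / n
    return S_a / (S_b + 1e-12)   # elementwise ratio; epsilon avoids division by zero
```

Features with larger Fisher ratios discriminate better between emotion classes and are therefore favored when the feature sets are assembled.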
The feature set is further processed through feature dimensionality reduction, which removes redundant dimensions while retaining the key information accounting for 95% of the features; the dimensionality reduction algorithms used include principal component analysis, independent component analysis, and stochastic neighbor embedding. The reduced features are passed to the multi-layer recognizer. The recognizer uses the entries in the speech database as the training reference to adjust the parameters of its internal classifiers. The structure of the multi-layer recognizer is shown in Fig. 3: it consists of multiple levels that jointly perform emotion recognition. The first layer divides the input speech into two subclasses, each subsequent layer subdivides further, and the final layer yields the precise recognition result. The hierarchical model is established as follows:
the c emotion classes are divided into two groups, and there are M such partitioning methods in total, where \binom{c}{i} denotes the number of combinations of choosing i classes out of c.
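The dimensionality-reduction step described above can be sketched with principal component analysis, one of the reducers the method names; the 0.95 component fraction below stands in for the "95% key information" criterion and the helper name is an assumption, not the patent's interface.

```python
from sklearn.decomposition import PCA

def reduce_features(train_features, test_features=None, keep=0.95):
    """Project feature sets onto the principal components retaining ~95% of the variance."""
    pca = PCA(n_components=keep)              # a float in (0, 1) keeps that share of variance
    reduced_train = pca.fit_transform(train_features)
    reduced_test = pca.transform(test_features) if test_features is not None else None
    return reduced_train, reduced_test, pca
```

The fitted reducer is then reused at recognition time so that incoming speech features are projected into the same reduced space as the training set.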
Each layer contains one or more classifiers. The training of a single classifier requires iterating and selecting the optimal feature set to obtain the best output. The classifier types used in this method can be support vector machines (Fig. 4) and multi-layer artificial neural networks (Fig. 5). The parameter of the support vector machine is selected according to the following formula:

D = \exp\!\left(K \times \frac{1}{N_f}\sum_{i=1}^{N_f} F_i\right)

where D is the penalty factor, K is a constant, N_f is the number of features adopted by this classifier, and F_i is the Fisher ratio of each input feature. The value of the penalty factor is related to the concrete application: when the discriminative performance of the affective features is good, the penalty factor can be raised appropriately to improve the classification capability of the support vector machine; when the discriminative performance of the affective features is poor, the penalty factor must be lowered appropriately to ensure the classifier retains good generalization ability.
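A minimal sketch of the penalty-factor rule above, using scikit-learn's SVC; mapping D directly onto the SVC regularization parameter C, the choice of kernel, and the default K = 1.0 are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def build_svm(fisher_ratios, K=1.0):
    """Build one layer classifier with penalty factor D = exp(K * mean(F_i))."""
    D = np.exp(K * np.mean(fisher_ratios))
    # Larger D when the selected features separate the classes well;
    # smaller D preserves generalization when they do not.
    return SVC(C=D, kernel="rbf")
```

Each classifier in a layer of the multi-layer recognizer can be instantiated this way from the Fisher ratios of its selected features.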
The complete multi-layer recognizer provides an emotion recognition rate of about 80% in the speaker-independent case, as shown in Fig. 6. The feature extraction, feature dimensionality reduction, and trained multi-layer classifier adopted by this method form a complete system for identifying the emotional information contained in any speech and outputting it.
Method two is the speaker-dependent emotion recognition method, which takes a specific speaker as the control subject of the system. Its emotion recognition rate is no lower than 80%, so it can be applied in the speech recognition field; the recognition process is particularly reliable when only a few basic emotions need to be identified. Applying the method involves the following stages: collection of the speaker's personal information (such as gender and age) and privacy protection; system initialization, in which the speaker expresses a particular emotion with clearly enunciated speech and the internal parameters are calibrated, producing personalized emotion information according to the speaker's characteristics; recording and storage of the personalized information; invoking the emotion recognition method to identify the speaker's emotion; triggering emotion recognition periodically according to the program logic; and making judgments and selections according to the recognition results. System initialization takes only a very short time (a few seconds), after which the system can operate normally. By contrast, other systems need a longer processing time (several minutes) and more speakers for initialization. The recognition result of this method can be expressed and transmitted as text, pictures, or other data forms, and applied to electronic devices such as mobile terminals, televisions, and tablet computers.
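The patent does not spell out the calibration arithmetic for the speaker-dependent method. The sketch below shows one plausible form, under the assumption that the short calibration utterance is used to store per-speaker baseline feature statistics against which later features are normalized; the class, its fields, and the normalization scheme are all illustrative assumptions.

```python
import numpy as np

class SpeakerProfile:
    """Per-speaker record: personal information plus baseline feature statistics."""

    def __init__(self, speaker_id, gender=None, age=None):
        self.speaker_id = speaker_id
        self.gender = gender
        self.age = age
        self.mean = None
        self.std = None

    def calibrate(self, calibration_features):
        """Fit baseline statistics from the few-second calibration utterance."""
        feats = np.asarray(calibration_features)
        self.mean = feats.mean(axis=0)
        self.std = feats.std(axis=0) + 1e-12

    def normalize(self, features):
        """Re-express incoming features relative to this speaker's baseline."""
        return (np.asarray(features) - self.mean) / self.std
```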
As of the filing of this application, no application platform exists for speech-based emotional interaction between humans and machines on electronic devices such as mobile terminals. Current systems need to monitor the user's body with physiological-signal sensors to infer mental state, which is not necessarily closely related to emotion; the present invention is therefore novel.
By combining the above two emotion recognition methods, a recognition method is formed that can accept speech input from any speaker without first determining whether the speaker is registered with the system. For an unregistered speaker, method one can be used for general-purpose recognition; for a registered user, method two, which is more accurate and easier to use, can be used for speaker-specific recognition.
A typical application process of the present invention is described as follows:
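The combined use of the two methods amounts to a simple dispatch rule: route registered speakers to the speaker-dependent recognizer and fall back to the speaker-independent one otherwise. The sketch below illustrates this; the recognizer interfaces (`can_process`, `predict`) and the dictionary of per-speaker recognizers are assumptions for illustration only.

```python
def recognize_emotion(speech_data, speaker_id, dependent_recognizers, independent_recognizer):
    """Use the speaker-dependent recognizer when available, otherwise the universal one."""
    recognizer = dependent_recognizers.get(speaker_id)    # None for unregistered speakers
    if recognizer is not None and recognizer.can_process(speech_data):
        return recognizer.predict(speech_data)            # method two: higher accuracy
    return independent_recognizer.predict(speech_data)    # method one: universal fallback
```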
As shown in Fig. 7, user 1 expresses, through voice signal 4, the emotion produced by brain 2. The voice signal is received and converted by receiving device 5. Receiving device 5 can be a mobile terminal (such as a mobile phone or tablet computer) or another device capable of receiving and processing voice signals. By adopting the emotion recognition method proposed by the present invention and analyzing the parameters of the voice signal, this device recognizes the emotion of user 1; the recognition result is displayed directly on the device, for example as an image, or transferred by transmission means 6 to signal receiver 7, and forwarded to server 8 and internet 9. The foregoing description is only one possible application scenario; it is not the only one, and the method can be applied to any communication system with data acquisition and transmission capability. Signal 6 is transmitted via satellite signal 10 to mobile terminal 11 of user 12 (or a receiving device of equal capability, such as a television or computer). User 12 receives the information of signal 13 representing the emotion and can immediately understand the current emotion of user 1 without any manual operation or voice conversation with user 1. Here, the typical form of transmitted signal 6 can be a color picture, as shown in Fig. 8 and Fig. 9. Depending on the user's identity and the influence of different environments, the average emotion recognition rate of the system is no lower than 80% and is generally about 90%.
Fig. 8 is a picture sequence that expresses emotion: a flower bursting into bloom can express the user's positive emotion.
Fig. 9 shows a method proposed by the present invention for expressing the user's emotion as an image, called a petal figure. Each image is composed of six angled petal shapes (hereinafter referred to as petals); the size of a petal expresses the intensity of one of the user's emotions, so the whole image reflects the user's current affective state. In Fig. 9a the petals are of equal length, representing an emotion-balance state, with petals of different colors representing different emotions. Fig. 9b also represents an emotion-balance state, but a weaker one, tending toward an emotionless neutral state. Fig. 9c represents a state in which two emotions dominate and are very strong, so the user is in an emotion-imbalance state. This method is flexible: it is not limited to a simple division of emotions, and it applies the mathematical model proposed by the present invention, with "emotion percentage identification" and "quadrant identification".
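A petal figure of this kind can be drawn as six wedge-shaped bars on a polar axis, one per emotion, with bar length proportional to intensity. The matplotlib sketch below is illustrative only; the emotion labels, colors, and petal widths are assumptions and not prescribed by the patent.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_petals(intensities, labels=("happy", "angry", "sad", "surprised", "afraid", "neutral")):
    """Draw a six-petal figure: each petal's length encodes one emotion's intensity."""
    angles = np.linspace(0, 2 * np.pi, len(intensities), endpoint=False)
    ax = plt.subplot(projection="polar")
    ax.bar(angles, intensities, width=2 * np.pi / len(intensities) * 0.8,
           color=plt.cm.Set2.colors[:len(intensities)], alpha=0.8)
    ax.set_xticks(angles)
    ax.set_xticklabels(labels)
    ax.set_yticklabels([])
    return ax

# Equal petals depict a balanced emotional state (as in Fig. 9a); two dominant
# petals would depict the unbalanced state of Fig. 9c.
plot_petals([0.8, 0.8, 0.8, 0.8, 0.8, 0.8])
plt.show()
```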
This model is as described below:
The two-dimensional coordinate system of the AV emotional space has four quadrants, as shown in Fig. 10. The quadrants are numbered 1 to 4 counterclockwise and represent different combinations of affective states. The points in the figure represent the basic emotion categories used to understand human emotion. Each emotion corresponds one-to-one with a point, and each point corresponds to a group of parameters or features produced from the voice signal through mathematical procedures such as emotion recognition. In practice, excited and positive emotions lie in the upper-right (first) quadrant, and pessimistic and negative emotions lie in the lower-left (third) quadrant. Each point represents an observed speaker emotion. There are individual differences between speakers, so the mapping has its particularities, but this method can identify any emotion through the form of emotion combinations. In addition, because redundancy guarantees and compensation algorithms are adopted in the recognition process, this method can perceive the user's arousal and valence during the test. Owing to the large number of features used, the confidence of this method is no lower than 83%, which is higher than other existing systems.
Fig. 11 is a schematic diagram of the core classification decision model of the emotion recognition algorithm. As shown in Fig. 11, a single classifier responds to and classifies the speech in an incoming event, which simplifies the decision logic of the processing system and avoids the need to create processing mechanisms with cumbersome conditions, such as decision trees. The model scans the processed speech and divides it into frames, for example by setting a vocabulary length, and performs feature extraction, normalization, feature dimensionality reduction, and affective-type recognition on each frame, yielding the combination of emotions contained in the speech. The model provides the ability to define the emotion category after the test, so the method can derive the emotion category from the emotion combination, namely the emotion occupying the highest percentage in the combination. The model that obtains the emotion-percentage combination and the identification in the AV emotional space are independent and can operate in parallel, finally outputting three results: the emotion combination, the dominant emotion, and the quadrant in which the emotion lies.
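The quadrant assignment in the AV space reduces to the signs of the arousal and valence coordinates, numbered counterclockwise from the upper-right quadrant. A minimal sketch, assuming valence is the horizontal axis, arousal the vertical axis, and that an (arousal, valence) pair has already been estimated for the utterance:

```python
def av_quadrant(arousal, valence):
    """Map an (arousal, valence) point to its quadrant, numbered 1-4 counterclockwise.

    Quadrant 1 (upper right) holds excited/positive emotions;
    quadrant 3 (lower left) holds pessimistic/negative ones.
    """
    if valence >= 0:
        return 1 if arousal >= 0 else 4
    return 2 if arousal >= 0 else 3
```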
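The decision model of Fig. 11 can be read as a single pipeline: frame the utterance, extract features per frame, normalize, reduce dimensionality, classify each frame, and report the emotion combination, the dominant emotion, and the AV quadrant. The simplified sketch below reuses the av_quadrant helper sketched earlier; the frame length, the two example features (short-time energy and zero-crossing rate), the classifier/reducer interfaces, and the av_points lookup of (arousal, valence) pairs per emotion are all illustrative assumptions.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=200):
    """Split the speech signal into overlapping frames (e.g. 25 ms / 12.5 ms at 16 kHz)."""
    return np.array([signal[i:i + frame_len]
                     for i in range(0, len(signal) - frame_len + 1, hop)])

def frame_features(frames):
    """Two illustrative per-frame features: short-time energy and zero-crossing rate."""
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    return np.column_stack([energy, zcr])

def analyze_utterance(signal, reducer, classifier, av_points):
    """Return the emotion combination, the dominant emotion, and its AV quadrant."""
    feats = frame_features(frame_signal(signal))
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)   # normalization
    labels = classifier.predict(reducer.transform(feats))                # per-frame emotions
    unique, counts = np.unique(labels, return_counts=True)
    combination = dict(zip(unique, counts / counts.sum()))               # emotion percentages
    dominant = unique[counts.argmax()]                                   # highest percentage
    quadrant = av_quadrant(*av_points[dominant])                         # (arousal, valence) lookup
    return combination, dominant, quadrant
```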
The present invention can also be applied in the medical field, for example to monitor the moods of both parties during outpatient diagnosis and treatment, or for the long-term monitoring and treatment of psychiatric patients. It can also be used in instant messaging to provide real-time, no-wait affective interaction, or to create an easy-to-use emotional communication platform between family members and friends who cannot meet because they live far apart. For the rapidly growing number of empty-nest elderly people, the present invention can be used to convey and relieve the unhappiness and depression caused by living alone. The present invention can record the history of the user's mood and can easily be connected to and integrated with other databases.
The present invention can be realized as a mobile application (also called an app) on, but not limited to, current intelligent mobile terminals, such as operating system platforms including Apple iOS, Google Android, and Microsoft Windows, or on other communication systems. The novel concepts and system proposed by the present invention help promote the application of emotional expression in the mobile computing field.

Claims (9)

1. A speech emotion recognition method for mobile computing platforms, which processes voice information with both a speaker-independent mode and a speaker-dependent mode and identifies the dominant emotion the speaker is currently expressing through speech, comprising the following steps:
A) recording an emotional speech database as the reference for the emotion recognizer;
B) establishing and training the speaker-independent recognizer and building a multi-layer detector model;
C) establishing and training the speaker-dependent recognizer, using the emotional speech database or the user's personalized information as the reference to correct the recognizer's internal parameters;
D) acquiring the user's speech data;
E) analyzing the user's speech data to judge whether it can be processed by the speaker-dependent recognizer; if so, performing emotion recognition with the speaker-dependent recognizer and proceeding to step G); if not, proceeding to step F);
F) performing emotion recognition with the speaker-independent recognizer;
G) obtaining the emotion recognition result.
2. The method of claim 1, wherein the step of recording the emotional speech database comprises:
1) inputting a predetermined recording script;
2) recording information about the recording environment;
3) acquiring speech data according to the requirements of the recording script;
4) screening the speech data for usability and entering the usable data into the database;
5) assigning an emotion category to each piece of data and labeling it accordingly;
6) repeating steps 3) to 5) until all the speech data of the emotional speech database have been obtained.
3. The method of claim 1, wherein the speaker-independent recognizer is established and trained as follows:
1) extracting feature information from the speech data in the emotional speech database;
2) choosing specific feature combinations to form a feature set;
3) performing feature dimensionality reduction on the feature set to obtain a training set;
4) establishing the speaker-independent emotion recognizer from multiple emotion classifiers, the emotion recognizer having a multi-layer structure in which each layer progressively subdivides the emotion categories of the speech signal;
5) training the emotion recognizer with the above training set.
4. The method of claim 1, wherein the speaker-dependent recognizer is established and trained as follows:
1) obtaining the personal information corresponding to the speech data in the database;
2) building a personalized emotion model according to this personal information and establishing the emotion recognizer;
3) calibrating the internal parameters of the emotion recognizer against the speech data in the database and training the emotion recognizer.
5. The method of claim 1, comprising, between step D) and step E), the steps of:
1) obtaining information about the user's surrounding environment and automatically adjusting the parameters of the emotion recognizer;
2) performing noise reduction and environment-specific optimization on the acquired speech data.
6. The method of claim 2, wherein the recorded information about the recording environment comprises ambient temperature, humidity, noise level, and brightness.
7. The method of claim 3, wherein the feature information comprises: energy, zero-crossing rate, fundamental frequency, formants, spectral centroid, cutoff frequency, autocorrelation density, fractal dimension, and Mel-frequency cepstral coefficients.
8. The method of claim 4, wherein the personal information comprises: gender, age, occupation, and nationality.
9. The method of any one of claims 1-8, used for:
a smart device that can perceive the emotion expressed by the user through speech and accordingly change its operating logic, allowing the user to control the smart device by speech emotion; or
automatic alarming, in which the device can perceive dangerous emotions in speech and transmit them to a monitoring center in real time; or
medical equipment that, in cooperation with a medical electronic system, monitors a patient's emotional state and displays, monitors, processes, and stores the patient's affective data in real time.
CN201410218988.9A 2014-05-22 2014-05-22 Speech emotion identification method applied to mobile terminal Pending CN104036776A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410218988.9A CN104036776A (en) 2014-05-22 2014-05-22 Speech emotion identification method applied to mobile terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410218988.9A CN104036776A (en) 2014-05-22 2014-05-22 Speech emotion identification method applied to mobile terminal

Publications (1)

Publication Number Publication Date
CN104036776A true CN104036776A (en) 2014-09-10

Family

ID=51467523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410218988.9A Pending CN104036776A (en) 2014-05-22 2014-05-22 Speech emotion identification method applied to mobile terminal

Country Status (1)

Country Link
CN (1) CN104036776A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538043A (en) * 2015-01-16 2015-04-22 北京邮电大学 Real-time emotion reminder for call
CN104616666A (en) * 2015-03-03 2015-05-13 广东小天才科技有限公司 Method and device for improving dialogue communication effect based on speech analysis
CN104616666B (en) * 2015-03-03 2018-05-25 广东小天才科技有限公司 A kind of method and device for improving dialogue communication effectiveness based on speech analysis
CN104732981A (en) * 2015-03-17 2015-06-24 北京航空航天大学 Voice annotation method for Chinese speech emotion database combined with electroglottography
CN104732981B (en) * 2015-03-17 2018-01-12 北京航空航天大学 A kind of voice annotation method of the Chinese speech sensibility database of combination ElectroglottographicWaveform
CN106531191A (en) * 2015-09-10 2017-03-22 百度在线网络技术(北京)有限公司 Method and device for providing danger report information
CN106548788A (en) * 2015-09-23 2017-03-29 中国移动通信集团山东有限公司 A kind of intelligent emotion determines method and system
CN106548788B (en) * 2015-09-23 2020-01-07 中国移动通信集团山东有限公司 Intelligent emotion determining method and system
CN105596016A (en) * 2015-12-23 2016-05-25 王嘉宇 Human body psychological and physical health monitoring and managing device and method
CN105575404A (en) * 2016-01-25 2016-05-11 薛明博 Psychological testing method and psychological testing system based on speed recognition
CN107305773A (en) * 2016-04-15 2017-10-31 美特科技(苏州)有限公司 Voice mood discrimination method
CN105869657A (en) * 2016-06-03 2016-08-17 竹间智能科技(上海)有限公司 System and method for identifying voice emotion
WO2018023517A1 (en) * 2016-08-04 2018-02-08 易晓阳 Voice interactive recognition control system
WO2018023518A1 (en) * 2016-08-04 2018-02-08 易晓阳 Smart terminal for voice interaction and recognition
WO2018027506A1 (en) * 2016-08-09 2018-02-15 曹鸿鹏 Emotion recognition-based lighting control method
CN110720124B (en) * 2017-05-31 2023-08-11 国际商业机器公司 Monitoring the use of patient language to identify potential speech and related neurological disorders
CN110720124A (en) * 2017-05-31 2020-01-21 国际商业机器公司 Monitoring patient speech usage to identify potential speech and associated neurological disorders
CN109254669B (en) * 2017-07-12 2022-05-10 腾讯科技(深圳)有限公司 Expression picture input method and device, electronic equipment and system
CN109254669A (en) * 2017-07-12 2019-01-22 腾讯科技(深圳)有限公司 A kind of expression picture input method, device, electronic equipment and system
WO2019037382A1 (en) * 2017-08-24 2019-02-28 平安科技(深圳)有限公司 Emotion recognition-based voice quality inspection method and device, equipment and storage medium
CN107705807B (en) * 2017-08-24 2019-08-27 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification
CN107705807A (en) * 2017-08-24 2018-02-16 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification
CN108577866A (en) * 2018-04-03 2018-09-28 中国地质大学(武汉) A kind of system and method for multidimensional emotion recognition and alleviation
CN111739558A (en) * 2019-03-21 2020-10-02 杭州海康威视数字技术股份有限公司 Monitoring system, method, device, server and storage medium
CN111739558B (en) * 2019-03-21 2023-03-28 杭州海康威视数字技术股份有限公司 Monitoring system, method, device, server and storage medium
CN111950275A (en) * 2020-08-06 2020-11-17 平安科技(深圳)有限公司 Emotion recognition method and device based on recurrent neural network and storage medium
CN113707184A (en) * 2021-08-30 2021-11-26 北京金山云网络技术有限公司 Method and device for determining emotional characteristics, electronic equipment and storage medium
CN115641837A (en) * 2022-12-22 2023-01-24 北京资采信息技术有限公司 Intelligent robot conversation intention recognition method and system


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20140910)