CN110265063B - Lie detection method based on fixed duration speech emotion recognition sequence analysis - Google Patents
- Publication number
- CN110265063B (grant publication; application CN201910659657.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- corpus
- emotion
- lie detection
- Prior art date
- Legal status: Active
Classifications
- A61B5/164 — Devices for evaluating the psychological state; Lie detection
- A61B5/4803 — Speech analysis specially adapted for diagnostic purposes
- A61B5/7246 — Details of waveform analysis using correlation, e.g. template matching or determination of similarity
- G10L25/21 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
- G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
- G10L25/63 — Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state
Abstract
The invention discloses a lie detection method based on fixed-duration speech emotion recognition sequence analysis, which mainly comprises the following steps: first, the recorded lie detection corpus is processed into two classes of equal-length short-term corpora to facilitate subsequent experiments; the speech is then preprocessed by pre-emphasis, framing and windowing; speech emotion features are extracted from the time-frequency characteristics of the speech, including pitch frequency, MFCC, formants, short-time energy, short-time average zero-crossing rate and their statistical features; a decision tree is used to select features and form the final feature vector; an SVM is trained on the corpus, the test speech is predicted, and a speech emotion result is output at fixed intervals; finally, the timed speech emotion results are used for lie detection analysis. The invention selects features with a decision tree, obtaining higher accuracy, and outputs the result as a vector, fully considering how emotion changes during lying.
Description
Technical Field
The invention belongs to the field of non-contact lie detection, and particularly relates to a lie detection method based on fixed-duration speech emotion recognition sequence analysis.
Background
Speech is the most direct and convenient way for people to communicate. Speech-based lie detection is non-contact: the recording equipment is simple, no complex apparatus is needed, preparation time is short, and the subject is under little psychological pressure, which improves the accuracy of the analysis. This is of great interest for the studies performed herein. Speech carries a great deal of information about the speaker, such as identity, gender, age and even personality. Early studies showed that speech also conveys the speaker's emotional state, implying many reliable speech features related to specific emotions. When people are nervous or afraid, the fundamental frequency and speech rate rise; when people are in a panic, they fall. Lying is a complex psychophysiological process, and the accompanying speech shows obvious emotional changes, so a great deal of psychological and emotional information can be obtained from acoustic features (fundamental frequency, speech duration, formant frequencies, and so on). Research on lie detection using speech features as cues started relatively late; most work has focused on the influence of individual acoustic features on speech lie detection, but so far no single feature can be used for lie detection effectively on its own.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a lie detection method based on fixed-duration speech emotion recognition sequence analysis: speech emotion features are extracted from the time-frequency characteristics of the speech; feature selection based on a decision tree forms a final 14-dimensional feature vector; an SVM is then used to train and predict on a self-built Chinese lie detection corpus; a speech emotion sequence is output according to speech duration; and an HMM model is used to study the relation between this sequence and lie detection.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: a lie detection method based on fixed-duration speech emotion recognition sequence analysis, characterized by comprising the following steps:
step 1: establishing a Chinese lie detection corpus and processing the corpus;
step 2: preprocessing the voice;
and step 3: extracting voice emotional characteristics according to the voice time-frequency characteristics;
and 4, step 4: completing feature selection based on a decision tree method to form a feature vector;
and 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result in a fixed time;
step 6: and outputting the voice emotion result at fixed time to perform lie detection analysis.
As a modification of the present invention, the step 1: establishing a Chinese lie detection corpus and processing the corpus; the method comprises the following specific steps:
step 1.1: establishing a Chinese lie detection corpus;
step 1.2: extracting true speech segments in the lie speech corpus;
step 1.3: real words and lie words are divided into equal-duration corpus sections, and labels are pasted, so that subsequent experiments are facilitated.
As a modification of the present invention, the step 2: preprocessing the voice, specifically as follows:
step 2.1: for discretizing the voice signal, pre-emphasis is carried out by using a first-order high-pass filter, wherein the expression of the first-order high-pass filter is as follows:
H(z)=1-αz-1,0.9<α<1.0
step 2.2: framing the signal, wherein the frame length is 30ms, and the frame shift is 10 ms;
step 2.3: selecting a Hamming window function, wherein the calculation formula is as follows:
as an improvement of the present invention, step 3: the speech emotion characteristics are extracted, specifically as follows,
step 3.1: extracting short-time energy, wherein the short-time energy refers to the energy of a frame of voice, and setting a voice signal as x (n) and an i frame of voice signal after framing processing of a windowing function omega (n) as yi(n) then yi(n) satisfies:
yi(n)=ω(n)*x((i-1)*inc+n),1≤n≤L,1≤i≤fn
ω (n) is a window function; y isi(n) is a frame number; inc is the frame shift length; fn is the total number of frames after the framing, the short-time energy of the voice signal of the ith frame is
Step 3.2: a short-term average zero-crossing rate is extracted, which represents the number of times the waveform of the signal in a frame of speech crosses a zero level. For discrete signals, if adjacent data changes a symbol once and does a zero crossing once, the speech signal is set as x (n), and the i-th frame speech signal after framing is yi(n) a short-time average zero-crossing rate of
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)]
Then the sequence
BalanceFor cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,the actual unit of (a) is time s.
Speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n)
The three quantities have a cepstrum
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived fromThe glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
step 3.4: formants refer to regions with relatively concentrated energy in the frequency spectrum of sound, and are extracted by LPC method, and one frame signal x (n) of speech signal can be expressed by difference equation
The corresponding vocal tract transfer function is
Taking the power spectrum modulus value, and expressing by P (f)
P(f)=|H(f)|2
Wherein
z-1=e-jωT
The FFT is utilized to obtain the amplitude response of the power spectrum of any frequency, and the information of the formant is found from the amplitude response;
step 3.5: extracting MFCC parameters, preprocessing, converting the original signal x (n) into xi(m), i represents the ith frame, and FFT is performed on the signal
X(i,k)=FFT[xi(m)]
Performing Mel filter bank processing, and calculating spectral line energy for each frame of FFT data
E(i,k)=X(i,k)2
Calculating the energy in the Mel Filter Bank (triangular filters)
Logarithm of energy is obtained, and the MFCC parameters are obtained through Discrete Cosine Transform (DCT);
as a modification of the present invention, the step 4: completing feature selection based on a decision tree method to form a feature vector, and specifically inputting the following steps: training a data set D, a feature set A and a threshold value e; and (3) outputting: a decision tree T;
step 4.1: if all instances in D belong to the same class CkIf T is a single junction tree, and C is setkReturning T as the class of the node;
step 4.2: if A is an empty set, T is set as a single node tree, and the class C with the largest number of instances in D is setkReturning T as the class of the node;
step 4.3: otherwise, calculating the information gain ratio of each feature in A to D, and selecting the feature A with the maximum information gain ratiog;
Step 4.4: if A isgIf the information gain ratio of (D) is less than the threshold e, T is set as a single node tree, and the class C with the largest number of instances in D is set askReturning T as the class of the node;
step 4.5: otherwise, for AgEach possible value a ofiIn ag=aiDividing D into several non-empty subsets DiD isiThe class with the maximum number of the middle instances is used as a mark, a sub-node is constructed, a tree T is formed by the nodes and the sub-nodes, and the T is returned;
step 4.6: for node i, with DiFor training set, take A- { AgRecursively calling the above steps to obtain a subtree Ti。
As a modification of the present invention, the step 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result at regular time,
step 5.1: completing a lie detection experiment by using an SVM (support vector machine) based on the self-built lie detection corpus to obtain a speech lie detection result;
step 5.2: based on the CASIA standard emotion library, a speech emotion prediction result is output at regular time, the result is a sequence with dimension being related to the duration of speech, a speech emotion recognition result is output every second, and then a 60-dimensional vector is obtained for 60s of speech.
As a modification of the present invention, the step 6: the lie detection analysis is performed by outputting the speech emotion result at regular time, specifically as follows,
step 6.1: dividing 80% of corpus samples into training samples and 20% of corpus samples into testing samples;
step 6.2: classifying the training samples according to labels, wherein the true speech part is 1, and the lie speech part is-1;
step 6.3: operating the corpus of the training sample according to the steps 1-5 to generate a voice emotion sequence;
step 6.4: describing an HMM model by using five elements, namely a hidden state S (lie condition), an observable state O (speech emotion sequence), a hidden state transition probability matrix A, an observed state transition matrix B and an initial state probability matrix pi;
step 6.5: training the HMM model using the training samples. Selecting several groups of observation sequences O (speech emotion sequences) from training samples, and solving parameters in a model lambda (A, B, Π) by using a Baum-Welch method; the specific method comprises the steps of initializing model parameters A, B and pi randomly, using a sample O to calculate and search more appropriate parameters, updating the parameters, and fitting the parameters by using the sample until the parameters are converged;
step 6.6: obtaining an HMM model λ ═ (a, B, Π), and predicting using the test sample itself; and selecting a voice emotion sequence from the voice emotion sequences, and solving the back state sequence by using a Viterbi method to obtain a predicted lie detection result.
Compared with the prior art, the invention has the following beneficial effects:
1. Speech emotion features are numerous, covering acoustic, prosodic and spectral characteristics, but not every one of them benefits lie detection; adding ineffective features only wastes training time and may reduce the recognition rate. The invention therefore selects features with a decision tree, obtaining higher accuracy.
2. Considering the influence of speech emotion on lie detection, a duration-related speech emotion sequence is generated, and an HMM (hidden Markov model) is used to study the relation between the speech emotion sequence and lie detection.
Drawings
FIG. 1 is a flow chart of a lie detection method based on fixed duration speech emotion recognition sequence analysis according to the present invention;
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1: as shown in fig. 1, the invention provides a lie detection method based on fixed duration speech emotion recognition sequence analysis, which comprises the following detailed steps:
step 1: establishing a Chinese lie detection corpus and processing the corpus;
step 1.1: establishing a Chinese lie detection corpus;
step 1.2: extracting true speech segments in the lie speech corpus;
step 1.3: dividing the real words and the lie words into equal-duration corpus sections, and sticking labels to facilitate subsequent experiments;
step 2: preprocessing the voice;
step 2.1: for discretizing the voice signal, pre-emphasis is carried out by using a first-order high-pass filter, wherein the expression of the first-order high-pass filter is as follows:
H(z)=1-αz-1,0.9<α<1.0;
step 2.2: framing the signal, wherein the frame length is 30ms, and the frame shift is 10 ms;
step 2.3: selecting a Hamming window function, wherein the calculation formula is as follows:
and step 3: extracting speech emotion characteristics;
step 3.1: extracting short-time energy, wherein the short-time energy refers to the energy of a frame of voice, and setting a voice signal as x (n) and an i frame of voice signal after framing processing of a windowing function omega (n) as yi(n) then yi(n) satisfies:
yi(n)=ω(n)*x((i-1)*inc+n),1≤n≤L,1≤i≤fn,
ω (n) is a window function; y isi(n) is a frame number; inc is the frame shift length; fn is the total number of frames after the framing, the short-time energy of the voice signal of the ith frame is
Step 3.2: a short-term average zero-crossing rate is extracted, which represents the number of times the waveform of the signal in a frame of speech crosses a zero level. For discrete signals, if adjacent data changes a symbol once and does a zero crossing once, the speech signal is set as x (n), and the i-th frame speech signal after framing is yi(n) a short-time average zero-crossing rate of
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)];
Then the sequence
BalanceFor cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,the actual unit of (a) is time s.
Speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n);
The three quantities have a cepstrum
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived fromThe glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
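A minimal sketch of cepstral pitch extraction as in step 3.3; the sampling rate, the 60-500 Hz search band and the synthetic voiced frame are illustrative assumptions:

```python
import numpy as np

def pitch_by_cepstrum(frame, fs, f_lo=60.0, f_hi=500.0):
    """Cepstrum c(n) = FFT^(-1)[log|FFT(x)|]; the quefrency of the cepstral
    peak inside the plausible pitch-period band is the pitch period."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.fft(windowed))
    cep = np.real(np.fft.ifft(np.log(spectrum + 1e-12)))
    q_lo = int(fs / f_hi)                  # shortest plausible period (samples)
    q_hi = int(fs / f_lo)                  # longest plausible period (samples)
    period = q_lo + int(np.argmax(cep[q_lo:q_hi]))
    return fs / period

fs = 8000
n = np.arange(int(0.03 * fs))              # one 30 ms frame
# synthetic voiced frame: 200 Hz fundamental plus two weaker harmonics
frame = sum(np.sin(2 * np.pi * 200 * k * n / fs) / k for k in (1, 2, 3))
f0 = pitch_by_cepstrum(frame, fs)          # close to 200 Hz
```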
step 3.4: formants refer to regions with relatively concentrated energy in the frequency spectrum of sound, and are extracted by LPC method, and one frame signal x (n) of speech signal can be expressed by difference equation
The corresponding vocal tract transfer function is
Taking the power spectrum modulus value, and expressing by P (f)
P(f)=|H(f)|2;
Wherein
z-1=e-jωT;
The FFT is utilized to obtain the amplitude response of the power spectrum of any frequency, and the information of the formant is found from the amplitude response;
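Formant estimation via LPC as in step 3.4 might be sketched like this; the autocorrelation-method solver and the synthetic single-resonance test signal are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def lpc_coeffs(frame, order):
    """Autocorrelation method: solve the LPC normal equations R a = r for
    the all-pole vocal-tract model H(z) = 1 / (1 - sum_i a_i z^(-i))."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def formant_freqs(frame, fs, order):
    """Formants correspond to the complex roots of A(z) = 1 - sum_i a_i z^(-i)
    near the unit circle; their angles convert to frequencies in Hz."""
    a = lpc_coeffs(frame, order)
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0.0]    # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

# synthetic frame: impulse response of a single resonance at 700 Hz
fs, f_res, r = 8000, 700.0, 0.95
theta = 2.0 * np.pi * f_res / fs
x = np.zeros(240)
x[0] = 1.0
for n in range(1, 240):
    x[n] = 2.0 * r * np.cos(theta) * x[n - 1] - r * r * (x[n - 2] if n >= 2 else 0.0)
freqs = formant_freqs(x, fs, order=2)      # one resonance near 700 Hz
```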
step 3.5: extracting MFCC parameters, preprocessing, converting the original signal x (n) into xi(m), i represents the ith frame, the opposite channelNumber is FFT
X(i,k)=FFT[xi(m)];
Performing Mel filter bank processing, and calculating spectral line energy for each frame of FFT data
E(i,k)=X(i,k)2;
Calculating the energy in the Mel Filter Bank (triangular filters)
Logarithm of energy is obtained, and the MFCC parameters are obtained through Discrete Cosine Transform (DCT);
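Step 3.5 can be sketched as follows; the filter-bank size, FFT length and number of cepstral coefficients are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters H_m(k) spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):                    # rising slope
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                    # falling slope
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mfcc(frame, fs, n_filters=24, n_ceps=12, n_fft=256):
    """Line energy E(i,k) = |X(i,k)|^2, mel filter-bank energies, log, DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    log_mel = np.log(mel_filterbank(n_filters, n_fft, fs) @ spec + 1e-12)
    q, m = np.arange(n_ceps), np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(q, m + 0.5) / n_filters)  # DCT-II basis
    return dct @ log_mel

fs = 8000
frame = np.sin(2 * np.pi * 300 * np.arange(240) / fs)  # illustrative frame
coeffs = mfcc(frame, fs)                               # 12 MFCC coefficients
```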
and 4, step 4: complete feature selection based on a decision tree method to form a feature vector. Input: training data set D, feature set A and threshold e; output: decision tree T;
step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; set C_k as the class of that node and return T;
step 4.2: if A is an empty set, set T as a single-node tree, set the class C_k with the largest number of instances in D as the class of that node, and return T;
step 4.3: otherwise, calculate the information gain ratio of each feature in A with respect to D, and select the feature A_g with the maximum information gain ratio;
step 4.4: if the information gain ratio of A_g is less than the threshold e, set T as a single-node tree, set the class C_k with the largest number of instances in D as the class of that node, and return T;
step 4.5: otherwise, for each possible value a_i of A_g, split D by A_g = a_i into several non-empty subsets D_i; take the class with the largest number of instances in D_i as the label, construct child nodes, form the tree T from the node and its child nodes, and return T;
step 4.6: for child node i, with D_i as the training set and A - {A_g} as the feature set, call the above steps recursively to obtain the subtree T_i;
And 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result in a fixed time;
step 5.1: completing a lie detection experiment by using an SVM (support vector machine) based on the self-built lie detection corpus to obtain a speech lie detection result;
step 5.2: based on a CASIA standard emotion library, a speech emotion prediction result is output at regular time, the result is a sequence with dimension being related to speech duration, a speech emotion recognition result is output every second, and then a 60-dimensional vector is obtained for a 60s speech;
step 6: carrying out lie detection analysis by using the timed output voice emotion result;
step 6.1: dividing 80% of corpus samples into training samples and 20% of corpus samples into testing samples;
step 6.2: classifying the training samples according to labels, wherein the true speech part is 1, and the lie speech part is-1;
step 6.3: operating the corpus of the training sample according to the steps 1-5 to generate a voice emotion sequence;
step 6.4: describing an HMM model by using five elements, namely a hidden state S (lie condition), an observable state O (speech emotion sequence), a hidden state transition probability matrix A, an observed state transition matrix B and an initial state probability matrix pi;
step 6.5: training the HMM model using the training samples. Selecting several groups of observation sequences O (speech emotion sequences) from training samples, and solving parameters in a model lambda (A, B, Π) by using a Baum-Welch method; the specific method comprises the steps of initializing model parameters A, B and pi randomly, using a sample O to calculate and search more appropriate parameters, updating the parameters, and fitting the parameters by using the sample until the parameters are converged;
step 6.6: obtaining an HMM model λ ═ (a, B, Π), and predicting using the test sample itself; and selecting a voice emotion sequence from the voice emotion sequences, and solving the back state sequence by using a Viterbi method to obtain a predicted lie detection result.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; all equivalent modifications or substitutions made to the above technical solutions belong to the scope of the present invention.
Claims (3)
1. A lie detection method based on fixed-duration speech emotion recognition sequence analysis, characterized by comprising the following steps:
step 1: establishing a Chinese lie detection corpus and processing the corpus;
step 2: preprocessing the voice;
and step 3: extracting voice emotional characteristics according to the voice time-frequency characteristics;
and 4, step 4: completing feature selection based on a decision tree method to form a feature vector;
and 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result in a fixed time;
step 6: carrying out lie detection analysis by using the timed output voice emotion result;
the step 1: establishing a Chinese lie detection corpus and processing the corpus; the method comprises the following specific steps:
step 1.1: establishing a Chinese lie detection corpus;
step 1.2: extracting true speech segments in the lie speech corpus;
step 1.3: dividing the real words and the lie words into equal-duration corpus sections, and sticking labels to facilitate subsequent experiments;
the step 2: preprocessing the voice, specifically as follows:
step 2.1: for discretizing the voice signal, pre-emphasis is carried out by using a first-order high-pass filter, wherein the expression of the first-order high-pass filter is as follows:
H(z)=1-αz-1,0.9<α<1.0
step 2.2: framing the signal, wherein the frame length is 30ms, and the frame shift is 10 ms;
step 2.3: selecting a Hamming window function, wherein the calculation formula is as follows:
and step 3: the speech emotion features are extracted, specifically as follows,
step 3.1: extract the short-time energy, i.e. the energy of one frame of speech. Let the speech signal be x(n), and let the i-th frame after framing with window function ω(n) be y_i(n); then y_i(n) satisfies:
y_i(n) = ω(n)·x((i-1)·inc + n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn
where ω(n) is the window function, y_i(n) is the i-th frame signal, inc is the frame shift length, L is the frame length, and fn is the total number of frames after framing. The short-time energy of the i-th frame is
E(i) = Σ y_i(n)^2 (sum over n = 1 to L)
Step 3.2: extracting a short-time average zero crossing rate which represents the number of times that the waveform of a signal in a frame of speech passes through a zero level; for discrete signals, if adjacent data changes a symbol once and does a zero crossing once, the speech signal is set as x (n), and the i-th frame speech signal after framing is yi(n) a short-time average zero-crossing rate of
step 3.3: extract the pitch frequency. The pitch period is the duration of one open-close cycle of the vocal cords, and the pitch frequency is its reciprocal. For the signal sequence x(n), its Fourier transform is
X(ω) = FFT[x(n)]
Then the sequence
ĉ(n) = FFT⁻¹[log|X(ω)|]
is called the cepstrum. Here FFT and FFT⁻¹ denote the Fourier transform and the inverse Fourier transform respectively; the physical unit of ĉ(n) is time (s).
Speech x(n) is produced by the glottal pulse excitation u(n) filtered by the vocal tract response v(n), i.e.
x(n) = u(n) * v(n)
and the cepstra of the three quantities satisfy
ĉ_x(n) = ĉ_u(n) + ĉ_v(n)
In the cepstral domain the glottal pulse excitation and the vocal tract response are relatively separated, so the glottal pulse excitation can be separated and recovered from ĉ_x(n), from which the pitch period is obtained;
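A sketch of cepstral pitch estimation along these lines: the excitation appears as a peak at quefrency fs/F0, well above the low-quefrency vocal-tract region. The 8 kHz rate, 512-sample frame, synthetic 250 Hz harmonic signal, and 150-400 Hz search range are illustrative assumptions:

```python
import numpy as np

def cepstral_pitch(frame, fs, f_min=150.0, f_max=400.0):
    """Estimate pitch from the real cepstrum c(n) = IFFT(log|FFT(x)|)
    by finding the peak in the expected pitch-quefrency range."""
    spectrum = np.abs(np.fft.fft(frame))
    cep = np.real(np.fft.ifft(np.log(spectrum + 1e-12)))
    q_lo, q_hi = int(fs / f_max), int(fs / f_min)
    q_peak = q_lo + np.argmax(cep[q_lo:q_hi])
    return fs / q_peak

# Synthetic voiced frame: 250 Hz fundamental with 12 harmonics at fs = 8 kHz.
fs, f0, n = 8000, 250.0, np.arange(512)
frame = sum(np.cos(2 * np.pi * k * f0 * n / fs) for k in range(1, 13))
f0_est = cepstral_pitch(frame, fs)
```

The cepstral peak lands at quefrency fs/F0 = 32 samples, recovering the 250 Hz fundamental.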
step 3.4: formants are the regions of relatively concentrated energy in the sound spectrum; they are extracted by the LPC method. One frame x(n) of the speech signal can be expressed by the difference equation
x(n) = Σ_{k=1}^{p} a_k·x(n-k) + e(n)
The corresponding vocal tract transfer function is
H(z) = 1 / (1 - Σ_{k=1}^{p} a_k·z^(-k))
Taking the modulus of the power spectrum, denoted P(f):
P(f) = |H(f)|²
where
z^(-1) = e^(-jωT)
The amplitude response of the power spectrum at any frequency is obtained with the FFT, and the formant information is found from this amplitude response;
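A sketch of autocorrelation-method LPC formant extraction as described above. Rather than scanning the FFT amplitude response, this variant reads the formant directly from the pole angles of A(z), an equivalent and common shortcut; the one-resonance test signal (pole pair at 1000 Hz, radius 0.95, fs = 8 kHz) is an illustrative assumption:

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: solve the Yule-Walker equations
    R a = r for the predictor a in x(n) ~ sum_k a_k x(n-k)."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def formant_frequencies(a, fs):
    """Roots of A(z) = 1 - sum_k a_k z^(-k); pole angles map to formants."""
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]    # keep one root of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs)

# Synthetic one-resonance "vocal tract": pole pair at 1000 Hz, radius 0.95.
fs, f_res, radius = 8000.0, 1000.0, 0.95
theta = 2 * np.pi * f_res / fs
a_true = np.array([2 * radius * np.cos(theta), -radius ** 2])
x = np.zeros(512)
x[0] = 1.0                               # impulse excitation
x[1] = a_true[0] * x[0]
for n in range(2, 512):                  # all-pole filter recursion
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2]

formants = formant_frequencies(lpc_coefficients(x, order=2), fs)
```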
step 3.5: extract the MFCC parameters. After preprocessing, the original signal x(n) becomes x_i(m), where i denotes the i-th frame. Perform an FFT on each frame:
X(i,k) = FFT[x_i(m)]
Then apply Mel filter bank processing: compute the spectral line energy of each frame of FFT data
E(i,k) = |X(i,k)|²
and compute the energy in the Mel filter bank, i.e. pass E(i,k) through a bank of triangular filters. Finally, take the logarithm of the filter bank energies and apply a Discrete Cosine Transform (DCT) to obtain the MFCC parameters;
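A compact sketch of the MFCC pipeline of step 3.5 (power spectrum, triangular mel filter bank, log, DCT). The filter and coefficient counts (26 and 13) and the 8 kHz test tone are conventional illustrative choices, not values from the claim:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs, f_max=None):
    """Triangular filters equally spaced on the mel scale
    mel(f) = 2595 * log10(1 + f/700)."""
    f_max = f_max or fs / 2
    mel_points = np.linspace(0.0, 2595.0 * np.log10(1.0 + f_max / 700.0),
                             n_filters + 2)
    hz = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """E(i,k) = |X(i,k)|^2 -> mel filter bank -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energy = mel_filterbank(n_filters, n_fft, fs) @ power
    log_energy = np.log(mel_energy + 1e-12)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * m + 1)
                 / (2 * n_filters))
    return dct @ log_energy

fs = 8000
frame = np.sin(2 * np.pi * 440.0 * np.arange(512) / fs)
coeffs = mfcc_frame(frame, fs)
```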
step 4: complete feature selection based on a decision tree method to form the feature vector. Input: training data set D, feature set A, threshold ε. Output: decision tree T;
step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; take C_k as the class of that node and return T;
step 4.2: if A is an empty set, T is a single-node tree; take the class C_k with the largest number of instances in D as the class of that node and return T;
step 4.3: otherwise, compute the information gain ratio of each feature in A with respect to D, and select the feature A_g with the maximum information gain ratio;
step 4.4: if the information gain ratio of A_g is less than the threshold ε, T is a single-node tree; take the class C_k with the largest number of instances in D as the class of that node and return T;
step 4.5: otherwise, for each possible value a_i of A_g, split D by A_g = a_i into several non-empty subsets D_i; take the class with the largest number of instances in D_i as the mark, construct child nodes, form the tree T from the node and its child nodes, and return T;
step 4.6: for the i-th child node, with D_i as the training set and A - {A_g} as the feature set, recursively call the above steps to obtain the subtree T_i.
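The selection criterion of step 4.3, the information gain ratio of C4.5, can be sketched on a toy dataset (the data values here are illustrative, not from the corpus):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain of the feature divided by its split information,
    the criterion used in step 4.3."""
    n = len(labels)
    subsets = {}
    for f, y in zip(feature, labels):
        subsets.setdefault(f, []).append(y)
    cond = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0

labels    = [1, 1, -1, -1]
feature_a = [0, 0, 1, 1]    # perfectly predicts the label
feature_b = [0, 1, 0, 1]    # carries no information about the label
best = max([feature_a, feature_b], key=lambda f: gain_ratio(f, labels))
```

The perfectly predictive feature gets gain ratio 1.0 and is selected; the uninformative one gets 0.0.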
2. The lie detection method based on fixed-duration speech emotion recognition sequence analysis according to claim 1, characterized in that:
step 5: train on the corpus with an SVM, predict the test speech, and output the speech emotion result at fixed time intervals, specifically as follows:
step 5.1: based on the self-built lie detection corpus, complete the lie detection experiment with an SVM (support vector machine) to obtain the speech lie detection result;
step 5.2: based on the CASIA standard emotion corpus, output a speech emotion prediction result at fixed intervals; the result is a sequence whose dimension depends on the speech duration: one speech emotion recognition result is output per second, so 60 s of speech yields a 60-dimensional vector.
3. The lie detection method based on fixed-duration speech emotion recognition sequence analysis according to claim 1, characterized in that:
step 6: perform lie detection analysis on the speech emotion results output at fixed intervals, specifically as follows:
step 6.1: divide the corpus samples into 80% training samples and 20% test samples;
step 6.2: classify the training samples by label, with the true speech part labeled 1 and the lie part labeled -1;
step 6.3: process the training-sample corpus according to steps 1-5 to generate the speech emotion sequences;
step 6.4: describe the HMM with five elements: the hidden state S is the deception state, the observable sequence O is the speech emotion sequence, plus the hidden state transition probability matrix A, the observation probability matrix B, and the initial state probability vector π;
step 6.5: train the HMM with the training samples, i.e. select several groups of observable sequences O (speech emotion sequences) from the training samples and solve for the parameters of the model λ = (A, B, π) with the Baum-Welch method. Specifically, randomly initialize the model parameters A, B and π, use the samples O to compute better parameter estimates, update the parameters, and keep fitting them with the samples until the parameters converge;
step 6.6: obtain the HMM λ = (A, B, π) and predict with the test samples: select a speech emotion sequence from them and solve for the optimal hidden state sequence with the Viterbi method to obtain the predicted lie detection result.
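The Viterbi decoding of step 6.6 can be sketched as follows. The 2-state model (0 = truth, 1 = lie) and all probability values are hypothetical illustrations, not the trained parameters of the patent:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state path for HMM lambda = (A, B, pi),
    given a discrete observation sequence."""
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))           # best path probability so far
    psi = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):             # trace the backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Illustrative 2-state model (0 = truth, 1 = lie) with hypothetical numbers;
# observations index discretized emotion labels.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 0, 1], pi, A, B)
```

For this observation sequence the decoded state path is [0, 0, 1]: two truthful segments followed by a deceptive one.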
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910659657.1A CN110265063B (en) | 2019-07-22 | 2019-07-22 | Lie detection method based on fixed duration speech emotion recognition sequence analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110265063A CN110265063A (en) | 2019-09-20 |
CN110265063B true CN110265063B (en) | 2021-09-24 |
Family
ID=67927523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910659657.1A Active CN110265063B (en) | 2019-07-22 | 2019-07-22 | Lie detection method based on fixed duration speech emotion recognition sequence analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265063B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110969106B (en) * | 2019-11-25 | 2023-04-18 | 东南大学 | Multi-mode lie detection method based on expression, voice and eye movement characteristics |
CN112006697B (en) * | 2020-06-02 | 2022-11-01 | 东南大学 | Voice signal-based gradient lifting decision tree depression degree recognition system |
CN112885370B (en) * | 2021-01-11 | 2024-05-31 | 广州欢城文化传媒有限公司 | Sound card validity detection method and device |
CN113163155B (en) * | 2021-04-30 | 2023-09-05 | 咪咕视讯科技有限公司 | User head portrait generation method and device, electronic equipment and storage medium |
CN115662447B (en) * | 2022-09-22 | 2023-04-07 | 北京邮电大学 | Lie detection analysis method and device based on multi-feature fusion |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1282445A (en) * | 1997-12-16 | 2001-01-31 | 阿维·卡梅尔 | Apparatus and methods for detecting emotions |
CN102890930A (en) * | 2011-07-19 | 2013-01-23 | 上海上大海润信息系统有限公司 | Speech emotion recognizing method based on hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model |
WO2015168606A1 (en) * | 2014-05-02 | 2015-11-05 | The Regents Of The University Of Michigan | Mood monitoring of bipolar disorder using speech analysis |
CN107705357A (en) * | 2017-09-11 | 2018-02-16 | 广东欧珀移动通信有限公司 | Lie detecting method and device |
CN108175426A (en) * | 2017-12-11 | 2018-06-19 | 东南大学 | A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition |
CN109493886A (en) * | 2018-12-13 | 2019-03-19 | 西安电子科技大学 | Speech-emotion recognition method based on feature selecting and optimization |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8972266B2 (en) * | 2002-11-12 | 2015-03-03 | David Bezar | User intent analysis extent of speaker intent analysis system |
Non-Patent Citations (3)
Title |
---|
《Detecting Deceptive Behavior via Integration of Discriminative Features From Multiple Modalities》;Mohamed Abouelenien et al.;《IEEE Transactions on Information Forensics and Security》;20170531;Vol. 12;full text *
《Application and Basic Research of Speech Emotion Recognition》;Lin Han, Zhang Kan;《Chinese Journal of Ergonomics》;20090630;Vol. 15, No. 2;pp. 64-66 *
《Research Status and Prospects of Speech Lie Detection Technology》;Zhao Li, Liang Ruining et al.;《Journal of Data Acquisition and Processing》;20170228;Vol. 3, No. 2;full text *
Also Published As
Publication number | Publication date |
---|---|
CN110265063A (en) | 2019-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110265063B (en) | Lie detection method based on fixed duration speech emotion recognition sequence analysis | |
CN106228977B (en) | Multi-mode fusion song emotion recognition method based on deep learning | |
CN103928023B (en) | A kind of speech assessment method and system | |
CN104900235B (en) | Method for recognizing sound-groove based on pitch period composite character parameter | |
CN102231278B (en) | Method and system for realizing automatic addition of punctuation marks in speech recognition | |
Sinith et al. | Emotion recognition from audio signals using Support Vector Machine | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
CN110827857B (en) | Speech emotion recognition method based on spectral features and ELM | |
CN101777347B (en) | Model complementary Chinese accent identification method and system | |
CN101226743A (en) | Method for recognizing speaker based on conversion of neutral and affection sound-groove model | |
CN111798874A (en) | Voice emotion recognition method and system | |
Yusnita et al. | Malaysian English accents identification using LPC and formant analysis | |
CN100543840C (en) | Method for distinguishing speek person based on emotion migration rule and voice correction | |
Ramteke et al. | Phoneme boundary detection from speech: A rule based approach | |
Coro et al. | Psycho-acoustics inspired automatic speech recognition | |
CN114842878A (en) | Speech emotion recognition method based on neural network | |
CN109346107B (en) | LSTM-based method for inversely solving pronunciation of independent speaker | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
CN110838294A (en) | Voice verification method and device, computer equipment and storage medium | |
Rao et al. | Glottal excitation feature based gender identification system using ergodic HMM | |
Lee et al. | Speech emotion recognition using spectral entropy | |
Dharini et al. | CD-HMM Modeling for raga identification | |
Lindgren | Speech recognition using features extracted from phase space reconstructions | |
Lugger et al. | Extracting voice quality contours using discrete hidden Markov models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||