CN110265063B - Lie detection method based on fixed duration speech emotion recognition sequence analysis - Google Patents
- Publication number
- CN110265063B (grant publication; application CN201910659657.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- corpus
- emotion
- lie detection
- Prior art date
- Legal status: Active
Classifications
- A61B5/164 — Devices for evaluating the psychological state; Lie detection
- A61B5/4803 — Speech analysis specially adapted for diagnostic purposes
- A61B5/7246 — Details of waveform analysis using correlation, e.g. template matching or determination of similarity
- G10L25/21 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
- G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
- G10L25/63 — Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state
Abstract
The invention discloses a lie detection method based on fixed-duration speech emotion recognition sequence analysis, which mainly comprises the following steps: first, the recorded lie detection corpus is processed into two classes of equal-length short-term corpora to facilitate subsequent experiments; the speech is then preprocessed by pre-emphasis, framing and windowing; speech emotion features are extracted from the time-frequency characteristics of the speech, including pitch frequency, MFCC, formants, short-time energy, short-time average zero-crossing rate and their statistical features; a decision tree is used to select features and form the final feature vector; an SVM is trained on the corpus, the test speech is predicted, and a speech emotion result is output at fixed intervals; finally, the timed speech emotion results are used for lie detection analysis. The invention selects features with a decision tree, obtaining higher accuracy, and outputs the result as a vector, fully considering how emotion changes during lying.
Description
Technical Field
The invention belongs to the field of non-contact lie detection, and particularly relates to a lie detection method based on fixed-duration speech emotion recognition sequence analysis.
Background
Speech is the most direct and convenient way for people to communicate. Speech-based lie detection is non-contact: the recording equipment is simple, no complex apparatus is needed, preparation time is short, and the subject is under little psychological pressure, which improves the accuracy of the analysis. This is of great interest for the studies performed herein. Speech carries a great deal of information about the speaker, such as identity, gender, age and even personality. Early studies showed that speech also conveys the speaker's emotional state, implying many reliable speech features related to specific emotions. When people are nervous or afraid, the fundamental frequency and speech rate rise; when people are in a panic, they fall. Lying is a complex psychophysiological process, and the accompanying speech shows obvious emotional changes, so a great deal of psychological and emotional information can be obtained from acoustic features (fundamental frequency, speech duration, formant frequencies, and so on). Research on lie detection using speech features as cues started relatively late; most work has focused on the influence of individual acoustic features on speech lie detection, but so far no single feature can be used for lie detection effectively on its own.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a lie detection method based on fixed-duration speech emotion recognition sequence analysis: speech emotion features are extracted from the time-frequency characteristics of the speech; feature selection based on a decision tree forms a final 14-dimensional feature vector; an SVM is then used to train and predict on a self-built Chinese lie detection corpus; a speech emotion sequence is output according to speech duration; and an HMM model is used to study the relation between this sequence and lie detection.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: a lie detection method based on fixed-duration speech emotion recognition sequence analysis, characterized by comprising the following steps:
step 1: establishing a Chinese lie detection corpus and processing the corpus;
step 2: preprocessing the voice;
and step 3: extracting voice emotional characteristics according to the voice time-frequency characteristics;
and 4, step 4: completing feature selection based on a decision tree method to form a feature vector;
and 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result in a fixed time;
step 6: and outputting the voice emotion result at fixed time to perform lie detection analysis.
As a modification of the present invention, the step 1: establishing a Chinese lie detection corpus and processing the corpus; the method comprises the following specific steps:
step 1.1: establishing a Chinese lie detection corpus;
step 1.2: extracting true speech segments in the lie speech corpus;
step 1.3: real words and lie words are divided into equal-duration corpus sections, and labels are pasted, so that subsequent experiments are facilitated.
As a modification of the present invention, the step 2: preprocessing the voice, specifically as follows:
step 2.1: for discretizing the voice signal, pre-emphasis is carried out by using a first-order high-pass filter, wherein the expression of the first-order high-pass filter is as follows:
H(z)=1-αz-1,0.9<α<1.0
step 2.2: framing the signal, wherein the frame length is 30ms, and the frame shift is 10 ms;
step 2.3: selecting a Hamming window function, wherein the calculation formula is as follows:
as an improvement of the present invention, step 3: the speech emotion characteristics are extracted, specifically as follows,
step 3.1: extracting short-time energy, wherein the short-time energy refers to the energy of a frame of voice, and setting a voice signal as x (n) and an i frame of voice signal after framing processing of a windowing function omega (n) as yi(n) then yi(n) satisfies:
yi(n)=ω(n)*x((i-1)*inc+n),1≤n≤L,1≤i≤fn
ω (n) is a window function; y isi(n) is a frame number; inc is the frame shift length; fn is the total number of frames after the framing, the short-time energy of the voice signal of the ith frame is
Step 3.2: a short-term average zero-crossing rate is extracted, which represents the number of times the waveform of the signal in a frame of speech crosses a zero level. For discrete signals, if adjacent data changes a symbol once and does a zero crossing once, the speech signal is set as x (n), and the i-th frame speech signal after framing is yi(n) a short-time average zero-crossing rate of
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)]
Then the sequence
BalanceFor cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,the actual unit of (a) is time s.
Speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n)
The three quantities have a cepstrum
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived fromThe glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
step 3.4: formants refer to regions with relatively concentrated energy in the frequency spectrum of sound, and are extracted by LPC method, and one frame signal x (n) of speech signal can be expressed by difference equation
The corresponding vocal tract transfer function is
Taking the power spectrum modulus value, and expressing by P (f)
P(f)=|H(f)|2
Wherein
z-1=e-jωT
The FFT is utilized to obtain the amplitude response of the power spectrum of any frequency, and the information of the formant is found from the amplitude response;
step 3.5: extracting MFCC parameters, preprocessing, converting the original signal x (n) into xi(m), i represents the ith frame, and FFT is performed on the signal
X(i,k)=FFT[xi(m)]
Performing Mel filter bank processing, and calculating spectral line energy for each frame of FFT data
E(i,k)=X(i,k)2
Calculating the energy in the Mel Filter Bank (triangular filters)
Logarithm of energy is obtained, and the MFCC parameters are obtained through Discrete Cosine Transform (DCT);
as a modification of the present invention, the step 4: completing feature selection based on a decision tree method to form a feature vector, and specifically inputting the following steps: training a data set D, a feature set A and a threshold value e; and (3) outputting: a decision tree T;
step 4.1: if all instances in D belong to the same class CkIf T is a single junction tree, and C is setkReturning T as the class of the node;
step 4.2: if A is an empty set, T is set as a single node tree, and the class C with the largest number of instances in D is setkReturning T as the class of the node;
step 4.3: otherwise, calculating the information gain ratio of each feature in A to D, and selecting the feature A with the maximum information gain ratiog;
Step 4.4: if A isgIf the information gain ratio of (D) is less than the threshold e, T is set as a single node tree, and the class C with the largest number of instances in D is set askReturning T as the class of the node;
step 4.5: otherwise, for AgEach possible value a ofiIn ag=aiDividing D into several non-empty subsets DiD isiThe class with the maximum number of the middle instances is used as a mark, a sub-node is constructed, a tree T is formed by the nodes and the sub-nodes, and the T is returned;
step 4.6: for node i, with DiFor training set, take A- { AgRecursively calling the above steps to obtain a subtree Ti。
As a modification of the present invention, the step 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result at regular time,
step 5.1: completing a lie detection experiment by using an SVM (support vector machine) based on the self-built lie detection corpus to obtain a speech lie detection result;
step 5.2: based on the CASIA standard emotion library, a speech emotion prediction result is output at regular time, the result is a sequence with dimension being related to the duration of speech, a speech emotion recognition result is output every second, and then a 60-dimensional vector is obtained for 60s of speech.
As a modification of the present invention, the step 6: the lie detection analysis is performed by outputting the speech emotion result at regular time, specifically as follows,
step 6.1: dividing 80% of corpus samples into training samples and 20% of corpus samples into testing samples;
step 6.2: classifying the training samples according to labels, wherein the true speech part is 1, and the lie speech part is-1;
step 6.3: operating the corpus of the training sample according to the steps 1-5 to generate a voice emotion sequence;
step 6.4: describing an HMM model by using five elements, namely a hidden state S (lie condition), an observable state O (speech emotion sequence), a hidden state transition probability matrix A, an observed state transition matrix B and an initial state probability matrix pi;
step 6.5: training the HMM model using the training samples. Selecting several groups of observation sequences O (speech emotion sequences) from training samples, and solving parameters in a model lambda (A, B, Π) by using a Baum-Welch method; the specific method comprises the steps of initializing model parameters A, B and pi randomly, using a sample O to calculate and search more appropriate parameters, updating the parameters, and fitting the parameters by using the sample until the parameters are converged;
step 6.6: obtaining an HMM model λ ═ (a, B, Π), and predicting using the test sample itself; and selecting a voice emotion sequence from the voice emotion sequences, and solving the back state sequence by using a Viterbi method to obtain a predicted lie detection result.
Compared with the prior art, the invention has the following beneficial effects:
1. Speech emotion features are numerous, covering acoustic, prosodic and spectral characteristics, but not every one of them benefits lie detection; adding ineffective features only wastes training time and may reduce the recognition rate. The invention therefore selects features with a decision tree, obtaining higher accuracy.
2. Considering the influence of speech emotion on lie detection, a duration-related speech emotion sequence is generated, and an HMM (hidden Markov model) is used to study the relation between the speech emotion sequence and lie detection.
Drawings
FIG. 1 is a flow chart of a lie detection method based on fixed duration speech emotion recognition sequence analysis according to the present invention;
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1: as shown in fig. 1, the invention provides a lie detection method based on fixed duration speech emotion recognition sequence analysis, which comprises the following detailed steps:
step 1: establishing a Chinese lie detection corpus and processing the corpus;
step 1.1: establishing a Chinese lie detection corpus;
step 1.2: extracting true speech segments in the lie speech corpus;
step 1.3: dividing the real words and the lie words into equal-duration corpus sections, and sticking labels to facilitate subsequent experiments;
step 2: preprocessing the voice;
step 2.1: for discretizing the voice signal, pre-emphasis is carried out by using a first-order high-pass filter, wherein the expression of the first-order high-pass filter is as follows:
H(z)=1-αz-1,0.9<α<1.0;
step 2.2: framing the signal, wherein the frame length is 30ms, and the frame shift is 10 ms;
step 2.3: selecting a Hamming window function, wherein the calculation formula is as follows:
and step 3: extracting speech emotion characteristics;
step 3.1: extracting short-time energy, wherein the short-time energy refers to the energy of a frame of voice, and setting a voice signal as x (n) and an i frame of voice signal after framing processing of a windowing function omega (n) as yi(n) then yi(n) satisfies:
yi(n)=ω(n)*x((i-1)*inc+n),1≤n≤L,1≤i≤fn,
ω (n) is a window function; y isi(n) is a frame number; inc is the frame shift length; fn is the total number of frames after the framing, the short-time energy of the voice signal of the ith frame is
Step 3.2: a short-term average zero-crossing rate is extracted, which represents the number of times the waveform of the signal in a frame of speech crosses a zero level. For discrete signals, if adjacent data changes a symbol once and does a zero crossing once, the speech signal is set as x (n), and the i-th frame speech signal after framing is yi(n) a short-time average zero-crossing rate of
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)];
Then the sequence
BalanceFor cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,the actual unit of (a) is time s.
Speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n);
The three quantities have a cepstrum
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived fromThe glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
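A minimal sketch of cepstral pitch extraction as in step 3.3; the sampling rate, the 60-500 Hz search band and the synthetic voiced frame are illustrative assumptions:

```python
import numpy as np

def pitch_by_cepstrum(frame, fs, f_lo=60.0, f_hi=500.0):
    """Cepstrum c(n) = FFT^(-1)[log|FFT(x)|]; the quefrency of the cepstral
    peak inside the plausible pitch-period band is the pitch period."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.fft(windowed))
    cep = np.real(np.fft.ifft(np.log(spectrum + 1e-12)))
    q_lo = int(fs / f_hi)                  # shortest plausible period (samples)
    q_hi = int(fs / f_lo)                  # longest plausible period (samples)
    period = q_lo + int(np.argmax(cep[q_lo:q_hi]))
    return fs / period

fs = 8000
n = np.arange(int(0.03 * fs))              # one 30 ms frame
# synthetic voiced frame: 200 Hz fundamental plus two weaker harmonics
frame = sum(np.sin(2 * np.pi * 200 * k * n / fs) / k for k in (1, 2, 3))
f0 = pitch_by_cepstrum(frame, fs)          # close to 200 Hz
```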
step 3.4: formants refer to regions with relatively concentrated energy in the frequency spectrum of sound, and are extracted by LPC method, and one frame signal x (n) of speech signal can be expressed by difference equation
The corresponding vocal tract transfer function is
Taking the power spectrum modulus value, and expressing by P (f)
P(f)=|H(f)|2;
Wherein
z-1=e-jωT;
The FFT is utilized to obtain the amplitude response of the power spectrum of any frequency, and the information of the formant is found from the amplitude response;
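Formant estimation via LPC as in step 3.4 might be sketched like this; the autocorrelation-method solver and the synthetic single-resonance test signal are illustrative assumptions, not the patent's exact procedure:

```python
import numpy as np

def lpc_coeffs(frame, order):
    """Autocorrelation method: solve the LPC normal equations R a = r for
    the all-pole vocal-tract model H(z) = 1 / (1 - sum_i a_i z^(-i))."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def formant_freqs(frame, fs, order):
    """Formants correspond to the complex roots of A(z) = 1 - sum_i a_i z^(-i)
    near the unit circle; their angles convert to frequencies in Hz."""
    a = lpc_coeffs(frame, order)
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0.0]    # keep one root per conjugate pair
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))

# synthetic frame: impulse response of a single resonance at 700 Hz
fs, f_res, r = 8000, 700.0, 0.95
theta = 2.0 * np.pi * f_res / fs
x = np.zeros(240)
x[0] = 1.0
for n in range(1, 240):
    x[n] = 2.0 * r * np.cos(theta) * x[n - 1] - r * r * (x[n - 2] if n >= 2 else 0.0)
freqs = formant_freqs(x, fs, order=2)      # one resonance near 700 Hz
```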
step 3.5: extracting MFCC parameters, preprocessing, converting the original signal x (n) into xi(m), i represents the ith frame, the opposite channelNumber is FFT
X(i,k)=FFT[xi(m)];
Performing Mel filter bank processing, and calculating spectral line energy for each frame of FFT data
E(i,k)=X(i,k)2;
Calculating the energy in the Mel Filter Bank (triangular filters)
Logarithm of energy is obtained, and the MFCC parameters are obtained through Discrete Cosine Transform (DCT);
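Step 3.5 can be sketched as follows; the filter-bank size, FFT length and number of cepstral coefficients are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters H_m(k) spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):                    # rising slope
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):                    # falling slope
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def mfcc(frame, fs, n_filters=24, n_ceps=12, n_fft=256):
    """Line energy E(i,k) = |X(i,k)|^2, mel filter-bank energies, log, DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    log_mel = np.log(mel_filterbank(n_filters, n_fft, fs) @ spec + 1e-12)
    q, m = np.arange(n_ceps), np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(q, m + 0.5) / n_filters)  # DCT-II basis
    return dct @ log_mel

fs = 8000
frame = np.sin(2 * np.pi * 300 * np.arange(240) / fs)  # illustrative frame
coeffs = mfcc(frame, fs)                               # 12 MFCC coefficients
```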
and 4, step 4: complete feature selection based on a decision tree method to form a feature vector. Input: training data set D, feature set A and threshold e; output: decision tree T;
step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; set C_k as the class of that node and return T;
step 4.2: if A is an empty set, set T as a single-node tree, set the class C_k with the largest number of instances in D as the class of that node, and return T;
step 4.3: otherwise, calculate the information gain ratio of each feature in A with respect to D, and select the feature A_g with the maximum information gain ratio;
step 4.4: if the information gain ratio of A_g is less than the threshold e, set T as a single-node tree, set the class C_k with the largest number of instances in D as the class of that node, and return T;
step 4.5: otherwise, for each possible value a_i of A_g, split D by A_g = a_i into several non-empty subsets D_i; take the class with the largest number of instances in D_i as the label, construct child nodes, form the tree T from the node and its child nodes, and return T;
step 4.6: for child node i, with D_i as the training set and A - {A_g} as the feature set, call the above steps recursively to obtain the subtree T_i;
And 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result in a fixed time;
step 5.1: completing a lie detection experiment by using an SVM (support vector machine) based on the self-built lie detection corpus to obtain a speech lie detection result;
step 5.2: based on a CASIA standard emotion library, a speech emotion prediction result is output at regular time, the result is a sequence with dimension being related to speech duration, a speech emotion recognition result is output every second, and then a 60-dimensional vector is obtained for a 60s speech;
step 6: carrying out lie detection analysis by using the timed output voice emotion result;
step 6.1: dividing 80% of corpus samples into training samples and 20% of corpus samples into testing samples;
step 6.2: classifying the training samples according to labels, wherein the true speech part is 1, and the lie speech part is-1;
step 6.3: operating the corpus of the training sample according to the steps 1-5 to generate a voice emotion sequence;
step 6.4: describing an HMM model by using five elements, namely a hidden state S (lie condition), an observable state O (speech emotion sequence), a hidden state transition probability matrix A, an observed state transition matrix B and an initial state probability matrix pi;
step 6.5: training the HMM model using the training samples. Selecting several groups of observation sequences O (speech emotion sequences) from training samples, and solving parameters in a model lambda (A, B, Π) by using a Baum-Welch method; the specific method comprises the steps of initializing model parameters A, B and pi randomly, using a sample O to calculate and search more appropriate parameters, updating the parameters, and fitting the parameters by using the sample until the parameters are converged;
step 6.6: obtaining an HMM model λ ═ (a, B, Π), and predicting using the test sample itself; and selecting a voice emotion sequence from the voice emotion sequences, and solving the back state sequence by using a Viterbi method to obtain a predicted lie detection result.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention; all equivalent modifications or substitutions made to the above technical solutions belong to the scope of the present invention.
Claims (3)
1. A lie detection method based on fixed-duration speech emotion recognition sequence analysis, characterized by comprising the following steps:
step 1: establishing a Chinese lie detection corpus and processing the corpus;
step 2: preprocessing the voice;
and step 3: extracting voice emotional characteristics according to the voice time-frequency characteristics;
and 4, step 4: completing feature selection based on a decision tree method to form a feature vector;
and 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result in a fixed time;
step 6: carrying out lie detection analysis by using the timed output voice emotion result;
the step 1: establishing a Chinese lie detection corpus and processing the corpus; the method comprises the following specific steps:
step 1.1: establishing a Chinese lie detection corpus;
step 1.2: extracting true speech segments in the lie speech corpus;
step 1.3: dividing the real words and the lie words into equal-duration corpus sections, and sticking labels to facilitate subsequent experiments;
the step 2: preprocessing the voice, specifically as follows:
step 2.1: for discretizing the voice signal, pre-emphasis is carried out by using a first-order high-pass filter, wherein the expression of the first-order high-pass filter is as follows:
H(z)=1-αz-1,0.9<α<1.0
step 2.2: framing the signal, wherein the frame length is 30ms, and the frame shift is 10 ms;
step 2.3: selecting a Hamming window function, wherein the calculation formula is as follows:
and step 3: the speech emotion features are extracted, specifically as follows,
step 3.1: extract the short-time energy, i.e. the energy of one frame of speech. Let the speech signal be x(n), and let the i-th frame after framing with window function ω(n) be y_i(n); then y_i(n) satisfies:
y_i(n) = ω(n)·x((i-1)·inc + n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn
where ω(n) is the window function, y_i(n) is the i-th frame signal, inc is the frame shift length, L is the frame length, and fn is the total number of frames after framing. The short-time energy of the i-th frame is
E(i) = Σ y_i(n)^2 (sum over n = 1 to L)
Step 3.2: extracting a short-time average zero crossing rate which represents the number of times that the waveform of a signal in a frame of speech passes through a zero level; for discrete signals, if adjacent data changes a symbol once and does a zero crossing once, the speech signal is set as x (n), and the i-th frame speech signal after framing is yi(n) a short-time average zero-crossing rate of
step 3.3: extract the pitch frequency. The pitch period is the duration of one open-close cycle of the vocal cords, and the pitch frequency is its reciprocal. For the signal sequence x(n), its Fourier transform is
X(ω) = FFT[x(n)]
Then the sequence
ĉ(n) = FFT⁻¹[log|X(ω)|]
is called the cepstrum. Here FFT and FFT⁻¹ denote the Fourier transform and the inverse Fourier transform respectively; the physical unit of ĉ(n) is time (s).
Speech x(n) is produced by the glottal pulse excitation u(n) filtered by the vocal tract response v(n), i.e.
x(n) = u(n) * v(n)
and the cepstra of the three quantities satisfy
ĉ_x(n) = ĉ_u(n) + ĉ_v(n)
In the cepstral domain the glottal pulse excitation and the vocal tract response are relatively separated, so the glottal pulse excitation can be separated and recovered from ĉ_x(n), from which the pitch period is obtained;
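A sketch of cepstral pitch estimation along these lines: the excitation appears as a peak at quefrency fs/F0, well above the low-quefrency vocal-tract region. The 8 kHz rate, 512-sample frame, synthetic 250 Hz harmonic signal, and 150-400 Hz search range are illustrative assumptions:

```python
import numpy as np

def cepstral_pitch(frame, fs, f_min=150.0, f_max=400.0):
    """Estimate pitch from the real cepstrum c(n) = IFFT(log|FFT(x)|)
    by finding the peak in the expected pitch-quefrency range."""
    spectrum = np.abs(np.fft.fft(frame))
    cep = np.real(np.fft.ifft(np.log(spectrum + 1e-12)))
    q_lo, q_hi = int(fs / f_max), int(fs / f_min)
    q_peak = q_lo + np.argmax(cep[q_lo:q_hi])
    return fs / q_peak

# Synthetic voiced frame: 250 Hz fundamental with 12 harmonics at fs = 8 kHz.
fs, f0, n = 8000, 250.0, np.arange(512)
frame = sum(np.cos(2 * np.pi * k * f0 * n / fs) for k in range(1, 13))
f0_est = cepstral_pitch(frame, fs)
```

The cepstral peak lands at quefrency fs/F0 = 32 samples, recovering the 250 Hz fundamental.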
step 3.4: formants are the regions of relatively concentrated energy in the sound spectrum; they are extracted by the LPC method. One frame x(n) of the speech signal can be expressed by the difference equation
x(n) = Σ_{k=1}^{p} a_k·x(n-k) + e(n)
The corresponding vocal tract transfer function is
H(z) = 1 / (1 - Σ_{k=1}^{p} a_k·z^(-k))
Taking the modulus of the power spectrum, denoted P(f):
P(f) = |H(f)|²
where
z^(-1) = e^(-jωT)
The amplitude response of the power spectrum at any frequency is obtained with the FFT, and the formant information is found from this amplitude response;
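A sketch of autocorrelation-method LPC formant extraction as described above. Rather than scanning the FFT amplitude response, this variant reads the formant directly from the pole angles of A(z), an equivalent and common shortcut; the one-resonance test signal (pole pair at 1000 Hz, radius 0.95, fs = 8 kHz) is an illustrative assumption:

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method LPC: solve the Yule-Walker equations
    R a = r for the predictor a in x(n) ~ sum_k a_k x(n-k)."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def formant_frequencies(a, fs):
    """Roots of A(z) = 1 - sum_k a_k z^(-k); pole angles map to formants."""
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]    # keep one root of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs)

# Synthetic one-resonance "vocal tract": pole pair at 1000 Hz, radius 0.95.
fs, f_res, radius = 8000.0, 1000.0, 0.95
theta = 2 * np.pi * f_res / fs
a_true = np.array([2 * radius * np.cos(theta), -radius ** 2])
x = np.zeros(512)
x[0] = 1.0                               # impulse excitation
x[1] = a_true[0] * x[0]
for n in range(2, 512):                  # all-pole filter recursion
    x[n] = a_true[0] * x[n - 1] + a_true[1] * x[n - 2]

formants = formant_frequencies(lpc_coefficients(x, order=2), fs)
```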
step 3.5: extract the MFCC parameters. After preprocessing, the original signal x(n) becomes x_i(m), where i denotes the i-th frame. Perform an FFT on each frame:
X(i,k) = FFT[x_i(m)]
Then apply Mel filter bank processing: compute the spectral line energy of each frame of FFT data
E(i,k) = |X(i,k)|²
and compute the energy in the Mel filter bank, i.e. pass E(i,k) through a bank of triangular filters. Finally, take the logarithm of the filter bank energies and apply a Discrete Cosine Transform (DCT) to obtain the MFCC parameters;
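A compact sketch of the MFCC pipeline of step 3.5 (power spectrum, triangular mel filter bank, log, DCT). The filter and coefficient counts (26 and 13) and the 8 kHz test tone are conventional illustrative choices, not values from the claim:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs, f_max=None):
    """Triangular filters equally spaced on the mel scale
    mel(f) = 2595 * log10(1 + f/700)."""
    f_max = f_max or fs / 2
    mel_points = np.linspace(0.0, 2595.0 * np.log10(1.0 + f_max / 700.0),
                             n_filters + 2)
    hz = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fbank

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """E(i,k) = |X(i,k)|^2 -> mel filter bank -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    mel_energy = mel_filterbank(n_filters, n_fft, fs) @ power
    log_energy = np.log(mel_energy + 1e-12)
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * m + 1)
                 / (2 * n_filters))
    return dct @ log_energy

fs = 8000
frame = np.sin(2 * np.pi * 440.0 * np.arange(512) / fs)
coeffs = mfcc_frame(frame, fs)
```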
step 4: complete feature selection based on a decision tree method to form the feature vector. Input: training data set D, feature set A, threshold ε. Output: decision tree T;
step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; take C_k as the class of that node and return T;
step 4.2: if A is an empty set, T is a single-node tree; take the class C_k with the largest number of instances in D as the class of that node and return T;
step 4.3: otherwise, compute the information gain ratio of each feature in A with respect to D, and select the feature A_g with the maximum information gain ratio;
step 4.4: if the information gain ratio of A_g is less than the threshold ε, T is a single-node tree; take the class C_k with the largest number of instances in D as the class of that node and return T;
step 4.5: otherwise, for each possible value a_i of A_g, split D by A_g = a_i into several non-empty subsets D_i; take the class with the largest number of instances in D_i as the mark, construct child nodes, form the tree T from the node and its child nodes, and return T;
step 4.6: for the i-th child node, with D_i as the training set and A - {A_g} as the feature set, recursively call the above steps to obtain the subtree T_i.
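The selection criterion of step 4.3, the information gain ratio of C4.5, can be sketched on a toy dataset (the data values here are illustrative, not from the corpus):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature, labels):
    """Information gain of the feature divided by its split information,
    the criterion used in step 4.3."""
    n = len(labels)
    subsets = {}
    for f, y in zip(feature, labels):
        subsets.setdefault(f, []).append(y)
    cond = sum(len(s) / n * entropy(s) for s in subsets.values())
    gain = entropy(labels) - cond
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0

labels    = [1, 1, -1, -1]
feature_a = [0, 0, 1, 1]    # perfectly predicts the label
feature_b = [0, 1, 0, 1]    # carries no information about the label
best = max([feature_a, feature_b], key=lambda f: gain_ratio(f, labels))
```

The perfectly predictive feature gets gain ratio 1.0 and is selected; the uninformative one gets 0.0.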
2. The lie detection method based on fixed-duration speech emotion recognition sequence analysis according to claim 1, characterized in that:
step 5: train on the corpus with an SVM, predict the test speech, and output the speech emotion result at fixed time intervals, specifically as follows:
step 5.1: based on the self-built lie detection corpus, complete the lie detection experiment with an SVM (support vector machine) to obtain the speech lie detection result;
step 5.2: based on the CASIA standard emotion corpus, output a speech emotion prediction result at fixed intervals; the result is a sequence whose dimension depends on the speech duration: one speech emotion recognition result is output per second, so 60 s of speech yields a 60-dimensional vector.
3. The lie detection method based on fixed-duration speech emotion recognition sequence analysis according to claim 1, characterized in that:
step 6: perform lie detection analysis on the speech emotion results output at fixed intervals, specifically as follows:
step 6.1: divide the corpus samples into 80% training samples and 20% test samples;
step 6.2: classify the training samples by label, with the true speech part labeled 1 and the lie part labeled -1;
step 6.3: process the training-sample corpus according to steps 1-5 to generate the speech emotion sequences;
step 6.4: describe the HMM with five elements: the hidden state S is the deception state, the observable sequence O is the speech emotion sequence, plus the hidden state transition probability matrix A, the observation probability matrix B, and the initial state probability vector π;
step 6.5: train the HMM with the training samples, i.e. select several groups of observable sequences O (speech emotion sequences) from the training samples and solve for the parameters of the model λ = (A, B, π) with the Baum-Welch method. Specifically, randomly initialize the model parameters A, B and π, use the samples O to compute better parameter estimates, update the parameters, and keep fitting them with the samples until the parameters converge;
step 6.6: obtain the HMM λ = (A, B, π) and predict with the test samples: select a speech emotion sequence from them and solve for the optimal hidden state sequence with the Viterbi method to obtain the predicted lie detection result.
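The Viterbi decoding of step 6.6 can be sketched as follows. The 2-state model (0 = truth, 1 = lie) and all probability values are hypothetical illustrations, not the trained parameters of the patent:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state path for HMM lambda = (A, B, pi),
    given a discrete observation sequence."""
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))           # best path probability so far
    psi = np.zeros((T, n_states), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):             # trace the backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Illustrative 2-state model (0 = truth, 1 = lie) with hypothetical numbers;
# observations index discretized emotion labels.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 0, 1], pi, A, B)
```

For this observation sequence the decoded state path is [0, 0, 1]: two truthful segments followed by a deceptive one.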
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910659657.1A CN110265063B (en) | 2019-07-22 | 2019-07-22 | Lie detection method based on fixed duration speech emotion recognition sequence analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110265063A CN110265063A (en) | 2019-09-20 |
CN110265063B true CN110265063B (en) | 2021-09-24 |
Family
ID=67927523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910659657.1A Active CN110265063B (en) | 2019-07-22 | 2019-07-22 | Lie detection method based on fixed duration speech emotion recognition sequence analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265063B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110969106B (en) * | 2019-11-25 | 2023-04-18 | 东南大学 | Multi-mode lie detection method based on expression, voice and eye movement characteristics |
CN112006697B (en) * | 2020-06-02 | 2022-11-01 | 东南大学 | Voice signal-based gradient lifting decision tree depression degree recognition system |
CN112885370B (en) * | 2021-01-11 | 2024-05-31 | 广州欢城文化传媒有限公司 | Sound card validity detection method and device |
CN113163155B (en) * | 2021-04-30 | 2023-09-05 | 咪咕视讯科技有限公司 | User head portrait generation method and device, electronic equipment and storage medium |
CN115662447B (en) * | 2022-09-22 | 2023-04-07 | 北京邮电大学 | Lie detection analysis method and device based on multi-feature fusion |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1282445A (en) * | 1997-12-16 | 2001-01-31 | 阿维·卡梅尔 | Apparatus and methods for detecting emotions |
CN102890930A (en) * | 2011-07-19 | 2013-01-23 | 上海上大海润信息系统有限公司 | Speech emotion recognizing method based on hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model |
WO2015168606A1 (en) * | 2014-05-02 | 2015-11-05 | The Regents Of The University Of Michigan | Mood monitoring of bipolar disorder using speech analysis |
CN107705357A (en) * | 2017-09-11 | 2018-02-16 | 广东欧珀移动通信有限公司 | Lie detecting method and device |
CN108175426A (en) * | 2017-12-11 | 2018-06-19 | 东南大学 | A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition |
CN109493886A (en) * | 2018-12-13 | 2019-03-19 | 西安电子科技大学 | Speech-emotion recognition method based on feature selecting and optimization |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8972266B2 (en) * | 2002-11-12 | 2015-03-03 | David Bezar | User intent analysis extent of speaker intent analysis system |
Non-Patent Citations (3)
Title |
---|
《Detecting Deceptive Behavior via Integration of Discriminative Features From Multiple Modalities》;Mohamed Abouelenien et al.;《IEEE Transactions on Information Forensics and Security》;20170531;Vol. 12;full text *
《Application and Basic Research of Speech Emotion Recognition》;Lin Han, Zhang Kan;《Chinese Journal of Ergonomics》;20090630;Vol. 15, No. 2;pp. 64-66 *
《Research Status and Prospects of Speech Lie Detection Technology》;Zhao Li, Liang Ruining et al.;《Journal of Data Acquisition and Processing》;20170228;Vol. 3, No. 2;full text *
Also Published As
Publication number | Publication date |
---|---|
CN110265063A (en) | 2019-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110265063B (en) | Lie detection method based on fixed duration speech emotion recognition sequence analysis | |
CN106228977B (en) | Multi-mode fusion song emotion recognition method based on deep learning | |
CN103928023B (en) | A kind of speech assessment method and system | |
CN104900235B (en) | Method for recognizing sound-groove based on pitch period composite character parameter | |
CN102231278B (en) | Method and system for realizing automatic addition of punctuation marks in speech recognition | |
Sinith et al. | Emotion recognition from audio signals using Support Vector Machine | |
CN108922541B (en) | Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
CN110827857B (en) | Speech emotion recognition method based on spectral features and ELM | |
CN101777347B (en) | Model complementary Chinese accent identification method and system | |
CN101226743A (en) | Method for recognizing speaker based on conversion of neutral and affection sound-groove model | |
CN111798874A (en) | Voice emotion recognition method and system | |
Yusnita et al. | Malaysian English accents identification using LPC and formant analysis | |
CN100543840C (en) | Method for distinguishing speek person based on emotion migration rule and voice correction | |
Ramteke et al. | Phoneme boundary detection from speech: A rule based approach | |
Coro et al. | Psycho-acoustics inspired automatic speech recognition | |
CN114842878A (en) | Speech emotion recognition method based on neural network | |
CN109346107B (en) | LSTM-based method for inversely solving pronunciation of independent speaker | |
Rabiee et al. | Persian accents identification using an adaptive neural network | |
CN110838294A (en) | Voice verification method and device, computer equipment and storage medium | |
Rao et al. | Glottal excitation feature based gender identification system using ergodic HMM | |
Lee et al. | Speech emotion recognition using spectral entropy | |
Dharini et al. | CD-HMM Modeling for raga identification | |
Lindgren | Speech recognition using features extracted from phase space reconstructions | |
Lugger et al. | Extracting voice quality contours using discrete hidden Markov models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||