CN110265063B - Lie detection method based on fixed duration speech emotion recognition sequence analysis

Lie detection method based on fixed duration speech emotion recognition sequence analysis

Info

Publication number
CN110265063B
CN110265063B (application CN201910659657.1A)
Authority
CN
China
Prior art keywords: voice, speech, corpus, emotion, lie detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910659657.1A
Other languages
Chinese (zh)
Other versions
CN110265063A (en)
Inventor
李玉峰
黄永明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201910659657.1A
Publication of CN110265063A
Application granted
Publication of CN110265063B
Legal status: Active

Classifications

    • A61B5/164: Lie detection
    • A61B5/4803: Speech analysis specially adapted for diagnostic purposes
    • A61B5/7246: Details of waveform analysis using correlation, e.g. template matching or determination of similarity
    • G10L25/21: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Veterinary Medicine (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • Psychiatry (AREA)
  • Pathology (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Developmental Disabilities (AREA)
  • Educational Technology (AREA)
  • Psychology (AREA)
  • Social Psychology (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lie detection method based on fixed duration speech emotion recognition sequence analysis, which mainly comprises the following steps: first, the recorded lie detection corpus is processed into two classes of equal-length short utterances to facilitate subsequent experiments; the speech is then preprocessed by pre-emphasis, framing and windowing; speech emotion features are extracted from the time-frequency characteristics of the speech, including the pitch frequency, MFCCs, formants, short-time energy, the short-time average zero-crossing rate and their statistical features; a decision tree is used to select features and form the final feature vector; an SVM is trained on the corpus, the test speech is predicted, and speech emotion results are output at fixed time intervals; finally, lie detection analysis is performed on the emotion results output at fixed intervals. The invention selects features with a decision tree method and thus achieves higher accuracy, and it outputs the result as a vector, fully taking into account the emotion changes that occur during lying.

Description

Lie detection method based on fixed duration speech emotion recognition sequence analysis
Technical Field
The invention belongs to the field of non-contact lie detection, and in particular relates to a lie detection method based on fixed duration speech emotion recognition sequence analysis.
Background
Speech is the most direct and convenient way for people to communicate. Speech-based lie detection is non-contact: the recording equipment is simple, no complex apparatus is needed, the preparation time is short, and the subject is not placed under great psychological pressure, which improves the accuracy of the analysis. This makes it highly attractive for the work described here. Speech carries a great deal of information about the speaker, such as identity, gender, age and even personality. Early studies showed that speech also conveys the speaker's emotional state, implying that many reliable speech features are related to specific emotions. When a speaker is nervous or afraid, the fundamental frequency and speech rate tend to rise; when the speaker is dejected, they tend to fall. Lying is a complex psychophysiological process, and the accompanying speech shows clear emotional changes, so a great deal of psychological and emotional information can be obtained from acoustic features (fundamental frequency, speech duration, formant frequencies, and so on). Research on lie detection techniques that use speech features as cues started relatively late; most work has focused on the influence of individual acoustic features on speech-based lie detection, but so far no single feature can be used for lie detection effectively on its own.
Disclosure of Invention
In view of the problems in the prior art, the object of the invention is to provide a lie detection method based on fixed duration speech emotion recognition sequence analysis. The method extracts speech emotion features from the time-frequency characteristics of the speech, performs feature selection with a decision tree to form a final 14-dimensional feature vector, then uses an SVM to train on and predict from a self-built Chinese lie detection corpus, outputs a speech emotion sequence whose length depends on the speech duration, and uses an HMM to study the relationship between this sequence and deception.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: a lie detection method based on fixed duration speech emotion recognition sequence analysis, comprising the following steps:
Step 1: establishing a Chinese lie detection corpus and processing the corpus;
Step 2: preprocessing the speech;
Step 3: extracting speech emotion features from the time-frequency characteristics of the speech;
Step 4: completing feature selection based on a decision tree method to form a feature vector;
Step 5: training an SVM on the corpus, predicting the test speech and outputting speech emotion results at fixed time intervals;
Step 6: performing lie detection analysis on the speech emotion results output at fixed time intervals.
As an improvement of the present invention, step 1, establishing a Chinese lie detection corpus and processing the corpus, is specifically as follows:
Step 1.1: establishing a Chinese lie detection corpus;
Step 1.2: extracting the truthful speech segments from the lie detection corpus;
Step 1.3: dividing the truthful speech and the deceptive speech into corpus segments of equal duration and labeling them, so as to facilitate subsequent experiments.
As an improvement of the present invention, step 2, preprocessing the speech, is specifically as follows:
Step 2.1: the discretized speech signal is pre-emphasized with a first-order high-pass filter whose transfer function is
H(z) = 1 − αz⁻¹, 0.9 < α < 1.0
Step 2.2: the signal is framed with a frame length of 30 ms and a frame shift of 10 ms;
Step 2.3: a Hamming window is selected, whose calculation formula is
ω(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length in samples.
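As a concrete illustration of step 2, the following minimal Python sketch performs the same pre-emphasis, 30 ms / 10 ms framing and Hamming windowing; the function name preprocess, the choice α = 0.97 (any value in the stated range may be used) and the synthetic test signal are illustrative assumptions, not part of the patent.

    import numpy as np

    def preprocess(x, fs, alpha=0.97, frame_ms=30, shift_ms=10):
        """Pre-emphasis, framing and Hamming windowing (step 2)."""
        # Step 2.1: first-order high-pass pre-emphasis, H(z) = 1 - alpha*z^-1
        x = np.append(x[0], x[1:] - alpha * x[:-1])
        # Step 2.2: split into 30 ms frames with a 10 ms shift
        L = int(fs * frame_ms / 1000)            # frame length in samples
        inc = int(fs * shift_ms / 1000)          # frame shift in samples
        fn = 1 + max(0, (len(x) - L) // inc)     # total number of frames
        frames = np.stack([x[i * inc:i * inc + L] for i in range(fn)])
        # Step 2.3: apply a Hamming window to every frame
        return frames * np.hamming(L)

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(fs) / fs
        speech = np.sin(2 * np.pi * 220 * t)     # 1 s synthetic test signal
        print(preprocess(speech, fs).shape)      # (98, 480) at 16 kHz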
As an improvement of the present invention, step 3, extracting the speech emotion features, is specifically as follows.
Step 3.1: extracting the short-time energy, i.e. the energy of one frame of speech. Let the speech signal be x(n) and let the i-th frame obtained after framing with the window function ω(n) be y_i(n); then y_i(n) satisfies
y_i(n) = ω(n)·x((i − 1)·inc + n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn
where ω(n) is the window function, i is the frame index, L is the frame length, inc is the frame shift and fn is the total number of frames after framing. The short-time energy of the i-th frame of the speech signal is
E(i) = Σ_{n=1}^{L} y_i(n)², 1 ≤ i ≤ fn
Step 3.2: extracting the short-time average zero-crossing rate, which is the number of times the waveform of the signal in one frame of speech crosses the zero level. For a discrete signal, every sign change between adjacent samples counts as one zero crossing. With the speech signal x(n) and the i-th frame after framing y_i(n), the short-time average zero-crossing rate is
Z(i) = (1/2)·Σ_{n=1}^{L−1} |sgn[y_i(n + 1)] − sgn[y_i(n)]|
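Under the same assumptions, a short sketch of steps 3.1 and 3.2 computed on the windowed frame matrix returned by the preprocess helper above (that matrix layout is an assumption of this illustration):

    import numpy as np

    def short_time_energy(frames):
        """E(i) = sum_n y_i(n)^2 for each windowed frame y_i(n) (step 3.1)."""
        return np.sum(frames ** 2, axis=1)

    def short_time_zcr(frames):
        """Z(i) = 0.5 * sum_n |sgn(y_i(n+1)) - sgn(y_i(n))| (step 3.2)."""
        signs = np.sign(frames)
        signs[signs == 0] = 1          # count an exact zero as positive
        return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)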
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)]
Then the sequence
Figure GDA0003125129050000024
Balance
Figure GDA0003125129050000025
For cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,
Figure GDA0003125129050000026
the actual unit of (a) is time s.
Speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n)
The three quantities have a cepstrum
Figure GDA0003125129050000027
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived from
Figure GDA0003125129050000035
The glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
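A minimal sketch of the cepstral pitch estimate of step 3.3 for a single voiced, windowed frame; the 60-500 Hz search range and the function name are assumptions for illustration (unvoiced frames would need a separate voicing decision):

    import numpy as np

    def pitch_from_cepstrum(frame, fs, fmin=60.0, fmax=500.0):
        """Estimate the pitch frequency of one voiced frame via the cepstrum (step 3.3)."""
        spectrum = np.fft.fft(frame)
        # cepstrum: inverse FFT of the log magnitude spectrum
        cep = np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real
        # the pitch period shows up as a peak in the plausible quefrency range
        qmin, qmax = int(fs / fmax), int(fs / fmin)
        period = qmin + np.argmax(cep[qmin:qmax])
        return fs / period             # pitch frequency = 1 / pitch period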
Step 3.4: extracting the formants, i.e. the regions of the sound spectrum where the energy is relatively concentrated, with the LPC method. One frame x(n) of the speech signal can be expressed by the difference equation
x(n) = Σ_{i=1}^{p} a_i·x(n − i) + e(n)
where a_i are the linear prediction coefficients, p is the prediction order and e(n) is the prediction error. The corresponding vocal tract transfer function is
H(z) = G / (1 − Σ_{i=1}^{p} a_i·z⁻ⁱ)
The power spectrum, denoted P(f), is the squared magnitude of this response,
P(f) = |H(f)|²
where
z⁻¹ = e^(−jωT)
The magnitude response of the power spectrum at any frequency is obtained with the FFT, and the formant information is found from this magnitude response;
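A sketch of the LPC formant estimation of step 3.4, using the autocorrelation method to solve for the prediction coefficients and peak-picking on the power spectrum P(f) = |H(f)|²; the prediction order 12 and the scipy-based peak search are assumptions of this illustration:

    import numpy as np
    from scipy.linalg import toeplitz
    from scipy.signal import freqz, find_peaks

    def lpc_coefficients(frame, order=12):
        """Autocorrelation-method LPC: solve the normal equations R a = r."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = np.linalg.solve(toeplitz(r[:order]), r[1:order + 1])
        return np.concatenate(([1.0], -a))        # denominator A(z) of H(z) = G / A(z)

    def formants(frame, fs, order=12):
        """Locate formants as peaks of the LPC power spectrum |H(f)|^2 (step 3.4)."""
        A = lpc_coefficients(frame, order)
        w, h = freqz([1.0], A, worN=1024, fs=fs)  # amplitude response of H(f)
        peaks, _ = find_peaks(np.abs(h) ** 2)     # peaks of the power spectrum P(f)
        return w[peaks]                           # formant frequency estimates in Hz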
Step 3.5: extracting the MFCC parameters. After preprocessing, the original signal x(n) is divided into frames x_i(m), where i denotes the i-th frame, and an FFT is applied to each frame:
X(i, k) = FFT[x_i(m)]
The spectral line energy of each frame of FFT data is
E(i, k) = |X(i, k)|²
and the energy within the Mel filter bank (a bank of M triangular filters H_m(k)) is
S(i, m) = Σ_k E(i, k)·H_m(k), 1 ≤ m ≤ M
The logarithm of the filter-bank energies is taken, and the MFCC parameters are obtained through the discrete cosine transform (DCT):
mfcc(i, n) = √(2/M)·Σ_{m=1}^{M} ln[S(i, m)]·cos(πn(2m − 1)/(2M))
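The MFCC pipeline of step 3.5 follows the chain FFT, line energy, Mel filter bank, logarithm and DCT. The sketch below mirrors those formulas, borrowing only the triangular Mel filter bank from librosa; the choices of 26 filters and 13 coefficients are assumptions of this illustration:

    import numpy as np
    import librosa
    from scipy.fft import dct

    def mfcc_frames(frames, fs, n_mels=26, n_mfcc=13):
        """MFCCs per frame: FFT -> line energy -> Mel filters -> log -> DCT (step 3.5)."""
        n_fft = frames.shape[1]
        X = np.fft.rfft(frames, n=n_fft, axis=1)            # X(i, k)
        E = np.abs(X) ** 2                                   # E(i, k) = |X(i, k)|^2
        mel_fb = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels)
        S = E @ mel_fb.T                                     # S(i, m): energy per triangular filter
        return dct(np.log(S + 1e-12), type=2, axis=1, norm="ortho")[:, :n_mfcc]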
As an improvement of the present invention, step 4, completing feature selection based on a decision tree method to form a feature vector, takes as input a training data set D, a feature set A and a threshold e, and outputs a decision tree T:
Step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; C_k is taken as the class of that node and T is returned;
Step 4.2: if A is an empty set, T is set to a single-node tree; the class C_k with the largest number of instances in D is taken as the class of that node and T is returned;
Step 4.3: otherwise, the information gain ratio of each feature in A with respect to D is computed, and the feature A_g with the largest information gain ratio is selected;
Step 4.4: if the information gain ratio of A_g is less than the threshold e, T is set to a single-node tree; the class C_k with the largest number of instances in D is taken as the class of that node and T is returned;
Step 4.5: otherwise, for each possible value a_i of A_g, D is split according to A_g = a_i into non-empty subsets D_i; the class with the largest number of instances in D_i is used as the label to construct a child node; the tree T is formed by the node and its children, and T is returned;
Step 4.6: for the i-th child node, with D_i as the training set and A − {A_g} as the feature set, the above steps are called recursively to obtain the subtree T_i.
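The quantity that drives steps 4.3 and 4.4 is the information gain ratio. The sketch below scores and selects discretized features with that criterion; it illustrates the selection idea rather than the full recursive tree construction, and the requirement that continuous features be binned beforehand is an assumption of this illustration.

    import numpy as np

    def entropy(labels):
        """Empirical entropy H(D) of a label vector."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def gain_ratio(feature_values, labels):
        """Information gain ratio of one discrete feature with respect to D (step 4.3)."""
        h_d, cond, split_info = entropy(labels), 0.0, 0.0
        for v in np.unique(feature_values):
            mask = feature_values == v
            w = mask.mean()
            cond += w * entropy(labels[mask])    # conditional entropy H(D | A)
            split_info -= w * np.log2(w)         # intrinsic information of the split
        return (h_d - cond) / split_info if split_info > 0 else 0.0

    def select_features(X_discrete, y, threshold=0.0):
        """Keep the features whose gain ratio exceeds the threshold e (steps 4.3-4.4)."""
        scores = np.array([gain_ratio(X_discrete[:, j], y)
                           for j in range(X_discrete.shape[1])])
        return np.where(scores > threshold)[0], scores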
As a modification of the present invention, the step 5: training a corpus by using an SVM, predicting the tested voice and outputting a voice emotion result at regular time,
step 5.1: completing a lie detection experiment by using an SVM (support vector machine) based on the self-built lie detection corpus to obtain a speech lie detection result;
step 5.2: based on the CASIA standard emotion library, a speech emotion prediction result is output at regular time, the result is a sequence with dimension being related to the duration of speech, a speech emotion recognition result is output every second, and then a 60-dimensional vector is obtained for 60s of speech.
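A sketch of step 5 under the assumption that a feature_fn helper maps one second of speech to the selected feature vector; that helper, the RBF kernel and its default parameters are assumptions of this illustration (the patent does not fix the SVM kernel):

    import numpy as np
    from sklearn.svm import SVC

    def train_emotion_svm(features, labels):
        """Train an SVM emotion classifier on per-segment feature vectors (step 5.1)."""
        clf = SVC(kernel="rbf", C=1.0, gamma="scale")
        clf.fit(features, labels)
        return clf

    def emotion_sequence(clf, speech, fs, feature_fn):
        """One emotion label per second of speech: a duration-dependent sequence (step 5.2)."""
        sec = int(fs)                                    # samples per second
        chunks = [speech[i:i + sec] for i in range(0, len(speech) - sec + 1, sec)]
        return np.array([clf.predict(feature_fn(c, fs).reshape(1, -1))[0]
                         for c in chunks])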
As a modification of the present invention, the step 6: the lie detection analysis is performed by outputting the speech emotion result at regular time, specifically as follows,
step 6.1: dividing 80% of corpus samples into training samples and 20% of corpus samples into testing samples;
step 6.2: classifying the training samples according to labels, wherein the true speech part is 1, and the lie speech part is-1;
step 6.3: operating the corpus of the training sample according to the steps 1-5 to generate a voice emotion sequence;
step 6.4: describing an HMM model by using five elements, namely a hidden state S (lie condition), an observable state O (speech emotion sequence), a hidden state transition probability matrix A, an observed state transition matrix B and an initial state probability matrix pi;
step 6.5: training the HMM model using the training samples. Selecting several groups of observation sequences O (speech emotion sequences) from training samples, and solving parameters in a model lambda (A, B, Π) by using a Baum-Welch method; the specific method comprises the steps of initializing model parameters A, B and pi randomly, using a sample O to calculate and search more appropriate parameters, updating the parameters, and fitting the parameters by using the sample until the parameters are converged;
step 6.6: obtaining an HMM model λ ═ (a, B, Π), and predicting using the test sample itself; and selecting a voice emotion sequence from the voice emotion sequences, and solving the back state sequence by using a Viterbi method to obtain a predicted lie detection result.
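Step 6.6 reduces to Viterbi decoding of a discrete-observation HMM. The sketch below decodes an emotion sequence with given parameters λ = (A, B, Π); the toy parameter values are assumptions for illustration, and in practice A, B and Π would first be fitted with Baum-Welch (step 6.5), for example with an off-the-shelf HMM library or a forward-backward implementation.

    import numpy as np

    def viterbi(obs, A, B, pi):
        """Most likely hidden state path for a discrete observation sequence (step 6.6)."""
        n_states, T = A.shape[0], len(obs)
        delta = np.zeros((T, n_states))                  # best log-probability so far
        psi = np.zeros((T, n_states), dtype=int)         # back-pointers
        delta[0] = np.log(pi) + np.log(B[:, obs[0]])
        for t in range(1, T):
            scores = delta[t - 1][:, None] + np.log(A)   # score of every i -> j transition
            psi[t] = np.argmax(scores, axis=0)
            delta[t] = scores[psi[t], np.arange(n_states)] + np.log(B[:, obs[t]])
        path = np.zeros(T, dtype=int)
        path[-1] = np.argmax(delta[-1])
        for t in range(T - 2, -1, -1):                   # backtrack along the stored pointers
            path[t] = psi[t + 1, path[t + 1]]
        return path

    if __name__ == "__main__":
        # toy example: 2 hidden states (truth / lie) and 6 observable emotion labels
        A = np.array([[0.8, 0.2], [0.3, 0.7]])
        B = np.array([[0.3, 0.2, 0.2, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.2, 0.2, 0.3]])
        pi = np.array([0.5, 0.5])
        print(viterbi(np.array([0, 3, 3, 5, 1]), A, B, pi))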
Compared with the prior art, the invention has the following beneficial effects:
1. Speech emotion features are numerous and include acoustic, prosodic and spectral features, but not every feature is helpful for lie detection; adding ineffective features only wastes training time and may even lower the recognition rate. The invention therefore selects features with a decision tree method and thus achieves higher accuracy.
2. The influence of speech emotion on lie detection is taken into account by generating a duration-dependent speech emotion sequence and studying its relationship to deception with a hidden Markov model (HMM).
Drawings
FIG. 1 is a flow chart of a lie detection method based on fixed duration speech emotion recognition sequence analysis according to the present invention;
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1: as shown in FIG. 1, the invention provides a lie detection method based on fixed duration speech emotion recognition sequence analysis, the detailed steps of which are as follows:
Step 1: establishing a Chinese lie detection corpus and processing the corpus;
Step 1.1: establishing a Chinese lie detection corpus;
Step 1.2: extracting the truthful speech segments from the lie detection corpus;
Step 1.3: dividing the truthful speech and the deceptive speech into corpus segments of equal duration and labeling them, so as to facilitate subsequent experiments;
Step 2: preprocessing the speech;
Step 2.1: the discretized speech signal is pre-emphasized with a first-order high-pass filter whose transfer function is
H(z) = 1 − αz⁻¹, 0.9 < α < 1.0;
Step 2.2: the signal is framed with a frame length of 30 ms and a frame shift of 10 ms;
Step 2.3: a Hamming window is selected, whose calculation formula is
ω(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length in samples;
Step 3: extracting the speech emotion features;
Step 3.1: extracting the short-time energy, i.e. the energy of one frame of speech. Let the speech signal be x(n) and let the i-th frame obtained after framing with the window function ω(n) be y_i(n); then y_i(n) satisfies
y_i(n) = ω(n)·x((i − 1)·inc + n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn,
where ω(n) is the window function, i is the frame index, L is the frame length, inc is the frame shift and fn is the total number of frames after framing. The short-time energy of the i-th frame of the speech signal is
E(i) = Σ_{n=1}^{L} y_i(n)², 1 ≤ i ≤ fn
Step 3.2: extracting the short-time average zero-crossing rate, which is the number of times the waveform of the signal in one frame of speech crosses the zero level. For a discrete signal, every sign change between adjacent samples counts as one zero crossing. With the speech signal x(n) and the i-th frame after framing y_i(n), the short-time average zero-crossing rate is
Z(i) = (1/2)·Σ_{n=1}^{L−1} |sgn[y_i(n + 1)] − sgn[y_i(n)]|
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)];
Then the sequence
Figure GDA0003125129050000063
Balance
Figure GDA0003125129050000064
For cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,
Figure GDA0003125129050000065
the actual unit of (a) is time s.
Speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n);
The three quantities have a cepstrum
Figure GDA0003125129050000066
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived from
Figure GDA0003125129050000067
The glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
Step 3.4: extracting the formants, i.e. the regions of the sound spectrum where the energy is relatively concentrated, with the LPC method. One frame x(n) of the speech signal can be expressed by the difference equation
x(n) = Σ_{i=1}^{p} a_i·x(n − i) + e(n)
where a_i are the linear prediction coefficients, p is the prediction order and e(n) is the prediction error. The corresponding vocal tract transfer function is
H(z) = G / (1 − Σ_{i=1}^{p} a_i·z⁻ⁱ)
The power spectrum, denoted P(f), is the squared magnitude of this response,
P(f) = |H(f)|²
where
z⁻¹ = e^(−jωT)
The magnitude response of the power spectrum at any frequency is obtained with the FFT, and the formant information is found from this magnitude response;
Step 3.5: extracting the MFCC parameters. After preprocessing, the original signal x(n) is divided into frames x_i(m), where i denotes the i-th frame, and an FFT is applied to each frame:
X(i, k) = FFT[x_i(m)];
The spectral line energy of each frame of FFT data is
E(i, k) = |X(i, k)|²
and the energy within the Mel filter bank (a bank of M triangular filters H_m(k)) is
S(i, m) = Σ_k E(i, k)·H_m(k), 1 ≤ m ≤ M
The logarithm of the filter-bank energies is taken, and the MFCC parameters are obtained through the discrete cosine transform (DCT):
mfcc(i, n) = √(2/M)·Σ_{m=1}^{M} ln[S(i, m)]·cos(πn(2m − 1)/(2M))
Step 4: completing feature selection based on a decision tree method to form a feature vector, with input: a training data set D, a feature set A and a threshold e; and output: a decision tree T;
Step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; C_k is taken as the class of that node and T is returned;
Step 4.2: if A is an empty set, T is set to a single-node tree; the class C_k with the largest number of instances in D is taken as the class of that node and T is returned;
Step 4.3: otherwise, the information gain ratio of each feature in A with respect to D is computed, and the feature A_g with the largest information gain ratio is selected;
Step 4.4: if the information gain ratio of A_g is less than the threshold e, T is set to a single-node tree; the class C_k with the largest number of instances in D is taken as the class of that node and T is returned;
Step 4.5: otherwise, for each possible value a_i of A_g, D is split according to A_g = a_i into non-empty subsets D_i; the class with the largest number of instances in D_i is used as the label to construct a child node; the tree T is formed by the node and its children, and T is returned;
Step 4.6: for the i-th child node, with D_i as the training set and A − {A_g} as the feature set, the above steps are called recursively to obtain the subtree T_i.
Step 5: training an SVM on the corpus, predicting the test speech and outputting speech emotion results at fixed time intervals;
Step 5.1: based on the self-built lie detection corpus, a lie detection experiment is carried out with a support vector machine (SVM) to obtain a speech-based lie detection result;
Step 5.2: based on the CASIA standard emotion corpus, a speech emotion prediction result is output at fixed time intervals; the result is a sequence whose dimension depends on the speech duration, with one speech emotion recognition result output per second, so a 60 s utterance yields a 60-dimensional vector;
Step 6: performing lie detection analysis on the speech emotion results output at fixed time intervals;
Step 6.1: 80% of the corpus samples are used as training samples and 20% as test samples;
Step 6.2: the training samples are labeled, with the truthful speech labeled 1 and the deceptive speech labeled −1;
Step 6.3: the training corpus is processed according to steps 1-5 to generate the speech emotion sequences;
Step 6.4: the HMM is described by five elements, namely the hidden states S (the truth/lie condition), the observable states O (the speech emotion sequence), the hidden-state transition probability matrix A, the observation probability matrix B and the initial state probability distribution Π;
Step 6.5: the HMM is trained with the training samples: several groups of observation sequences O (speech emotion sequences) are selected from the training samples, and the parameters of the model λ = (A, B, Π) are solved with the Baum-Welch method; specifically, the model parameters A, B and Π are initialized randomly, the samples O are used to compute more suitable parameters, the parameters are updated, and the fitting is repeated until the parameters converge;
Step 6.6: with the trained HMM λ = (A, B, Π), prediction is performed on the test samples: for each speech emotion sequence, the underlying hidden state sequence is solved with the Viterbi method to obtain the predicted lie detection result.
It should be noted that the above-described embodiment is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention; all equivalent replacements or modifications made to the above technical solution fall within the scope of the present invention.

Claims (3)

1. A lie detection method based on fixed duration speech emotion recognition sequence analysis, characterized by comprising the following steps:
Step 1: establishing a Chinese lie detection corpus and processing the corpus;
Step 2: preprocessing the speech;
Step 3: extracting speech emotion features from the time-frequency characteristics of the speech;
Step 4: completing feature selection based on a decision tree method to form a feature vector;
Step 5: training an SVM on the corpus, predicting the test speech and outputting speech emotion results at fixed time intervals;
Step 6: performing lie detection analysis on the speech emotion results output at fixed time intervals;
The step 1, establishing a Chinese lie detection corpus and processing the corpus, specifically comprises:
Step 1.1: establishing a Chinese lie detection corpus;
Step 1.2: extracting the truthful speech segments from the lie detection corpus;
Step 1.3: dividing the truthful speech and the deceptive speech into corpus segments of equal duration and labeling them, so as to facilitate subsequent experiments;
The step 2, preprocessing the speech, is specifically as follows:
Step 2.1: the discretized speech signal is pre-emphasized with a first-order high-pass filter whose transfer function is
H(z) = 1 − αz⁻¹, 0.9 < α < 1.0
Step 2.2: the signal is framed with a frame length of 30 ms and a frame shift of 10 ms;
Step 2.3: a Hamming window is selected, whose calculation formula is
ω(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the frame length in samples;
The step 3, extracting the speech emotion features, is specifically as follows:
Step 3.1: extracting the short-time energy, i.e. the energy of one frame of speech. Let the speech signal be x(n) and let the i-th frame obtained after framing with the window function ω(n) be y_i(n); then y_i(n) satisfies
y_i(n) = ω(n)·x((i − 1)·inc + n), 1 ≤ n ≤ L, 1 ≤ i ≤ fn
where ω(n) is the window function, i is the frame index, L is the frame length, inc is the frame shift and fn is the total number of frames after framing; the short-time energy of the i-th frame of the speech signal is
E(i) = Σ_{n=1}^{L} y_i(n)², 1 ≤ i ≤ fn
Step 3.2: extracting the short-time average zero-crossing rate, which is the number of times the waveform of the signal in one frame of speech crosses the zero level; for a discrete signal, every sign change between adjacent samples counts as one zero crossing; with the speech signal x(n) and the i-th frame after framing y_i(n), the short-time average zero-crossing rate is
Z(i) = (1/2)·Σ_{n=1}^{L−1} |sgn[y_i(n + 1)] − sgn[y_i(n)]|
Step 3.3: extracting the pitch frequency, the pitch period being the duration of one time the vocal cords are opened and closed, the pitch frequency being its inverse, its Fourier transform being the Fourier transform of the signal sequence x (n) when it is x (n)
X(ω)=FFT[x(n)]
Then the sequence
Figure FDA0003201381700000023
Balance
Figure FDA0003201381700000024
For cepstrum, abbreviated cepstrum, here FFT and FFT-1Respectively a fourier transform and an inverse fourier transform,
Figure FDA0003201381700000025
the actual unit of (a) is time s;
speech x (n) is obtained by glottal pulse excitation u (n) filtered by vocal tract response v (n), i.e.
x(n)=u(n)*v(n)
The three quantities have a cepstrum
Figure FDA0003201381700000026
In the cepstrum, the glottal pulse excitation and the vocal tract response are relatively separated, and thus derived from
Figure FDA0003201381700000027
The glottal pulse excitation can be separated and recovered, so that a pitch period is obtained;
Step 3.4: extracting the formants, i.e. the regions of the sound spectrum where the energy is relatively concentrated, with the LPC method. One frame x(n) of the speech signal can be expressed by the difference equation
x(n) = Σ_{i=1}^{p} a_i·x(n − i) + e(n)
where a_i are the linear prediction coefficients, p is the prediction order and e(n) is the prediction error; the corresponding vocal tract transfer function is
H(z) = G / (1 − Σ_{i=1}^{p} a_i·z⁻ⁱ)
the power spectrum, denoted P(f), is the squared magnitude of this response,
P(f) = |H(f)|²
where
z⁻¹ = e^(−jωT)
the magnitude response of the power spectrum at any frequency is obtained with the FFT, and the formant information is found from this magnitude response;
Step 3.5: extracting the MFCC parameters. After preprocessing, the original signal x(n) is divided into frames x_i(m), where i denotes the i-th frame, and an FFT is applied to each frame:
X(i, k) = FFT[x_i(m)]
the spectral line energy of each frame of FFT data is
E(i, k) = |X(i, k)|²
the energy within the Mel filter bank, i.e. a bank of M triangular filters H_m(k), is calculated as
S(i, m) = Σ_k E(i, k)·H_m(k), 1 ≤ m ≤ M
the logarithm of the filter-bank energies is taken, and the MFCC parameters are obtained through the discrete cosine transform (DCT):
mfcc(i, n) = √(2/M)·Σ_{m=1}^{M} ln[S(i, m)]·cos(πn(2m − 1)/(2M))
The step 4, completing feature selection based on a decision tree method to form a feature vector, takes as input a training data set D, a feature set A and a threshold e, and outputs a decision tree T:
Step 4.1: if all instances in D belong to the same class C_k, T is a single-node tree; C_k is taken as the class of that node and T is returned;
Step 4.2: if A is an empty set, T is set to a single-node tree; the class C_k with the largest number of instances in D is taken as the class of that node and T is returned;
Step 4.3: otherwise, the information gain ratio of each feature in A with respect to D is computed, and the feature A_g with the largest information gain ratio is selected;
Step 4.4: if the information gain ratio of A_g is less than the threshold e, T is set to a single-node tree; the class C_k with the largest number of instances in D is taken as the class of that node and T is returned;
Step 4.5: otherwise, for each possible value a_i of A_g, D is split according to A_g = a_i into non-empty subsets D_i; the class with the largest number of instances in D_i is used as the label to construct a child node; the tree T is formed by the node and its children, and T is returned;
Step 4.6: for the i-th child node, with D_i as the training set and A − {A_g} as the feature set, the above steps are called recursively to obtain the subtree T_i.
2. The lie detection method based on fixed duration speech emotion recognition sequence analysis according to claim 1, characterized in that the step 5, training an SVM on the corpus, predicting the test speech and outputting speech emotion results at fixed time intervals, is specifically as follows:
Step 5.1: based on the self-built lie detection corpus, a lie detection experiment is carried out with a support vector machine (SVM) to obtain a speech-based lie detection result;
Step 5.2: based on the CASIA standard emotion corpus, a speech emotion prediction result is output at fixed time intervals; the result is a sequence whose dimension depends on the speech duration, with one speech emotion recognition result output per second, so a 60 s utterance yields a 60-dimensional vector.
3. The lie detection method based on fixed duration speech emotion recognition sequence analysis according to claim 1, characterized in that the step 6, performing lie detection analysis on the speech emotion results output at fixed time intervals, is specifically as follows:
Step 6.1: 80% of the corpus samples are used as training samples and 20% as test samples;
Step 6.2: the training samples are labeled, with the truthful speech labeled 1 and the deceptive speech labeled −1;
Step 6.3: the training corpus is processed according to steps 1-5 to generate the speech emotion sequences;
Step 6.4: the HMM is described by five elements, namely the hidden states S (the truth/lie condition), the observable sequence O (the speech emotion sequence), the hidden-state transition probability matrix A, the observation probability matrix B and the initial state probability distribution Π;
Step 6.5: the HMM is trained with the training samples: several groups of observable sequences O, i.e. speech emotion sequences, are selected from the training samples, and the parameters of the model λ = (A, B, Π) are solved with the Baum-Welch method; specifically, the model parameters A, B and Π are initialized randomly, the samples O are used to compute more suitable parameters, the parameters are updated, and the fitting is repeated until the parameters converge;
Step 6.6: with the trained HMM λ = (A, B, Π), prediction is performed on the test samples: for each speech emotion sequence, the underlying hidden state sequence is solved with the Viterbi method to obtain the predicted lie detection result.
CN201910659657.1A 2019-07-22 2019-07-22 Lie detection method based on fixed duration speech emotion recognition sequence analysis Active CN110265063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910659657.1A CN110265063B (en) 2019-07-22 2019-07-22 Lie detection method based on fixed duration speech emotion recognition sequence analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910659657.1A CN110265063B (en) 2019-07-22 2019-07-22 Lie detection method based on fixed duration speech emotion recognition sequence analysis

Publications (2)

Publication Number Publication Date
CN110265063A CN110265063A (en) 2019-09-20
CN110265063B (en) 2021-09-24

Family

ID=67927523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910659657.1A Active CN110265063B (en) 2019-07-22 2019-07-22 Lie detection method based on fixed duration speech emotion recognition sequence analysis

Country Status (1)

Country Link
CN (1) CN110265063B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969106B (en) * 2019-11-25 2023-04-18 东南大学 Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN112006697B (en) * 2020-06-02 2022-11-01 东南大学 Voice signal-based gradient lifting decision tree depression degree recognition system
CN112885370B (en) * 2021-01-11 2024-05-31 广州欢城文化传媒有限公司 Sound card validity detection method and device
CN113163155B (en) * 2021-04-30 2023-09-05 咪咕视讯科技有限公司 User head portrait generation method and device, electronic equipment and storage medium
CN115662447B (en) * 2022-09-22 2023-04-07 北京邮电大学 Lie detection analysis method and device based on multi-feature fusion


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8972266B2 (en) * 2002-11-12 2015-03-03 David Bezar User intent analysis extent of speaker intent analysis system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1282445A (en) * 1997-12-16 2001-01-31 阿维·卡梅尔 Apparatus and methods for detecting emotions
CN102890930A (en) * 2011-07-19 2013-01-23 上海上大海润信息系统有限公司 Speech emotion recognizing method based on hidden Markov model (HMM) / self-organizing feature map neural network (SOFMNN) hybrid model
WO2015168606A1 (en) * 2014-05-02 2015-11-05 The Regents Of The University Of Michigan Mood monitoring of bipolar disorder using speech analysis
CN107705357A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Lie detecting method and device
CN108175426A (en) * 2017-12-11 2018-06-19 东南大学 A kind of lie detecting method that Boltzmann machine is limited based on depth recursion type condition
CN109493886A (en) * 2018-12-13 2019-03-19 西安电子科技大学 Speech-emotion recognition method based on feature selecting and optimization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Mohamed Abouelenien et al., "Detecting Deceptive Behavior via Integration of Discriminative Features From Multiple Modalities," IEEE Transactions on Information Forensics and Security, vol. 12, 31 May 2017, full text. *
林菡, 张侃, "Application and Fundamental Research of Speech Emotion Recognition" (语音情绪识别的应用和基础研究), Chinese Journal of Ergonomics (人类工效学), vol. 15, no. 2, 30 June 2009, pp. 64-66. *
赵力, 梁瑞宁 et al., "Research Status and Prospects of Speech-Based Lie Detection Technology" (语音测谎技术研究现状与展望), Journal of Data Acquisition and Processing (数据采集与处理), vol. 3, no. 2, 28 Feb 2017, full text. *

Also Published As

Publication number Publication date
CN110265063A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
CN106228977B (en) Multi-mode fusion song emotion recognition method based on deep learning
CN103928023B (en) A kind of speech assessment method and system
CN104900235B (en) Method for recognizing sound-groove based on pitch period composite character parameter
CN102231278B (en) Method and system for realizing automatic addition of punctuation marks in speech recognition
Sinith et al. Emotion recognition from audio signals using Support Vector Machine
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
CN110827857B (en) Speech emotion recognition method based on spectral features and ELM
CN101777347B (en) Model complementary Chinese accent identification method and system
CN101226743A (en) Method for recognizing speaker based on conversion of neutral and affection sound-groove model
CN111798874A (en) Voice emotion recognition method and system
Yusnita et al. Malaysian English accents identification using LPC and formant analysis
CN100543840C (en) Method for distinguishing speek person based on emotion migration rule and voice correction
Ramteke et al. Phoneme boundary detection from speech: A rule based approach
Coro et al. Psycho-acoustics inspired automatic speech recognition
CN114842878A (en) Speech emotion recognition method based on neural network
CN109346107B (en) LSTM-based method for inversely solving pronunciation of independent speaker
Rabiee et al. Persian accents identification using an adaptive neural network
CN110838294A (en) Voice verification method and device, computer equipment and storage medium
Rao et al. Glottal excitation feature based gender identification system using ergodic HMM
Lee et al. Speech emotion recognition using spectral entropy
Dharini et al. CD-HMM Modeling for raga identification
Lindgren Speech recognition using features extracted from phase space reconstructions
Lugger et al. Extracting voice quality contours using discrete hidden Markov models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant