CN112259126B - Robot and method for assisting in identifying autism voice features - Google Patents

Info

Publication number
CN112259126B
Authority
CN
China
Prior art keywords
voice
layer
autism
recognition model
network
Prior art date
Legal status
Active
Application number
CN202011016520.3A
Other languages
Chinese (zh)
Other versions
CN112259126A (en)
Inventor
陈首彦
张铭焰
杨晓芬
赵志甲
朱大昌
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202011016520.3A
Publication of CN112259126A
Application granted
Publication of CN112259126B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4803 Speech analysis specially adapted for diagnostic purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Hospice & Palliative Care (AREA)
  • Epidemiology (AREA)
  • Psychiatry (AREA)
  • Pathology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Medical Informatics (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Veterinary Medicine (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot and a method for assisting in identifying autism voice features. The robot comprises the following components: an autism voice feature recognition model construction and training unit, which builds an autism voice feature recognition model from a long short-term memory (LSTM) neural network and a convolutional neural network (CNN), inputs quantized voice features into the model as sensing signals, learns how the voice features manifest themselves in the sensing signals, and trains the model by back propagation so that the network weights of the classifier are optimized, finally obtaining an autism voice feature recognition model that can be used for voice signal recognition; a voice acquisition unit, which collects voice information from the tested person while the robot interacts with the tested person; a voice information preprocessing unit, which preprocesses the collected voice information and quantizes the voice features into M-dimensional voice feature vectors; and a voice feature recognition unit, which performs voice feature recognition on the preprocessed voice signal using the trained model.

Description

Robot and method for assisting in identifying autism voice features
Technical Field
The invention relates to the technical field of speech emotion recognition, and in particular to a robot and method for assisted recognition of autism voice features based on a long short-term memory (LSTM) network and a convolutional neural network (CNN).
Background
Autism Spectrum Disorder (ASD), also known as autism, has attracted increasing social attention. In China, the number of autistic children between 0 and 14 years of age is estimated at 3 to 5 million. Existing assessment methods for autism focus mainly on three aspects: language and communication impairment, social interaction impairment, and repetitive, stereotyped behaviour. Effective and accurate assessment of ASD requires a clinically experienced medical professional to observe and test the child. This approach requires a large amount of manpower to collate the data, is inefficient, carries a degree of human subjectivity, and its assessment results can therefore contain large errors.
On the other hand, existing speech emotion recognition methods mainly include methods based on deep belief networks, on long short-term memory (LSTM) networks, and on convolutional neural networks (CNN). The main drawback of these three approaches is that they cannot combine the strengths of the individual network models. For example, a deep belief network can take a one-dimensional sequence as input but cannot exploit the correlation between earlier and later parts of the sequence; an LSTM network can exploit this temporal correlation, but the extracted features have a high dimension; and a convolutional neural network cannot process the speech sequence directly, so the speech signal must first be Fourier-transformed into a spectrum before being used as input. Traditional speech emotion recognition methods leave little room for further progress in feature extraction and classification, and existing deep-learning-based methods rely on a single network structure.
In summary, manual screening still dominates existing autism screening: it requires a large amount of manpower to collate data and involves a degree of subjectivity, so the screening results contain a certain error. Existing autism voice feature recognition technology merely converts the content of speech into text, which is suitable only for low-functioning autism subjects, not for high-functioning ones. In existing speech emotion recognition technology, moreover, most approaches use a Support Vector Machine (SVM) or a Hidden Markov Model (HMM) for speech recognition, but the accuracy of these models is not high and they are easily affected by noise.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a robot and method for assisted recognition of autism voice features, so as to help address the large errors and low efficiency of manual screening in existing autism screening, and to improve the robustness and accuracy of voice feature recognition.
To achieve the above and other objects, the present invention provides an autism voice feature assisted recognition robot, comprising:
the autism voice feature recognition model construction and training unit, which constructs an autism voice feature recognition model from a long short-term memory neural network and a convolutional neural network, inputs quantized voice features into the autism voice feature recognition model as sensing signals, learns how the voice features manifest themselves in the sensing signals, trains the autism voice feature recognition model by a back propagation method so that the network weights of the classifier are optimized, and finally obtains an autism voice feature recognition model that can be used for voice signal recognition;
the voice acquisition unit is used for acquiring voice information of the tested person in the interaction process of the robot and the tested person;
the voice information preprocessing unit is used for preprocessing the collected voice information and quantizing voice characteristics into M-dimensional voice characteristic vectors;
and the voice characteristic recognition unit is used for carrying out voice characteristic recognition on the voice signals acquired by the voice acquisition unit and processed by the voice information preprocessing unit by utilizing the trained autism voice characteristic recognition model.
Preferably, the autism voice feature recognition model is formed by an input layer, an LSTM network layer, a BN1 layer, a CNN network layer, a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer connected in sequence.
Preferably, the LSTM network is configured to process long speech sequences; it is formed by an LSTM1 layer and an LSTM2 layer connected in sequence, the activation functions of the LSTM1 layer and the LSTM2 layer are both Tanh, and the output of the LSTM network is a speech feature sequence.
Preferably, the LSTM1 layer and the LSTM2 layer of the LSTM network each comprise an output gate, an input gate and a forget gate, and the output information h_t is controlled by the parameters of these gates: the input gate i_t is determined by the current input data x_t and the cell output h_(t-1) of the previous time step, the forget gate f_t controls the transmission of historical information, and the output gate o_t controls the output of the cell state.
Preferably, the CNN network is a convolutional layer: a convolution operation is performed between the feature vector processed by the previous layer and the convolution kernel of the current layer, which enhances the features of the original signal and reduces noise, and the convolution result is finally given by the activation function.
Preferably, the CNN network is sequentially connected by a conv1D1 layer, a pooling layer, and a conv1D2 layer.
Preferably, the voice information preprocessing unit further includes:
the pre-emphasis processing module is used for pre-emphasizing an input voice signal;
the framing windowing module is used for segmenting the voice signal to analyze the characteristic parameters thereof and analyzing a characteristic parameter time sequence consisting of the characteristic parameters of each frame;
the fast Fourier transform module is used for obtaining a corresponding frequency spectrum for each frame of signal through fast Fourier transform;
the triangular band-pass filtering module is used for passing the spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filter banks to obtain the Mel spectrum;
the logarithmic energy calculating module is used for calculating the logarithmic energy of each frame signal so as to distinguish unvoiced sound and voiced sound and judge a silent section and a voiced section in each frame;
the discrete cosine transform module is used for substituting the calculated logarithmic energy into the discrete cosine transform formula to obtain the L-order Mel-frequency cepstral coefficients (MFCC) C(n).
Preferably, the digital filter through which the voice passes in the pre-emphasis is:
H(z) = 1 - μz^(-1)
wherein μ is the pre-emphasis coefficient and z is a complex number corresponding to the frequency of the speech signal;
the relation between the output of the pre-emphasis network and the input speech signal S(n) is:
y(n) = S(n) - a·S(n-1)
where a is also the pre-emphasis coefficient.
Preferably, the framing and windowing module is implemented by weighting with a movable finite-length window, and the windowed signal is:
S_W(n) = S(n)*w(n)
the window function being a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
in order to achieve the above purpose, the present invention also provides an autism voice feature auxiliary recognition method, comprising the following steps:
step S1, constructing an autism voice feature recognition model based on a long short-term memory neural network and a convolutional neural network, inputting quantized voice features into the autism voice feature recognition model as sensing signals, learning how the voice features manifest themselves in the sensing signals, training the autism voice feature recognition model by a back propagation method to optimize the network weights of the classifier, and finally obtaining an autism voice feature recognition model that can be used for voice signal recognition;
s2, collecting voice information of a tested person in the interaction process of the robot and the tested person;
step S3, preprocessing the collected voice information, and quantizing voice features into M-dimensional voice feature vectors;
and S4, performing voice feature recognition on the voice signals acquired in the step S2 and processed in the step S3 by using a trained autism voice feature recognition model.
Compared with the prior art, the invention provides a robot and a method for assisting in identifying autism voice features, in which an autism voice feature recognition model is designed using a long short-term memory (LSTM) network and a convolutional neural network (CNN). Autism voice features are collected and input into the model as sensing signals, the model learns how the autism voice features manifest themselves in the sensing signals, and the model is trained by a back propagation method so that the network weights of the classifier are optimized; the resulting autism voice feature recognition model is then used to recognize the voice signals of autism patients. This helps address the large errors and low efficiency of manual screening in the existing autism screening technology.
Drawings
FIG. 1 is a system architecture diagram of an autism speech feature assisted recognition robot of the present invention;
FIG. 2 is a schematic diagram of the structure of an autism speech feature recognition model constructed in an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of an autism speech feature assisted recognition method according to the present invention;
FIG. 4 is a schematic layout of an experimental site according to an embodiment of the present invention;
fig. 5 is a flow chart of an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the following disclosure, which describes the invention by way of specific embodiments with reference to the accompanying drawings. The invention may also be practised or carried out in other embodiments, within the scope and range of equivalents of the various features and advantages of the invention.
FIG. 1 is a diagram of a system architecture for an autism speech feature assisted recognition robot according to the present invention. As shown in fig. 1, the robot for assisting in recognition of autism voice features according to the present invention comprises:
The autism speech feature recognition model construction and training unit 101 constructs an autism speech feature recognition model using a long short-term memory (LSTM) neural network and a convolutional neural network (CNN), inputs quantized speech features into the model as sensing signals, learns how the speech features manifest themselves in the sensing signals (quantities describing the speech features, such as volume, pitch and the pause time in the speech, which appear in the sensing signals as particular values or particular sequence matrices), trains the autism speech feature recognition model by a back propagation method so that the network weights of the classifier are optimized, and finally obtains an autism speech feature recognition model for speech signal recognition.
In the invention, the autism voice feature recognition model is formed by an input layer, an LSTM network layer, a BN1 layer, a CNN network layer, a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer connected in sequence, as shown in FIG. 2.
The input layer is used to receive the quantized M-dimensional speech feature vector. In the present invention the input layer takes an N×N feature matrix, that is, the M-dimensional feature vector obtained by the quantization performed in the speech information preprocessing unit 103 is converted into an N×N feature matrix.
The LSTM network, an improved form of the traditional recurrent neural network, can store speech information over long periods; it is a neural network with a memory function and can model time-series data. In the invention, the LSTM network is formed by an LSTM1 layer and an LSTM2 layer connected in sequence, where the output dimension of the LSTM1 layer is 50, the output dimension of the LSTM2 layer is 30, and the activation functions are both Tanh. The LSTM network is used to process long speech sequences and outputs a 30-dimensional speech feature sequence.
Specifically, the LSTM1 layer and the LSTM2 layer of the LSTM network each comprise an output gate, an input gate and a forget gate, and the output information h_t is controlled by the parameters of these gates. Let x_t and h_t denote the input value and the output value of the LSTM network, respectively. The candidate memory cell information c̃_t at time t is calculated as:
c̃_t = tanh(W_c·[h_(t-1), x_t] + b_c)
The input gate i_t is determined by the current input data x_t and the cell output h_(t-1) of the previous time step, and is calculated as:
i_t = σ(W_i·[h_(t-1), x_t] + b_i)
The forget gate f_t controls the transmission of historical information and is calculated as:
f_t = σ(W_f·[h_(t-1), x_t] + b_f)
The output gate o_t is calculated as:
o_t = σ(W_o·[h_(t-1), x_t] + b_o)
where σ is the sigmoid function, W_c, W_i, W_f and W_o are weight matrices, and b_c, b_i, b_f and b_o are bias vectors; the cell state is updated as c_t = f_t·c_(t-1) + i_t·c̃_t and the cell output is h_t = o_t·tanh(c_t).
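For illustration only, the following is a minimal NumPy sketch of a single LSTM cell step using the standard gate formulation described above; the variable names, shapes and the packing of the four gates into one weight matrix are assumptions, not details taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps the concatenation [h_prev, x_t] to the four gate pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b        # all gate pre-activations at once
    i_t = sigmoid(z[0 * hidden:1 * hidden])          # input gate i_t
    f_t = sigmoid(z[1 * hidden:2 * hidden])          # forget gate f_t: controls history transfer
    o_t = sigmoid(z[2 * hidden:3 * hidden])          # output gate o_t
    c_tilde = np.tanh(z[3 * hidden:4 * hidden])      # candidate memory cell information
    c_t = f_t * c_prev + i_t * c_tilde               # updated cell state
    h_t = o_t * np.tanh(c_t)                         # cell output (the "output information")
    return h_t, c_t

# Example usage with illustrative sizes (input dimension 8, hidden dimension 4):
rng = np.random.default_rng(0)
x_t, h_prev, c_prev = rng.standard_normal(8), np.zeros(4), np.zeros(4)
W, b = rng.standard_normal((16, 12)), np.zeros(16)
h_t, c_t = lstm_cell_step(x_t, h_prev, c_prev, W, b)
```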
The CNN network, i.e., the convolutional layer, can be regarded as a smoothing (blurring) filter: it performs a convolution operation between the feature vector processed by the previous layer and the convolution kernel of the current layer, enhancing the features of the original signal and reducing noise, and the convolution result is finally given by the activation function. The convolutional layer can be described as:
Z(n) = x(n) * w(n)
z(n) = Σ_(l=0)^(L-1) x(n-l)·w(l)
where the signal x(n) is the 30-dimensional speech feature sequence output after the speech signal has passed through the two LSTM layers and one BN layer, w(n) is the convolution kernel, and the output z(n) of the convolutional layer is obtained by convolving the signal x(n) with the convolution kernel w(n) of size L.
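As a small illustration of the convolution Z(n) = x(n) * w(n) described above, the sketch below convolves an assumed 30-dimensional feature sequence with an assumed kernel of size L = 3 using NumPy.

```python
import numpy as np

# x: speech feature sequence as output by the LSTM/BN front end (random here, for illustration only)
x = np.random.randn(30)
# w: convolution kernel of size L = 3 (illustrative values)
w = np.array([0.25, 0.5, 0.25])

# z(n) = sum over l of x(n - l) * w(l); mode='same' keeps the output length equal to the input length
z = np.convolve(x, w, mode='same')
print(z.shape)  # (30,)
```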
In a specific embodiment of the present invention, the CNN network is formed by a conv1D1 layer, a pooling layer and a conv1D2 layer connected in sequence, where the conv1D1 layer has 512 filters with a convolution kernel size of 3, the conv1D2 layer has 256 filters with a convolution kernel size of 3, and the activation functions are all ReLU; the pool size of the max-pooling layer is 2; and the output of the CNN network is the filtered speech feature sequence.
The autism speech feature recognition model further includes a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer. The pooling layer is mainly used to remove redundant information, compress the features and reduce the complexity of the neural network; the BN2 layer is mainly used to accelerate the training and convergence of the network and to prevent overfitting; the Flatten layer is mainly used to flatten the input into one dimension; the fully connected layer is mainly used to classify the information; and the output layer outputs the sequence coming from the fully connected layer. Since these layers are not the focus of the present invention and are implemented in the same way as in the prior art, they are not described in detail here.
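The layer stack described above can be sketched, for illustration only, with a Keras-style model definition. The layer order and the LSTM/Conv1D hyperparameters follow the description; the input size N, the dropout rate, the width of the fully connected layer and the six-class softmax output are assumptions.

```python
from tensorflow.keras import layers, models

N = 32           # assumed side length of the N x N input feature matrix
NUM_CLASSES = 6  # assumed number of output emotion categories

model = models.Sequential([
    layers.Input(shape=(N, N)),                                   # input layer: N x N feature matrix
    layers.LSTM(50, activation='tanh', return_sequences=True),    # LSTM1, output dimension 50
    layers.LSTM(30, activation='tanh', return_sequences=True),    # LSTM2, output dimension 30
    layers.BatchNormalization(),                                  # BN1 layer
    layers.Conv1D(512, 3, activation='relu', padding='same'),     # conv1D1: 512 filters, kernel size 3
    layers.MaxPooling1D(pool_size=2),                             # pooling inside the CNN block
    layers.Conv1D(256, 3, activation='relu', padding='same'),     # conv1D2: 256 filters, kernel size 3
    layers.MaxPooling1D(pool_size=2),                             # pooling layer, pool size 2
    layers.BatchNormalization(),                                  # BN2 layer
    layers.Flatten(),                                             # Flatten layer
    layers.Dropout(0.5),                                          # dropout layer (rate assumed)
    layers.Dense(64, activation='relu'),                          # fully connected layer (width assumed)
    layers.Dense(NUM_CLASSES, activation='softmax'),              # output layer
])
model.summary()
```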
After the autism speech feature recognition model is established, the quantized speech features obtained through the speech acquisition unit 102 and the speech information preprocessing unit 103 are input into the model as sensing signals, the model learns how the speech features manifest themselves in the sensing signals, and the model is trained by a back propagation method so that the network weights of the classifier are optimized, finally yielding an autism speech feature recognition model that can be used for speech signal recognition.
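Continuing the sketch above, the back-propagation training step could look as follows; the optimizer, loss, batch size, epoch count and the placeholder data are illustrative assumptions rather than choices specified by the patent.

```python
import numpy as np

# Placeholder data for illustration only; in practice X_train comes from the speech
# acquisition and preprocessing units, and y_train from expert annotations.
X_train = np.random.randn(200, N, N).astype('float32')                  # quantized speech features
y_train = np.eye(NUM_CLASSES)[np.random.randint(0, NUM_CLASSES, 200)]   # one-hot emotion labels

model.compile(optimizer='adam',                  # gradient-based optimizer (assumed choice)
              loss='categorical_crossentropy',   # classification loss (assumed choice)
              metrics=['accuracy'])
# Back propagation adjusts the network weights so that the classifier is optimized.
model.fit(X_train, y_train, validation_split=0.2, batch_size=32, epochs=50)
```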
The voice acquisition unit 102 is used for acquiring voice information of the tested person in the interaction process of the robot and the tested person.
In a specific embodiment of the present invention, the voice acquisition unit 102 may acquire voice information during the interactive screening between the robot and the tested person through a microphone built into the robot or wearable microphones on the evaluator and the tested person. In the invention, the robot acts as the main agent in the screening process and has a humanoid form; it attracts the interest of the tested person by performing songs and dances and guides the tested person to produce as much speech as possible.
The voice information preprocessing unit 103 is configured to preprocess the collected voice information, and quantize voice features into M-dimensional voice feature vectors.
Specifically, the voice information preprocessing unit 103 further includes:
and the pre-emphasis processing module is used for pre-emphasizing the input voice signal.
In the specific embodiment of the invention, the pre-emphasis is implemented with a digital filter, and the digital filter through which the voice passes in the pre-emphasis is:
H(z) = 1 - μz^(-1)
where μ is the pre-emphasis coefficient and z is a complex number corresponding to the frequency of the speech signal.
The relation between the output of the pre-emphasis network and the input speech signal S(n) is:
y(n) = S(n) - a·S(n-1)
where a is also the pre-emphasis coefficient.
And the framing and windowing module is used for segmenting the voice signal to analyze the characteristic parameters thereof and analyzing a characteristic parameter time sequence consisting of the characteristic parameters of each frame.
In the specific embodiment of the invention, framing and windowing are implemented by weighting with a movable finite-length window, that is, the signal s(n) is multiplied by a window function w(n), and the windowed signal is:
S_W(n) = S(n)*w(n)
The invention uses a Hamming window, whose window function is:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
The fast Fourier transform module is used to obtain the corresponding spectrum of each frame signal by the fast Fourier transform (FFT). Specifically, after the voice signal s(n) has been multiplied by the Hamming window in the framing and windowing module, each frame is further subjected to a fast Fourier transform to obtain its energy distribution over the frequency spectrum, that is, its corresponding spectrum.
The triangular band-pass filtering module is used to pass the spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filter banks to obtain the Mel spectrum. In the invention, a filter bank of M filters is defined, the filters used are triangular filters, and M usually takes 22-26. The purpose of the triangular band-pass filtering is to smooth the spectrum, eliminate harmonics and highlight the formants of the original speech.
The logarithmic energy calculating module is used to calculate the logarithmic energy of each frame signal so as to distinguish unvoiced and voiced sounds and to identify the silent and voiced sections of each frame. Here the logarithmic energy corresponds to the volume; it is computed as the sum of the squares of the signal within one frame, taking the base-10 logarithm and multiplying by 10, so that the basic speech features of each frame gain one more dimension.
The logarithmic energy s(m) of the output of the m-th Mel filter is:
s(m) = ln( Σ_(k=0)^(N-1) |X_a(k)|²·H_m(k) ), 0 ≤ m ≤ M
where X_a(k) is the FFT of the frame signal and H_m(k) is the frequency response of the m-th triangular filter.
The Discrete Cosine Transform (DCT) module is used to substitute the logarithmic energy into the discrete cosine transform formula to obtain the L-order Mel-frequency cepstral coefficients (MFCC) C(n). L is the order of the speech feature, typically 12-16, and M is the number of triangular filters. The discrete cosine transform formula is:
C(n) = Σ_(m=1)^(M) s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, …, L
where C(n) is the final required speech feature, i.e. the M-dimensional feature vector that is subsequently converted into the N×N feature matrix.
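The preprocessing chain described above (pre-emphasis, framing with a Hamming window, FFT, Mel-scale triangular filtering, log energy and DCT) can be sketched compactly as follows. The sample rate, frame length, hop length, μ, M and L values are illustrative choices within the ranges mentioned above, not values fixed by the patent.

```python
import numpy as np
from scipy.fftpack import dct

def extract_mfcc(signal, sr=16000, frame_len=400, hop=160, mu=0.97, n_fft=512, M=24, L=13):
    """Pre-emphasis -> framing/Hamming window -> FFT -> Mel filter bank -> log energy -> DCT."""
    # 1. Pre-emphasis: y(n) = s(n) - mu * s(n-1)
    y = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # 2. Framing and Hamming windowing
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)

    # 3. FFT -> power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # 4. Mel-scale triangular filter bank with M filters
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), M + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((M, n_fft // 2 + 1))
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # 5. Log energy s(m) of each Mel filter output
    log_energy = np.log(power @ fbank.T + 1e-10)

    # 6. DCT -> first L Mel-frequency cepstral coefficients per frame
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :L]

# Usage (assumes 'audio' is a 1-D float array longer than one frame):
# mfcc = extract_mfcc(audio)   # shape: (number of frames, L)
```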
The voice feature recognition unit 104 is configured to perform voice feature recognition on the voice signal acquired by the voice acquisition unit 102 and processed by the voice information preprocessing unit 103, using the trained autism voice feature recognition model. In a specific embodiment of the present invention, the emotion classification results output by the autism speech feature recognition model include, but are not limited to: happiness, anger, fear, sadness, surprise and neutral.
FIG. 3 is a flowchart illustrating steps of an autism speech feature assisted recognition method according to the present invention. As shown in fig. 3, the method for assisting in identifying the autism voice features comprises the following steps:
step S1, constructing an autism voice feature recognition model based on a long short memory neural network (LSTM) and a Convolutional Neural Network (CNN), inputting quantized voice features serving as sensing signals into the autism voice feature recognition model, learning the performance features of the voice features in the sensing signals (such as the volume, the tone, the pause time of voice during the period and the like of the quantity describing the voice features, which can be expressed as a certain value or a certain specific sequence matrix in the sensing signals), training the autism voice feature recognition model by using a back propagation method, optimizing the network weights of a classifier, and finally obtaining the autism voice feature recognition model capable of being used for voice signal recognition.
In the invention, the autism voice feature recognition model is formed by an input layer, an LSTM network layer, a BN1 layer, a CNN network layer, a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer connected in sequence.
The input layer is used to receive the quantized M-dimensional speech feature vector. In the present invention the input layer takes an N×N feature matrix, that is, the M-dimensional feature vector obtained by the quantization performed in the speech information preprocessing unit 103 is converted into an N×N feature matrix.
The LSTM network, an improved form of the traditional recurrent neural network, can store speech information over long periods; it is a neural network with a memory function and can model time-series data. In the invention, the LSTM network is formed by an LSTM1 layer and an LSTM2 layer connected in sequence, where the output dimension of the LSTM1 layer is 50, the output dimension of the LSTM2 layer is 30, and the activation functions are both Tanh. The LSTM network is used to process long speech sequences and outputs a 30-dimensional speech feature sequence.
Specifically, the LSTM1 layer and the LSTM2 layer of the LSTM network each comprise an output gate, an input gate and a forget gate, and the output information h_t is controlled by the parameters of these gates. Let x_t and h_t denote the input value and the output value of the LSTM network, respectively. The candidate memory cell information c̃_t at time t is calculated as:
c̃_t = tanh(W_c·[h_(t-1), x_t] + b_c)
The input gate i_t is determined by the current input data x_t and the cell output h_(t-1) of the previous time step, and is calculated as:
i_t = σ(W_i·[h_(t-1), x_t] + b_i)
The forget gate f_t controls the transmission of historical information and is calculated as:
f_t = σ(W_f·[h_(t-1), x_t] + b_f)
The output gate o_t is calculated as:
o_t = σ(W_o·[h_(t-1), x_t] + b_o)
where σ is the sigmoid function, W_c, W_i, W_f and W_o are weight matrices, and b_c, b_i, b_f and b_o are bias vectors; the cell state is updated as c_t = f_t·c_(t-1) + i_t·c̃_t and the cell output is h_t = o_t·tanh(c_t).
The CNN network, i.e., the convolutional layer, can be regarded as a smoothing (blurring) filter: it performs a convolution operation between the feature vector processed by the previous layer and the convolution kernel of the current layer, enhancing the features of the original signal and reducing noise, and the convolution result is finally given by the activation function. The convolutional layer can be described as:
Z(n) = x(n) * w(n)
z(n) = Σ_(l=0)^(L-1) x(n-l)·w(l)
where the signal x(n) is the 30-dimensional speech feature sequence output after the speech signal has passed through the two LSTM layers and one BN layer, w(n) is the convolution kernel, and the output z(n) of the convolutional layer is obtained by convolving the signal x(n) with the convolution kernel w(n) of size L.
In a specific embodiment of the present invention, the CNN network is formed by a conv1D1 layer, a pooling layer and a conv1D2 layer connected in sequence, where the conv1D1 layer has 512 filters with a convolution kernel size of 3, the conv1D2 layer has 256 filters with a convolution kernel size of 3, and the activation functions are all ReLU; the pool size of the max-pooling layer is 2; and the output of the CNN network is the filtered speech feature sequence.
The autism speech feature recognition model further includes a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer. The pooling layer is mainly used to remove redundant information, compress the features and reduce the complexity of the neural network; the BN2 layer is mainly used to accelerate the training and convergence of the network and to prevent overfitting; the Flatten layer is mainly used to flatten the input into one dimension; the fully connected layer is mainly used to classify the information; and the output layer outputs the sequence coming from the fully connected layer. Since these layers are not the focus of the present invention and are implemented in the same way as in the prior art, they are not described in detail here.
After the autism speech feature recognition model is established, the quantized speech features obtained through the speech acquisition unit 102 and the speech information preprocessing unit 103 are input into the model as sensing signals, the model learns how the speech features manifest themselves in the sensing signals, and the model is trained by a back propagation method so that the network weights of the classifier are optimized, finally yielding an autism speech feature recognition model that can be used for speech signal recognition.
And S2, collecting voice information of the tested person in the interaction process of the robot and the tested person.
In the specific embodiment of the invention, the voice information during the interactive screening between the robot and the tested person may be acquired through a microphone built into the robot or wearable microphones on the evaluator and the tested person. In the invention, the robot acts as the main agent in the screening process and has a humanoid form; it attracts the interest of the tested person by performing songs and dances and guides the tested person to produce as much speech as possible.
And S3, preprocessing the collected voice information, and quantizing the voice characteristic into an M-dimensional voice characteristic vector.
Specifically, step S3 further includes:
step S300, pre-emphasis is performed on the input speech signal.
In the specific embodiment of the invention, the pre-emphasis is implemented with a digital filter, and the digital filter through which the voice passes in the pre-emphasis is:
H(z) = 1 - μz^(-1)
where μ is the pre-emphasis coefficient and z is a complex number corresponding to the frequency of the speech signal.
The relation between the output of the pre-emphasis network and the input speech signal S(n) is:
y(n) = S(n) - a·S(n-1)
where a is also the pre-emphasis coefficient.
In step S301, the speech signal is segmented to analyze the characteristic parameters thereof, and a time series of characteristic parameters consisting of the characteristic parameters of each frame is analyzed.
In the specific embodiment of the invention, framing and windowing are implemented by weighting with a movable finite-length window, that is, the signal s(n) is multiplied by a window function w(n), and the windowed signal is:
S_W(n) = S(n)*w(n)
The invention uses a Hamming window, whose window function is:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
in step S302, for each frame signal, a corresponding spectrum is obtained by Fast Fourier Transform (FFT). Specifically, after the voice signal S (n) is multiplied by the hamming window in the frame windowing of step S301, each frame must be further subjected to a fast fourier transform to obtain an energy distribution over the frequency spectrum, that is, to obtain a corresponding frequency spectrum.
Step S303, the spectrum obtained by the fast Fourier transform is passed through a set of Mel-scale triangular filter banks to obtain the Mel spectrum. In the invention, a filter bank of M filters is defined, the filters used are triangular filters, and M usually takes 22-26. The purpose of the triangular band-pass filtering is to smooth the spectrum, eliminate harmonics and highlight the formants of the original speech.
Step S304, the logarithmic energy of each frame signal is calculated so as to distinguish unvoiced and voiced sounds and to identify the silent and voiced sections of each frame. Here the logarithmic energy corresponds to the volume; it is computed as the sum of the squares of the signal within one frame, taking the base-10 logarithm and multiplying by 10, so that the basic speech features of each frame gain one more dimension.
The logarithmic energy s(m) of the output of the m-th Mel filter is:
s(m) = ln( Σ_(k=0)^(N-1) |X_a(k)|²·H_m(k) ), 0 ≤ m ≤ M
where X_a(k) is the FFT of the frame signal and H_m(k) is the frequency response of the m-th triangular filter.
In step S305, the logarithmic energy is substituted into the discrete cosine transform formula to obtain the L-order Mel-frequency cepstral coefficients (MFCC) C(n). L is the order of the speech feature, typically 12-16, and M is the number of triangular filters. The discrete cosine transform formula is:
C(n) = Σ_(m=1)^(M) s(m)·cos( πn(m - 0.5)/M ), n = 1, 2, …, L
where C(n) is the final required speech feature, i.e. the M-dimensional feature vector that is subsequently converted into the N×N feature matrix.
Step S4, performing voice feature recognition on the voice signals acquired in step S2 and quantized in step S3 using the trained autism voice feature recognition model. In a specific embodiment of the present invention, the emotion classification results output by the autism speech feature recognition model include, but are not limited to: happiness, anger, fear, sadness, surprise and neutral.
Examples
Fig. 4 is a schematic layout diagram of the experimental site in an embodiment of the invention. As shown in the scene of FIG. 4, the robot is designed as a humanoid robot. The experimental scene contains one tested person, one evaluator and one humanoid robot; the humanoid robot is placed on a desktop at the test site with its front facing the tested person, and the humanoid robot and the tested person face each other at a distance of 0.7-1 meter.
As shown in fig. 5, the process flow of the present embodiment is as follows:
step S1, man-machine interaction is performed, and the whole process is mainly participated by a humanoid robot.
Step S1.1, the humanoid robot gives a brief self-introduction to the tested person, while the operation of the related equipment is checked.
Step S1.2, the humanoid robot asks the tested person simple questions, such as "Hello, I am the XXX robot. May I ask your name?" and the like.
Step S1.3, the humanoid robot performs a song for the tested person. For a suspected low-functioning autism subject, the evaluator can issue the corresponding voice instruction to the robot to trigger the performance; for a suspected high-functioning autism subject, the evaluator can give the tested person some guidance so that the instruction is triggered by the tested person's own voice. The evaluator may make relevant records by observing the tested person's responses on site.
Step S1.4, the humanoid robot performs a dance for the tested person. For a suspected low-functioning autism subject, the evaluator can issue the corresponding voice instruction to the humanoid robot to trigger the performance; for a suspected high-functioning autism subject, the evaluator can give the tested person some guidance so that the instruction is triggered by the tested person's own voice. The evaluator may make relevant records by observing the tested person's responses on site.
Step S2, data acquisition. During the interaction, the microphone built into the humanoid robot and the wearable microphones on the tested person and the evaluator record throughout the whole process. The recording files stored in the humanoid robot system are then retrieved to the PC through the WinSCP software.
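For the data acquisition step, a minimal recording sketch using the sounddevice and soundfile Python packages is shown below; the sample rate, recording duration and file name are assumptions, and the actual robot platform may expose its microphone through a different interface.

```python
import sounddevice as sd   # microphone recording (assumed library choice)
import soundfile as sf     # WAV file writing (assumed library choice)

SR = 16000        # sample rate in Hz (assumed)
DURATION = 30     # seconds recorded per interaction segment (assumed)

audio = sd.rec(int(DURATION * SR), samplerate=SR, channels=1)  # record from the default microphone
sd.wait()                                                      # block until the recording finishes
sf.write('session_001.wav', audio, SR)                         # stored, later retrieved to the PC
```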
Step S3, preprocessing, and carrying out relevant processing on the voice on the PC side.
Step S3.1, pre-emphasis is implemented with a digital filter; the relation between the output of the pre-emphasis network and the input voice signal S(n) is:
y(n) = S(n) - a·S(n-1)
where a is the pre-emphasis coefficient.
and S3.2, framing, namely segmenting the voice signal to analyze the characteristic parameters, and analyzing a characteristic parameter time sequence consisting of the characteristic parameters of each frame.
Step S3.3, windowing emphasizes the speech waveform around sample n and attenuates the rest of the waveform.
In step S3.4, after the framed and windowed speech signal S(n) has been multiplied by the Hamming window, each frame is subjected to a fast Fourier transform to obtain its energy distribution over the frequency spectrum, i.e. the corresponding spectrum.
Step S5, feature quantization, including triangular bandpass filtering, log energy calculation and discrete cosine transformation, is described in the specification, and will not be repeated in this section.
And S6, recognition analysis, namely recognizing the voice characteristics of the tested person by using the autism voice characteristic recognition model.
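Tying the sketches above together, a hypothetical end-to-end recognition step could load a recorded file, extract the quantized features, reshape them to the N x N input and query the trained model; the file name, label list and zero-padding strategy are illustrative assumptions.

```python
import numpy as np
import soundfile as sf

EMOTIONS = ['happiness', 'anger', 'fear', 'sadness', 'surprise', 'neutral']  # assumed label order

audio, sr = sf.read('session_001.wav')   # recording collected during the interaction (name assumed)
mfcc = extract_mfcc(audio, sr=sr)        # preprocessing/quantization sketch from above

# Pad or crop the feature matrix to the N x N input expected by the model (strategy assumed).
feat = np.zeros((N, N), dtype='float32')
rows, cols = min(N, mfcc.shape[0]), min(N, mfcc.shape[1])
feat[:rows, :cols] = mfcc[:rows, :cols]

probs = model.predict(feat[None, ...])[0]   # trained autism speech feature recognition model
print('predicted speech feature class:', EMOTIONS[int(np.argmax(probs))])
```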
And S7, preprocessing all the voice files in the experimental process and storing the voice files into a data set folder.
And S7.1, extracting the voice characteristics in the updated data set again.
And S7.2, training the voice characteristic recognition model again, and continuously adjusting the model structure according to the training result.
In summary, the present invention provides a robot and a method for assisting in identifying autism voice features, in which an autism voice feature recognition model is designed using a long short-term memory (LSTM) network and a convolutional neural network (CNN). Autism voice features are collected and input into the model as sensing signals, the model learns how the autism voice features manifest themselves in the sensing signals, and the model is trained by a back propagation method so that the network weights of the classifier are optimized; the resulting autism voice feature recognition model is then used to recognize the voice signals of autism patients, which helps address the large errors and low efficiency of manual screening in the existing autism screening technology.
The above embodiments are merely illustrative of the principles of the present invention and its effectiveness, and are not intended to limit the invention. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. Accordingly, the scope of the invention is to be indicated by the appended claims.

Claims (8)

1. An autism voice feature assisted recognition robot, comprising:
the autism voice feature recognition model construction and training unit, which constructs an autism voice feature recognition model from a long short-term memory neural network and a convolutional neural network, inputs quantized voice features into the autism voice feature recognition model as sensing signals, learns how the voice features manifest themselves in the sensing signals, trains the autism voice feature recognition model by a back propagation method so that the network weights of the classifier are optimized, and finally obtains an autism voice feature recognition model that can be used for voice signal recognition;
the voice acquisition unit is used for acquiring voice information of the tested person in the interaction process of the robot and the tested person;
the voice information preprocessing unit is used for preprocessing the collected voice information and quantizing voice characteristics into M-dimensional voice characteristic vectors;
the voice feature recognition unit is used for carrying out voice feature recognition on the voice signals acquired by the voice acquisition unit and processed by the voice information preprocessing unit by utilizing the trained autism voice feature recognition model;
the autism voice feature recognition model is formed by an input layer, an LSTM network layer, a BN1 layer, a CNN network layer, a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer connected in sequence;
the LSTM network is used for processing long speech sequences and is formed by an LSTM1 layer and an LSTM2 layer connected in sequence, the activation functions of the LSTM1 layer and the LSTM2 layer are both Tanh, and the output of the LSTM network is a speech feature sequence; the output dimension of the LSTM1 layer is 50 and the output dimension of the LSTM2 layer is 30.
2. An autism voice feature assisted recognition robot as in claim 1, wherein: the LSTM1 layer and the LSTM2 layer of the LSTM network each comprise an output gate, an input gate and a forget gate, and the parameters of these gates control the output information h_t; the input gate i_t is determined by the current input data x_t and the cell output h_(t-1) of the previous time step, the forget gate f_t controls the transmission of historical information, and the output gate o_t controls the output of the cell state.
3. An autism voice feature assisted recognition robot as in claim 2, wherein: the CNN network is a convolutional layer in which a convolution operation is performed between the feature vector processed by the previous layer and the convolution kernel of the current layer, enhancing the features of the original signal and reducing noise, and the convolution result is finally given by the activation function.
4. An autism voice feature assisted recognition robot as in claim 3, wherein: the CNN network is formed by a conv1D1 layer, a pooling layer and a conv1D2 layer connected in sequence.
5. The autism speech feature assisted recognition robot of claim 4, wherein the speech information preprocessing unit further comprises:
the pre-emphasis processing module is used for pre-emphasizing an input voice signal;
the framing windowing module is used for segmenting the voice signal to analyze the characteristic parameters thereof and analyzing a characteristic parameter time sequence consisting of the characteristic parameters of each frame;
the fast Fourier transform module is used for obtaining a corresponding frequency spectrum for each frame of signal through fast Fourier transform;
the triangular band-pass filtering module is used for passing the spectrum obtained by the fast Fourier transform through a set of Mel-scale triangular filter banks to obtain the Mel spectrum;
the logarithmic energy calculating module is used for calculating the logarithmic energy of each frame signal so as to distinguish unvoiced sound and voiced sound and judge a silent section and a voiced section in each frame;
the discrete cosine transform module is used for substituting the calculated logarithmic energy into the discrete cosine transform formula to obtain the L-order Mel-frequency cepstral coefficients (MFCC) C(n).
6. The autism voice feature assisted recognition robot of claim 5, wherein the digital filter through which the voice passes in the pre-emphasis is:
H(z) = 1 - μz^(-1)
wherein μ is the pre-emphasis coefficient and z is a complex number corresponding to the frequency of the speech signal;
the relation between the output of the pre-emphasis network and the input speech signal S(n) is:
y(n) = S(n) - a·S(n-1)
where a is also the pre-emphasis coefficient.
7. The autism speech feature assisted recognition robot of claim 6, wherein the framing and windowing module is implemented by weighting with a movable finite-length window, the windowed signal being:
S_W(n) = S(n)*w(n)
and the window function being a Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
8. An autism voice feature assisted recognition method, comprising the following steps:
step S1, constructing an autism voice feature recognition model based on a long short-term memory neural network and a convolutional neural network, inputting quantized voice features into the autism voice feature recognition model as sensing signals, learning how the voice features manifest themselves in the sensing signals, training the autism voice feature recognition model by a back propagation method to optimize the network weights of the classifier, and finally obtaining an autism voice feature recognition model that can be used for voice signal recognition;
s2, collecting voice information of a tested person in the interaction process of the robot and the tested person;
step S3, preprocessing the collected voice information, and quantizing voice features into M-dimensional voice feature vectors;
step S4, voice characteristic recognition is carried out on the voice signals acquired in the step S2 and processed in the step S3 by using a trained autism voice characteristic recognition model;
the autism voice feature recognition model is formed by an input layer, an LSTM network layer, a BN1 layer, a CNN network layer, a pooling layer, a BN2 layer, a Flatten layer, a dropout layer, a fully connected layer and an output layer connected in sequence;
the LSTM network is used for processing long speech sequences and is formed by an LSTM1 layer and an LSTM2 layer connected in sequence, the activation functions of the LSTM1 layer and the LSTM2 layer are both Tanh, and the output of the LSTM network is a speech feature sequence; the output dimension of the LSTM1 layer is 50 and the output dimension of the LSTM2 layer is 30.
CN202011016520.3A 2020-09-24 2020-09-24 Robot and method for assisting in identifying autism voice features Active CN112259126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011016520.3A CN112259126B (en) 2020-09-24 2020-09-24 Robot and method for assisting in identifying autism voice features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011016520.3A CN112259126B (en) 2020-09-24 2020-09-24 Robot and method for assisting in identifying autism voice features

Publications (2)

Publication Number Publication Date
CN112259126A CN112259126A (en) 2021-01-22
CN112259126B true CN112259126B (en) 2023-06-20

Family

ID=74231240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011016520.3A Active CN112259126B (en) 2020-09-24 2020-09-24 Robot and method for assisting in identifying autism voice features

Country Status (1)

Country Link
CN (1) CN112259126B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network
CN108053841A (en) * 2017-10-23 2018-05-18 平安科技(深圳)有限公司 The method and application server of disease forecasting are carried out using voice
CN108597539B (en) * 2018-02-09 2021-09-03 桂林电子科技大学 Speech emotion recognition method based on parameter migration and spectrogram
CN109192221A (en) * 2018-03-30 2019-01-11 大连理工大学 It is a kind of that phonetic decision Parkinson severity detection method is used based on cluster
US11545173B2 (en) * 2018-08-31 2023-01-03 The Regents Of The University Of Michigan Automatic speech-based longitudinal emotion and mood recognition for mental health treatment

Also Published As

Publication number Publication date
CN112259126A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN103065629A (en) Speech recognition system of humanoid robot
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
Wang et al. Recognition of audio depression based on convolutional neural network and generative antagonism network model
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
CN116741148A (en) Voice recognition system based on digital twinning
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Whitehill et al. Whosecough: In-the-wild cougher verification using multitask learning
CN113129908B (en) End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion
Jiang et al. Speech emotion recognition method based on improved long short-term memory networks
CN113380418A (en) System for analyzing and identifying depression through dialog text
CN116965819A (en) Depression recognition method and system based on voice characterization
CN112259126B (en) Robot and method for assisting in identifying autism voice features
Ma et al. A percussion method with attention mechanism and feature aggregation for detecting internal cavities in timber
CN114299995A (en) Language emotion recognition method for emotion assessment
CN115206288A (en) Cross-channel language identification method and system
Anindya et al. Development of Indonesian speech recognition with deep neural network for robotic command
CN113571050A (en) Voice depression state identification method based on Attention and Bi-LSTM
CN112735477A (en) Voice emotion analysis method and device
Kayal et al. Multilingual vocal emotion recognition and classification using back propagation neural network
Estrebou et al. Voice recognition based on probabilistic SOM
Nair et al. Transfer learning for speech based emotion recognition
Singh A text independent speaker identification system using ANN, RNN, and CNN classification technique

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant