CN112308037A - Facial paralysis detection method based on visual perception and audio information - Google Patents

Facial paralysis detection method based on visual perception and audio information

Info

Publication number
CN112308037A
Authority
CN
China
Prior art keywords
mouth
sequence
audio information
motion
facial paralysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011340158.5A
Other languages
Chinese (zh)
Inventor
陈永宁
袁梦
刘亚刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Suyi Electronic Technology Co ltd
Original Assignee
Zhengzhou Suyi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Suyi Electronic Technology Co ltd filed Critical Zhengzhou Suyi Electronic Technology Co ltd
Priority to CN202011340158.5A priority Critical patent/CN112308037A/en
Publication of CN112308037A publication Critical patent/CN112308037A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial paralysis detection method based on visual perception and audio information, addressing the lack of a fast and simple way to detect and diagnose facial nerve paralysis. The method comprises the following steps: collecting RGB and depth images; obtaining an RGB image I1 and a depth map I2 of the face region; obtaining the 2D and 3D key points of the face; guiding the person under test to read displayed text; defining characteristic values that reflect mouth movement to obtain a mouth motion feature sequence; collecting audio information, performing framing and windowing, extracting the Mel-frequency cepstral coefficients of each frame, and correcting the mouth motion sequence; performing feature extraction on the corrected mouth motion sequence Cnew to obtain a motion similarity index S; analyzing the frequency distribution of the audio to obtain a speech clarity D; and combining S and D to obtain the facial paralysis detection result. The method corrects the mouth motion feature sequence with the audio information, giving higher accuracy.

Description

Facial paralysis detection method based on visual perception and audio information
Technical Field
The invention relates to the technical field of machine vision image processing, in particular to a facial paralysis detection method based on visual perception and audio information.
Background
Facial paralysis is a disease characterized mainly by dysfunction of the facial expression muscles; its typical symptoms are facial distortion and uncoordinated facial expression. Because they cannot control the facial muscles, facial paralysis patients often lack flexibility in their facial movements and may present other symptoms including drooling, speech problems, and nasal congestion.
In the prior art, facial paralysis detection usually has the person under test perform a specified action and assesses how well that action is completed. However, when a person deliberately performs a designated action, subjective awareness is strong and the detection result contains errors. Moreover, detection based on only a few actions cannot finely grade the degree of facial paralysis. To address these problems, the present invention provides a facial paralysis detection method based on visual perception and audio information.
Disclosure of Invention
The invention overcomes the lack of a fast and simple way to detect and diagnose facial paralysis in the prior art by providing a facial paralysis detection method based on visual perception and audio information.
The technical solution of the present invention is a facial paralysis detection method based on visual perception and audio information, comprising the following steps. Step one: an RGBD camera is arranged above a display screen to collect an RGB image and a depth image, the camera views the face head-on, and the audio of the person under test is collected. Step two: the RGB image is used to obtain the face-region RGB image I1 and depth map I2. Step three: a key-point detection network is applied to image I1 to obtain the 2D landmarks of the face, which are combined with the depth map I2 to obtain the 3D landmarks. Step four: specified text is displayed on the screen and the person under test is guided to read it. Step five: the mouth movement of the person while reading is collected, and characteristic values reflecting mouth movement are defined to obtain a mouth motion feature sequence. Step six: simultaneously with step five, audio is collected, framed and windowed, the Mel-frequency cepstral coefficients of each frame are extracted, and the mouth motion sequence is corrected. Step seven: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with a standard mouth motion feature sequence to obtain a motion similarity index S. Step eight: the frequency distribution of the audio is analysed to obtain the speech clarity D of the person under test. Step nine: the motion similarity index S and the speech clarity D are combined to obtain the facial paralysis detection result.
Preferably, the second step comprises the following steps:
step 2.1: the collected RGB image is fed into a semantic segmentation network to obtain a Mask of the face region; multiplying the Mask with the collected RGB image yields the face-region image I1, and multiplying it with the depth image yields the face-region depth map I2;
step 2.2: the semantic segmentation network is trained on images collected by the camera; the training set is manually labeled, with face-region pixels labeled 1 and all other pixels labeled 0; the network is trained on the labeled data with a cross-entropy loss, continuously updating the model parameters, and the implementer may use a segmentation network such as U-Net or DeepLabv3+;
step 2.3: at inference time, the collected RGB image is fed directly into the segmentation network; the output Mask is a binary image in which face-region pixels have value 1 and all other pixels have value 0.
Preferably, the third step comprises the following steps:
face key-point detection is performed on the image I1 to obtain its 2D landmarks, with an OpenFace or DAN network selected according to the actual situation; because the depth map I2 is pixel-aligned with the face RGB image I1, the depth value corresponding to each 2D landmark point is obtained directly, giving the 3D landmarks of the face.
Preferably, the step five comprises the following steps:
the nose-tip key point is taken as the root node and denoted P0, the left and right mouth-corner key points are denoted P1 and P2, and the centre points of the upper and lower lips are denoted P3 and P4; with the root node P0 as the starting point, the remaining points are connected to obtain vectors in four directions; the change of the angle between P0P1 and P0P2 reflects the change of the mouth corners, and this angle is denoted α1; the difference of the two vector lengths reflects the degree of mouth skew and is denoted σ = | |P0P1| − |P0P2| |; the angle between P0P3 and P0P4 reflects the motion of the upper and lower lips and is denoted α2; the characteristic values of mouth movement are thus determined as {α1, α2, σ}; the three characteristic values are extracted at each sample, giving a sequence of shape [3, N], where N is the number of times motion information is acquired, written C = {c1, c2, c3, ..., cN}, where each c is a 3-row, 1-column vector containing the three mouth-movement characteristic values.
Preferably, the method for acquiring the MFCC in the sixth step includes the following steps:
a) pre-emphasis is applied to the collected audio signal; the pre-emphasis filter is H(z) = 1 − μz⁻¹, where z denotes the collected speech signal and μ is usually 0.97;
b) based on the short-time stationarity of the audio, a framing operation is applied: the period of the audio signal is determined with the short-time autocorrelation function, the length of the time segment for a single frame is determined from this period, and framing yields the audio signal of each period;
c) a fast Fourier transform is applied to the audio signal of each period to obtain the corresponding linear spectrum X[m] = H[m]E[m], which is processed with a Mel filter bank to obtain a nonlinear Mel spectrum reflecting human auditory characteristics; the base-10 logarithm of the magnitudes is taken on both sides, giving the Mel log-spectrum log||X[m]|| = log||H[m]|| + log||E[m]||, and the inverse Fourier transform is then applied to both sides, i.e. x[m] = h[m] + e[m]; the inverse Fourier transform is realised with the discrete cosine transform, and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients; the MFCC coefficients of each frame reflect the features of that frame of speech, and the corrected mouth motion feature sequence is obtained as Cnew = {c1, c2, c3, ..., cM}, where M is the number of acquisitions of corrected mouth motion information.
Preferably, the seventh step comprises the steps of:
step 1: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with the standard mouth motion feature sequence to obtain the motion similarity index S;
step 2: the feature values of the mouth motion sequence Cnew are extracted and used to construct a feature matrix H, as follows:
a) each value c in the sequence Cnew has three characteristic values α1, α2 and σ; the characteristic value σ reflects the degree of mouth-corner skew of the person under test: the larger σ is, the more skewed the mouth is and the higher the probability that the person is a facial paralysis patient, so the mouth motion information at that moment is treated as key information and should receive a larger weight; an empirical threshold σ0 is set, and when σi (the value of σ in the i-th acquisition) exceeds σ0, the mouth skew is considered abnormal and the corresponding motion information is assigned a larger weight wi;
b) using the assigned weights, the weighted means of the α1 and α2 characteristic values in the sequence Cnew are computed as feature values; taking ᾱ1 as an example, ᾱ1 = (Σi wi·α1(ci)) / (Σi wi), where α1(ci) denotes the α1 value in the i-th motion information, and ᾱ2 is computed analogously;
c) the variances of the α1 and α2 characteristic values in the sequence Cnew are computed as further feature values; taking α1 as an example, Var(α1) = (1/M)·Σi (α1(ci) − ᾱ1)², and Var(α2) is computed analogously;
d) the feature matrix H of the mouth motion sequence is thus obtained from the weighted means ᾱ1, ᾱ2 and the variances Var(α1), Var(α2);
step 3: the feature matrix H of the collected mouth motion sequence is compared with the feature matrix H0 of a standard mouth motion sequence obtained in advance to give a similarity value for the motion features; the standard sequence is obtained by collecting mouth sequences from multiple healthy persons in advance and computing H0 in the same way; the similarity is computed as S = ||H − H0||2, and a smaller S value indicates greater similarity and a lower likelihood that the person under test has facial paralysis.
Preferably, step eight comprises the following steps: the speech clarity D of the person under test is obtained from the frequency distribution of the audio; the spectrum X[m] of each frame obtained in step six is analysed, the current frame is marked 1 when the magnitude peak lies at 3 kHz and 0 otherwise, so the audio signal of the person under test is converted into a binary sequence; the proportion of 1s in the sequence is counted and taken as the speech clarity D, i.e. D equals the number of frames marked 1 divided by the total number of frames; a larger D value indicates less clear speech and a higher possibility that the person under test has facial paralysis.
Preferably, step nine comprises the following steps: the mouth motion similarity index S and the speech clarity D are combined into a facial paralysis index Pro = ln(SD + 1); Pro is compared with an empirical threshold Pro0, and when Pro > Pro0 the person under test is judged to be a facial paralysis patient; Pro0 is taken as 1.6, and the implementer may change this threshold according to actual operating conditions.
Compared with the prior art, the facial paralysis detection method based on visual perception and audio information has the following advantages. First, the 3D landmarks of the face are obtained from the RGBD images, and the person under test is asked to read specified text. The mouth movement and audio of the person are collected, the mouth motion feature sequence is extracted, and the sequence is corrected with the audio information; compared with a single action, a motion sequence better reflects the real condition of the patient. The mouth motion sequence and audio information are analysed, the corrected mouth motion feature sequence is compared with a standard motion sequence, and, combined with the frequency distribution of the audio, the flexibility of mouth movement and the clarity of articulation are evaluated together to obtain the facial paralysis result. Considering that the person under test may re-read words, the mouth motion sequence is corrected with the audio information so that it reflects normal reading, improving detection accuracy.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention.
Detailed Description
The facial paralysis detection method based on visual perception and audio information is further explained with reference to the accompanying drawing and a specific embodiment. As shown in FIG. 1, the present embodiment includes the following steps. Step one: an RGBD camera is arranged above a display screen to collect face RGB images and depth images of the person under test; the camera views the face head-on and its field of view covers the entire facial area. At the same time, the camera collects the audio of the person under test.
Step two: the RGB image is used to obtain the face-region RGB image I1 and depth map I2.
The purpose of this step is to eliminate the influence of the background on face-region detection and to isolate irrelevant content. The specific operation is as follows: the collected RGB image is fed into a semantic segmentation network to obtain a Mask of the face region; multiplying the Mask with the collected RGB image yields the face-region image I1, and multiplying it with the depth image yields the face-region depth map I2. The semantic segmentation network is trained on images acquired by the camera; the training set is manually labeled, with face-region pixels labeled 1 and all other pixels labeled 0. The network is trained on this labeled data with a cross-entropy loss, continuously updating the model parameters; implementers may use segmentation networks such as U-Net or DeepLabv3+. At inference time, the collected RGB image is fed directly into the network; the output Mask is a binary image in which face pixels have value 1 and all other pixels have value 0. A minimal sketch of this masking step follows.
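The sketch below applies a binary face Mask to the RGB and depth images with NumPy; `segment_face` is a hypothetical callable standing in for any trained segmentation network (U-Net, DeepLabv3+ or similar) and is not part of the original disclosure.

```python
# Minimal sketch of step two, assuming a trained face-segmentation model.
import numpy as np

def extract_face_regions(rgb, depth, segment_face):
    """rgb: HxWx3 array, depth: HxW array pixel-aligned with rgb;
    segment_face: hypothetical model returning an HxW {0, 1} Mask."""
    mask = segment_face(rgb)                 # binary Mask: 1 = face pixel, 0 = background
    i1 = rgb * mask[..., np.newaxis]         # face-region RGB image I1
    i2 = depth * mask                        # face-region depth map I2
    return i1, i2
```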
Step three: for image I1Performing key point to obtain 2D-landmark of human face, and combining with depth map I2Obtaining 3D-landmark of the human face;
in the step, firstly, the image I is subjected to face key point detection to obtain the 2D-landmark, and a plurality of methods for obtaining the face 2D-landmark are available, are known technologies and are not in the protection range of the invention, and an implementer can select Op according to actual conditionsThe enFace network, the DAN network and other prior art. Due to depth map I2And face RGB image I1The pixels are aligned, so that the depth values corresponding to all points in the 2D-landmark can be further directly obtained, the 3D-landmark of the face can be obtained, and the subsequent determination of the characteristic value of the mouth movement of the person to be detected is facilitated.
Step four: displaying appointed characters on a display screen, and guiding a person to be tested to read the characters on the screen; in order to facilitate the follow-up analysis of the audio information of the person to be tested, repeated characters cannot appear in the specified read characters, the follow-up analysis difficulty of the audio is reduced, and the accuracy is improved.
Step five: collecting mouth movement conditions when a person reads characters, and determining characteristic values based on key points of a mouth in 3D-landmark to obtain a characteristic sequence of mouth movement;
according to the priori knowledge, when people read characters, the position of the key point of the nose tip is kept unchanged, and the movement condition of the key point of the mouth corner can reflect the movement characteristics of the mouth. The key point of the nose tip is taken as the root node and is marked as P0And the key point of the left and right mouth corners is marked as P1,P2The center points of the upper and lower lips are denoted as P3,P4By root node P0As a starting point, the rest points are connected to obtain four directions. Vector P0P1And P0P2The change of the included angle reflects the change condition of the mouth angle, and the included angle of the two vectors is recorded as alpha1Meanwhile, the difference value of the two vector modes can reflect the deflection degree of the mouth, and the mode length difference value of the two vectors is recorded as sigma | | | P0P1|-|P0P2L. Vector P0P3And P0P4The included angle reflects the motion condition of the upper lip and the lower lip, and the included angle of the two vectors is recorded as alpha2(ii) a To this end, a characteristic value { alpha ] of the angular movement of the mouth is determined1,α2,σ}。
To reduce the data-storage load on the hardware, the sampling rate is set to collect motion information once every two frames. The three characteristic values of mouth movement are extracted at each sample, giving a sequence of shape [3, N], where N is the number of times motion information is acquired. The sequence is written C = {c1, c2, c3, ..., cN}, where each c is a 3-row, 1-column vector containing the three mouth-movement characteristic values; a sketch of how one such c can be computed from the 3D landmarks follows.
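The sketch below computes one feature triple {α1, α2, σ} from the five mouth-related 3D landmarks; returning the angles in degrees is a choice made here for illustration, since the text does not fix a unit.

```python
import numpy as np

def mouth_feature_vector(p0, p1, p2, p3, p4):
    """p0: nose tip, p1/p2: left/right mouth corners, p3/p4: upper/lower lip
    centres, each a length-3 point taken from the 3D landmarks."""
    p0, p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3, p4))

    def angle_deg(u, v):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    v1, v2 = p1 - p0, p2 - p0
    v3, v4 = p3 - p0, p4 - p0
    alpha1 = angle_deg(v1, v2)                             # mouth-corner angle α1
    alpha2 = angle_deg(v3, v4)                             # upper/lower lip angle α2
    sigma = abs(np.linalg.norm(v1) - np.linalg.norm(v2))   # skew σ = | |P0P1| - |P0P2| |
    return np.array([alpha1, alpha2, sigma])               # one column c of the sequence C
```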
Step six: the step five is carried out simultaneously, audio information is collected and is subjected to framing and windowing operation, MFCC (Mel cepstrum coefficient) of each frame is extracted, and the mouth motion sequence obtained in the step four is corrected in time; the phenomenon that facial paralysis patients have unclear mouth and teeth and language disorder during reading is considered, and repeated reading can occur due to the influence of psychological factors. When repeated reading is carried out, repeated actions can occur in mouth movement, and influence is caused on subsequent analysis, so that feature extraction is carried out on audio information to correct the mouth movement feature sequence C.
A series of operations such as framing, windowing, and FFT is performed on the acquired audio to obtain the Mel-frequency cepstral coefficients (MFCC) of each audio frame; these coefficients reflect the characteristics of the audio signal. The specific MFCC procedure is as follows:
d) Pre-emphasis is applied to the collected audio signal in order to boost the high-frequency portion, flatten the spectrum so that the whole band from low to high frequency can be analysed with a comparable signal-to-noise ratio, and highlight the high-frequency formants. The pre-emphasis filter is H(z) = 1 − μz⁻¹, where z denotes the collected speech signal and μ is typically 0.97.
e) Because speech is short-time stationary, a framing operation is applied: the period of the audio signal is determined with the short-time autocorrelation function, the length of the time segment for a single frame is determined from this period, and framing yields the audio signal of each period. The short-time autocorrelation function is well known in the art, is not within the protection scope of the invention, and is not described in detail here.
f) An FFT (fast Fourier transform) is applied to the audio signal of each period to obtain the corresponding linear spectrum X[m] = H[m]E[m], which is then processed with a Mel filter bank to obtain a nonlinear Mel spectrum reflecting human auditory characteristics.
g) For the Mel spectrum, the base-10 logarithm of the magnitudes is taken on both sides, giving the Mel log-spectrum log||X[m]|| = log||H[m]|| + log||E[m]||; the inverse Fourier transform is then applied to both sides, i.e. x[m] = h[m] + e[m]. In practice the inverse Fourier transform is realised with the DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
h) MFCC coefficients are thus obtained for each frame of the audio signal, reflecting the characteristics of that frame of speech; a hedged sketch of this extraction follows.
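The sketch below relies on librosa's standard fixed-length framing rather than the pitch-period framing described above, so it should be read as an approximation of the same idea, not as the exact procedure of the invention.

```python
import numpy as np
import librosa

def frame_mfcc(audio, sr, mu=0.97):
    """Return one 12-dimensional MFCC vector per frame (coefficients 2..13)."""
    audio = np.asarray(audio, dtype=float)
    emphasized = np.append(audio[0], audio[1:] - mu * audio[:-1])  # pre-emphasis H(z) = 1 - mu*z^-1
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13)    # framing, Mel filtering, DCT
    return mfcc[1:13].T                                            # drop the 0th coefficient
```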
Because the text on the display screen contains no repeated characters, detecting that the MFCC coefficients of the current audio frame match those of a historical frame indicates that the person under test has re-read a word, and the mouth motion sequence must therefore be corrected.
It should be noted that the mouth motion sequence and the audio are acquired simultaneously, so the two are aligned in time. The audio is used to detect re-reading together with the corresponding time period, and the mouth motion samples within that period are removed to obtain the corrected mouth motion feature sequence, written Cnew = {c1, c2, c3, ..., cM}, where M is the number of acquisitions of corrected mouth motion information.
Also note that mouth motion is sampled once every two video frames while the audio is sampled once per period; because the sampling rates differ, one audio frame may correspond to several mouth motion feature values. A sketch of the correction under these assumptions follows.
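The sketch below illustrates one way to carry out this correction: mouth samples are timestamped, audio frames carry their time spans, and a frame is treated as a re-read when its MFCC vector matches an earlier frame within a small tolerance. The matching tolerance is an illustrative choice, not taken from the text.

```python
import numpy as np

def correct_mouth_sequence(mouth_feats, mouth_times, mfccs, frame_spans, tol=1e-3):
    """mouth_feats: Nx3 array of (α1, α2, σ) samples, mouth_times: their timestamps;
    mfccs: per-frame MFCC vectors, frame_spans: list of (start, end) times per frame."""
    mfccs = np.asarray(mfccs, dtype=float)
    repeated = np.zeros(len(mfccs), dtype=bool)
    for i in range(1, len(mfccs)):
        dists = np.linalg.norm(mfccs[:i] - mfccs[i], axis=1)
        repeated[i] = bool(np.any(dists < tol))          # same as a historical frame -> re-read

    kept = [c for c, t in zip(mouth_feats, mouth_times)
            if not any(rep and start <= t <= end
                       for rep, (start, end) in zip(repeated, frame_spans))]
    return np.asarray(kept)                              # corrected sequence Cnew
```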
Step seven: the corrected mouth movement sequence CnewCarrying out feature extraction, adjusting feature values by using the speed of speech, and comparing the feature values with a standard mouth motion feature sequence to obtain a motion similarity index S;
extracting mouth motion sequence CnewThe feature matrix H is constructed by using the feature values of (a), and the specific method for constructing the feature matrix H is as follows:
e) Each value c in the sequence Cnew has three characteristic values α1, α2 and σ. The characteristic value σ reflects the degree of mouth-corner skew of the person under test: the larger σ is, the more skewed the mouth is and the higher the probability that the person is a facial paralysis patient, so the mouth motion information at that moment is treated as key information and should receive a larger weight. An empirical threshold σ0 is set, and when σi (the value of σ in the i-th acquisition) exceeds σ0, the mouth skew is considered abnormal and the corresponding motion information is assigned a larger weight wi.
f) Using the assigned weights, the weighted means of the α1 and α2 characteristic values in the sequence Cnew are computed as feature values. Taking ᾱ1 as an example, ᾱ1 = (Σi wi·α1(ci)) / (Σi wi), where α1(ci) denotes the α1 value in the i-th motion information; ᾱ2 is computed analogously.
g) The variances of the α1 and α2 characteristic values in the sequence Cnew are computed as further feature values. Taking α1 as an example, Var(α1) = (1/M)·Σi (α1(ci) − ᾱ1)²; Var(α2) is computed analogously.
h) The feature matrix H of the mouth motion sequence is thus obtained from the weighted means ᾱ1, ᾱ2 and the variances Var(α1), Var(α2).
The reason for choosing the mean and variance to build the feature matrix is that they capture the average size of the feature values and the spread of their distribution, i.e. the distribution characteristics of the data. They are largely unaffected by the acquisition duration and the number of samples, and are insensitive to speaking rate, avoiding differences caused by different speech speeds and making the subsequent similarity comparison easier.
The feature matrix H of the collected mouth motion sequence is compared with the feature matrix H0 of a standard mouth motion sequence obtained in advance to give a similarity value for the motion features. The standard mouth motion sequence is obtained by collecting mouth sequences from multiple healthy persons in advance and computing H0 in the same way. The similarity is computed as S = ||H − H0||2; a smaller S value indicates greater similarity and a lower likelihood that the person under test has facial paralysis. A sketch of steps e) through h) and this comparison follows.
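The sketch below builds H and computes S. The two-level weighting and the 2x2 layout of H are assumptions made for illustration, since the publication gives the weight formula and matrix only as embedded images.

```python
import numpy as np

def motion_similarity(c_new, h0, sigma0, w_high=2.0, w_low=1.0):
    """c_new: Mx3 array whose rows are (α1, α2, σ); h0: feature matrix of the
    standard sequence built from healthy subjects in the same way."""
    a1, a2, sig = c_new[:, 0], c_new[:, 1], c_new[:, 2]
    w = np.where(sig > sigma0, w_high, w_low)        # larger weight when σ exceeds σ0
    mean1 = np.sum(w * a1) / np.sum(w)               # weighted mean of α1
    mean2 = np.sum(w * a2) / np.sum(w)               # weighted mean of α2
    var1 = np.mean((a1 - mean1) ** 2)                # variance of α1
    var2 = np.mean((a2 - mean2) ** 2)                # variance of α2
    h = np.array([[mean1, mean2], [var1, var2]])     # feature matrix H
    return float(np.linalg.norm(h - h0))             # S = ||H - H0||2, smaller = more similar
```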
Step eight: analyzing the frequency distribution of the audio to obtain the speaking definition D of the person to be tested;
according to the priori, the fact that the recognition sound of speaking can be masked due to the fact that the amplitude of the 3KHz position in the audio information is too high, namely, the mouth and teeth are unclear, lip sounds m, b and v are difficult to distinguish, and speaking definition D of a person to be detected can be obtained according to frequency distribution in the audio information.
The spectrum X[m] of each frame obtained in step six is analysed: when the magnitude peak lies at 3 kHz the current frame is marked 1, otherwise 0, converting the audio signal of the person under test into a binary sequence of the form 001110011...; the proportion of 1s in this sequence is counted and taken as the speech clarity D, i.e. D equals the number of frames marked 1 divided by the total number of frames.
The larger the D value, the less clear the speech and the higher the possibility that the person under test has facial paralysis. A sketch of this frame-marking computation follows.
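The sketch below marks each frame by whether its spectral peak lies near 3 kHz and averages the marks into D; the width of the band around 3 kHz is an assumption, since the text names only the 3 kHz position.

```python
import numpy as np

def speaking_clarity(frames, sr, band_hz=(2800.0, 3200.0)):
    """frames: iterable of 1-D audio frames sampled at sr Hz."""
    marks = []
    for frame in frames:
        frame = np.asarray(frame, dtype=float)
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        peak_hz = freqs[int(np.argmax(spectrum))]
        marks.append(1 if band_hz[0] <= peak_hz <= band_hz[1] else 0)
    return float(np.mean(marks))                     # D = proportion of frames marked 1
```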
Step nine: and (5) integrating the motion similarity index S and the speaking definition D to obtain a facial paralysis detection result.
Considering that facial paralysis patients have stiff movements, cannot control their facial muscles normally, and exhibit stuttering and speech disorders, the invention combines the mouth motion similarity index S and the speech clarity D into the facial paralysis index Pro = ln(SD + 1). Pro is compared with an empirical threshold Pro0, and when Pro > Pro0 the person under test is judged to be a facial paralysis patient. In this invention Pro0 is taken as 1.6, and the implementer may change this threshold according to actual operating conditions. A sketch of this decision rule follows.
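A minimal sketch of the final decision rule, using the threshold value named in the text, is:

```python
import math

def facial_paralysis_decision(s, d, pro0=1.6):
    """Combine motion similarity S and speech clarity D into Pro = ln(S*D + 1)
    and compare against the empirical threshold Pro0."""
    pro = math.log(s * d + 1.0)
    return pro, pro > pro0                           # True means flagged as facial paralysis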

Claims (8)

1. A facial paralysis detection method based on visual perception and audio information, characterized by comprising the following steps:
step one: an RGBD camera is arranged above a display screen to collect an RGB image and a depth image, the camera views the face head-on, and the audio of the person under test is collected;
step two: the RGB image is used to obtain the face-region RGB image I1 and depth map I2;
step three: a key-point detection network is applied to image I1 to obtain the 2D landmarks of the face, which are combined with the depth map I2 to obtain the 3D landmarks;
step four: specified text is displayed on the screen and the person under test is guided to read it;
step five: the mouth movement of the person while reading is collected, and characteristic values reflecting mouth movement are defined to obtain a mouth motion feature sequence;
step six: simultaneously with step five, audio is collected, framed and windowed, the Mel-frequency cepstral coefficients of each frame are extracted, and the mouth motion sequence is corrected;
step seven: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with a standard mouth motion feature sequence to obtain a motion similarity index S;
step eight: the frequency distribution of the audio is analysed to obtain the speech clarity D of the person under test;
step nine: the motion similarity index S and the speech clarity D are combined to obtain the facial paralysis detection result.
2. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step two comprises the following steps:
step 2.1: the collected RGB image is fed into a semantic segmentation network to obtain a Mask of the face region; multiplying the Mask with the collected RGB image yields the face-region image I1, and multiplying it with the depth image yields the face-region depth map I2;
step 2.2: the semantic segmentation network is trained on images collected by the camera; the training set is manually labeled, with face-region pixels labeled 1 and all other pixels labeled 0; the network is trained on the labeled data with a cross-entropy loss, continuously updating the model parameters, and the implementer may use a segmentation network such as U-Net or DeepLabv3+;
step 2.3: at inference time, the collected RGB image is fed directly into the segmentation network; the output Mask is a binary image in which face-region pixels have value 1 and all other pixels have value 0.
3. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step three comprises the following steps:
face key-point detection is performed on the image I1 to obtain its 2D landmarks, with an OpenFace or DAN network selected according to the actual situation; because the depth map I2 is pixel-aligned with the face RGB image I1, the depth value corresponding to each 2D landmark point is obtained directly, giving the 3D landmarks of the face.
4. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step five comprises the following steps:
the nose-tip key point is taken as the root node and denoted P0, the left and right mouth-corner key points are denoted P1 and P2, and the centre points of the upper and lower lips are denoted P3 and P4; with the root node P0 as the starting point, the remaining points are connected to obtain vectors in four directions; the change of the angle between P0P1 and P0P2 reflects the change of the mouth corners, and this angle is denoted α1; the difference of the two vector lengths reflects the degree of mouth skew and is denoted σ = | |P0P1| − |P0P2| |; the angle between P0P3 and P0P4 reflects the motion of the upper and lower lips and is denoted α2; the characteristic values of mouth movement are thus determined as {α1, α2, σ}; the three characteristic values are extracted at each sample, giving a sequence of shape [3, N], where N is the number of times motion information is acquired, written C = {c1, c2, c3, ..., cN}, where each c is a 3-row, 1-column vector containing the three mouth-movement characteristic values.
5. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein the MFCC acquisition in step six comprises the following steps:
a) pre-emphasis is applied to the collected audio signal; the pre-emphasis filter is H(z) = 1 − μz⁻¹, where z denotes the collected speech signal and μ is usually 0.97;
b) based on the short-time stationarity of the audio, a framing operation is applied: the period of the audio signal is determined with the short-time autocorrelation function, the length of the time segment for a single frame is determined from this period, and framing yields the audio signal of each period;
c) a fast Fourier transform is applied to the audio signal of each period to obtain the corresponding linear spectrum X[m] = H[m]E[m], which is processed with a Mel filter bank to obtain a nonlinear Mel spectrum reflecting human auditory characteristics; the base-10 logarithm of the magnitudes is taken on both sides, giving the Mel log-spectrum log||X[m]|| = log||H[m]|| + log||E[m]||, and the inverse Fourier transform is then applied to both sides, i.e. x[m] = h[m] + e[m]; the inverse Fourier transform is realised with the discrete cosine transform, and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients; the MFCC coefficients of each frame reflect the features of that frame of speech, and the corrected mouth motion feature sequence is obtained as Cnew = {c1, c2, c3, ..., cM}, where M is the number of acquisitions of corrected mouth motion information.
6. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step seven comprises the following steps:
step 1: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with the standard mouth motion feature sequence to obtain the motion similarity index S;
step 2: the feature values of the mouth motion sequence Cnew are extracted and used to construct a feature matrix H, as follows:
a) each value c in the sequence Cnew has three characteristic values α1, α2 and σ; the characteristic value σ reflects the degree of mouth-corner skew of the person under test: the larger σ is, the more skewed the mouth is and the higher the probability that the person is a facial paralysis patient, so the mouth motion information at that moment is treated as key information and should receive a larger weight; an empirical threshold σ0 is set, and when σi (the value of σ in the i-th acquisition) exceeds σ0, the mouth skew is considered abnormal and the corresponding motion information is assigned a larger weight wi;
b) using the assigned weights, the weighted means of the α1 and α2 characteristic values in the sequence Cnew are computed as feature values; taking ᾱ1 as an example, ᾱ1 = (Σi wi·α1(ci)) / (Σi wi), where α1(ci) denotes the α1 value in the i-th motion information, and ᾱ2 is computed analogously;
c) the variances of the α1 and α2 characteristic values in the sequence Cnew are computed as further feature values; taking α1 as an example, Var(α1) = (1/M)·Σi (α1(ci) − ᾱ1)², and Var(α2) is computed analogously;
d) the feature matrix H of the mouth motion sequence is thus obtained from the weighted means ᾱ1, ᾱ2 and the variances Var(α1), Var(α2);
step 3: the feature matrix H of the collected mouth motion sequence is compared with the feature matrix H0 of a standard mouth motion sequence obtained in advance to give a similarity value for the motion features; the standard sequence is obtained by collecting mouth sequences from multiple healthy persons in advance and computing H0 in the same way; the similarity is computed as S = ||H − H0||2, and a smaller S value indicates greater similarity and a lower likelihood that the person under test has facial paralysis.
7. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step eight comprises the following steps: the speech clarity D of the person under test is obtained from the frequency distribution of the audio; the spectrum X[m] of each frame obtained in step six is analysed, the current frame is marked 1 when the magnitude peak lies at 3 kHz and 0 otherwise, so the audio signal of the person under test is converted into a binary sequence; the proportion of 1s in the sequence is counted and taken as the speech clarity D, i.e. D equals the number of frames marked 1 divided by the total number of frames; a larger D value indicates less clear speech and a higher possibility that the person under test has facial paralysis.
8. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step nine comprises the following steps: the mouth motion similarity index S and the speech clarity D are combined into a facial paralysis index Pro = ln(SD + 1); Pro is compared with an empirical threshold Pro0, and when Pro > Pro0 the person under test is judged to be a facial paralysis patient; Pro0 is taken as 1.6, and the implementer may change this threshold according to actual operating conditions.
CN202011340158.5A 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information Withdrawn CN112308037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011340158.5A CN112308037A (en) 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011340158.5A CN112308037A (en) 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information

Publications (1)

Publication Number Publication Date
CN112308037A true CN112308037A (en) 2021-02-02

Family

ID=74335513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011340158.5A Withdrawn CN112308037A (en) 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information

Country Status (1)

Country Link
CN (1) CN112308037A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688701A (en) * 2021-08-10 2021-11-23 江苏仁和医疗器械有限公司 Facial paralysis detection method and system based on computer vision
CN113688701B (en) * 2021-08-10 2022-04-22 江苏仁和医疗器械有限公司 Facial paralysis detection method and system based on computer vision
CN114419716A (en) * 2022-01-26 2022-04-29 北方工业大学 Calibration method for face key point calibration of face image
CN114419716B (en) * 2022-01-26 2024-03-15 北方工业大学 Calibration method for face image face key point calibration
CN117577140A (en) * 2024-01-16 2024-02-20 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children
CN117577140B (en) * 2024-01-16 2024-03-19 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children

Similar Documents

Publication Publication Date Title
CN112308037A (en) Facial paralysis detection method based on visual perception and audio information
US11786171B2 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
CN109846469B (en) Non-contact heart rate measurement method based on convolutional neural network
CN110969124A (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN101199208A (en) Method, system, and program product for measuring audio video synchronization
CN106491117A (en) A kind of signal processing method and device based on PPG heart rate measurement technology
CN112037788B (en) Voice correction fusion method
CN114241599A (en) Depression tendency evaluation system and method based on multi-modal characteristics
Ding et al. Deep connected attention (DCA) ResNet for robust voice pathology detection and classification
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
CN115101191A (en) Parkinson disease diagnosis system
CN115910097A (en) Audible signal identification method and system for latent fault of high-voltage circuit breaker
Douros et al. Towards a method of dynamic vocal tract shapes generation by combining static 3D and dynamic 2D MRI speech data
CN112716468A (en) Non-contact heart rate measuring method and device based on three-dimensional convolution network
Freitas et al. Velum movement detection based on surface electromyography for speech interface
Lee et al. An exploratory study of emotional speech production using functional data analysis techniques
CN113963427B (en) Method and system for rapid in-vivo detection
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
EP1489597A2 (en) Voice detection device
CN112507877A (en) System and method for detecting heart rate under condition of partial video information loss
CN116866783B (en) Intelligent classroom audio control system, method and storage medium
CN111260602B (en) Ultrasonic image analysis method for SSI
CN117475360B (en) Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
Rothkrantz et al. Comparison between different feature extraction techniques in lipreading applications
CN117558035B (en) Figure identity recognition system and method based on image technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210202