CN112308037A - Facial paralysis detection method based on visual perception and audio information - Google Patents

Facial paralysis detection method based on visual perception and audio information

Info

Publication number
CN112308037A
Authority
CN
China
Prior art keywords
mouth
sequence
audio information
motion
facial paralysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011340158.5A
Other languages
Chinese (zh)
Inventor
陈永宁
袁梦
刘亚刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Suyi Electronic Technology Co ltd
Original Assignee
Zhengzhou Suyi Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Suyi Electronic Technology Co ltd filed Critical Zhengzhou Suyi Electronic Technology Co ltd
Priority to CN202011340158.5A priority Critical patent/CN112308037A/en
Publication of CN112308037A publication Critical patent/CN112308037A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/66Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for extracting parameters related to health condition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial paralysis detection method based on visual perception and audio information, addressing the lack of a fast and simple way to detect and diagnose facial nerve paralysis. The method comprises the following steps: collecting RGB and depth images; obtaining an RGB image I1 and a depth map I2 of the face region; obtaining the 2D and 3D key points of the face; guiding the person under test to read displayed text; defining characteristic values that reflect mouth movement to obtain a mouth motion feature sequence; collecting audio information, performing framing and windowing, extracting the Mel-frequency cepstral coefficients of each frame, and correcting the mouth motion sequence; performing feature extraction on the corrected mouth motion sequence Cnew to obtain a motion similarity index S; analyzing the frequency distribution of the audio to obtain a speech clarity D; and combining S and D to obtain the facial paralysis detection result. The method corrects the mouth motion feature sequence with the audio information, giving higher accuracy.

Description

Facial paralysis detection method based on visual perception and audio information
Technical Field
The invention relates to the technical field of machine vision image processing, in particular to a facial paralysis detection method based on visual perception and audio information.
Background
Facial paralysis is a disease characterized mainly by dysfunction of the facial expression muscles; its typical symptoms are facial distortion and uncoordinated facial expression. Because they cannot control the facial muscles, facial paralysis patients often lack flexibility in their facial movements and may present other symptoms including drooling, speech problems, and nasal congestion.
In the prior art, facial paralysis detection usually has the person under test perform a specified action and assesses how well that action is completed. However, when a person deliberately performs a designated action, subjective awareness is strong and the detection result contains errors. Moreover, detection based on only a few actions cannot finely grade the degree of facial paralysis. To address these problems, the present invention provides a facial paralysis detection method based on visual perception and audio information.
Disclosure of Invention
The invention overcomes the lack of a fast and simple way to detect and diagnose facial paralysis in the prior art by providing a facial paralysis detection method based on visual perception and audio information.
The technical solution of the present invention is a facial paralysis detection method based on visual perception and audio information, comprising the following steps. Step one: an RGBD camera is arranged above a display screen to collect an RGB image and a depth image, the camera views the face head-on, and the audio of the person under test is collected. Step two: the RGB image is used to obtain the face-region RGB image I1 and depth map I2. Step three: a key-point detection network is applied to image I1 to obtain the 2D landmarks of the face, which are combined with the depth map I2 to obtain the 3D landmarks. Step four: specified text is displayed on the screen and the person under test is guided to read it. Step five: the mouth movement of the person while reading is collected, and characteristic values reflecting mouth movement are defined to obtain a mouth motion feature sequence. Step six: simultaneously with step five, audio is collected, framed and windowed, the Mel-frequency cepstral coefficients of each frame are extracted, and the mouth motion sequence is corrected. Step seven: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with a standard mouth motion feature sequence to obtain a motion similarity index S. Step eight: the frequency distribution of the audio is analysed to obtain the speech clarity D of the person under test. Step nine: the motion similarity index S and the speech clarity D are combined to obtain the facial paralysis detection result.
Preferably, the second step comprises the following steps:
step 2.1: the collected RGB image is fed into a semantic segmentation network to obtain a Mask of the face region; multiplying the Mask with the collected RGB image yields the face-region image I1, and multiplying it with the depth image yields the face-region depth map I2;
step 2.2: the semantic segmentation network is trained on images collected by the camera; the training set is manually labeled, with face-region pixels labeled 1 and all other pixels labeled 0; the network is trained on the labeled data with a cross-entropy loss, continuously updating the model parameters, and the implementer may use a segmentation network such as U-Net or DeepLabv3+;
step 2.3: at inference time, the collected RGB image is fed directly into the segmentation network; the output Mask is a binary image in which face-region pixels have value 1 and all other pixels have value 0.
Preferably, the third step comprises the following steps:
face key-point detection is performed on the image I1 to obtain its 2D landmarks, with an OpenFace or DAN network selected according to the actual situation; because the depth map I2 is pixel-aligned with the face RGB image I1, the depth value corresponding to each 2D landmark point is obtained directly, giving the 3D landmarks of the face.
Preferably, the step five comprises the following steps:
the nose-tip key point is taken as the root node and denoted P0, the left and right mouth-corner key points are denoted P1 and P2, and the centre points of the upper and lower lips are denoted P3 and P4; with the root node P0 as the starting point, the remaining points are connected to obtain vectors in four directions; the change of the angle between P0P1 and P0P2 reflects the change of the mouth corners, and this angle is denoted α1; the difference of the two vector lengths reflects the degree of mouth skew and is denoted σ = | |P0P1| − |P0P2| |; the angle between P0P3 and P0P4 reflects the motion of the upper and lower lips and is denoted α2; the characteristic values of mouth movement are thus determined as {α1, α2, σ}; the three characteristic values are extracted at each sample, giving a sequence of shape [3, N], where N is the number of times motion information is acquired, written C = {c1, c2, c3, ..., cN}, where each c is a 3-row, 1-column vector containing the three mouth-movement characteristic values.
Preferably, the method for acquiring the MFCC in the sixth step includes the following steps:
a) pre-emphasis is applied to the collected audio signal; the pre-emphasis filter is H(z) = 1 − μz⁻¹, where z denotes the collected speech signal and μ is usually 0.97;
b) based on the short-time stationarity of the audio, a framing operation is applied: the period of the audio signal is determined with the short-time autocorrelation function, the length of the time segment for a single frame is determined from this period, and framing yields the audio signal of each period;
c) a fast Fourier transform is applied to the audio signal of each period to obtain the corresponding linear spectrum X[m] = H[m]E[m], which is processed with a Mel filter bank to obtain a nonlinear Mel spectrum reflecting human auditory characteristics; the base-10 logarithm of the magnitudes is taken on both sides, giving the Mel log-spectrum log||X[m]|| = log||H[m]|| + log||E[m]||, and the inverse Fourier transform is then applied to both sides, i.e. x[m] = h[m] + e[m]; the inverse Fourier transform is realised with the discrete cosine transform, and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients; the MFCC coefficients of each frame reflect the features of that frame of speech, and the corrected mouth motion feature sequence is obtained as Cnew = {c1, c2, c3, ..., cM}, where M is the number of acquisitions of corrected mouth motion information.
Preferably, the seventh step comprises the steps of:
step 1: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with the standard mouth motion feature sequence to obtain the motion similarity index S;
step 2: the feature values of the mouth motion sequence Cnew are extracted and used to construct a feature matrix H, as follows:
a) each value c in the sequence Cnew has three characteristic values α1, α2 and σ; the characteristic value σ reflects the degree of mouth-corner skew of the person under test: the larger σ is, the more skewed the mouth is and the higher the probability that the person is a facial paralysis patient, so the mouth motion information at that moment is treated as key information and should receive a larger weight; an empirical threshold σ0 is set, and when σi (the value of σ in the i-th acquisition) exceeds σ0, the mouth skew is considered abnormal and the corresponding motion information is assigned a larger weight wi;
b) using the assigned weights, the weighted means of the α1 and α2 characteristic values in the sequence Cnew are computed as feature values; taking ᾱ1 as an example, ᾱ1 = (Σi wi·α1(ci)) / (Σi wi), where α1(ci) denotes the α1 value in the i-th motion information, and ᾱ2 is computed analogously;
c) the variances of the α1 and α2 characteristic values in the sequence Cnew are computed as further feature values; taking α1 as an example, Var(α1) = (1/M)·Σi (α1(ci) − ᾱ1)², and Var(α2) is computed analogously;
d) the feature matrix H of the mouth motion sequence is thus obtained from the weighted means ᾱ1, ᾱ2 and the variances Var(α1), Var(α2);
step 3: the feature matrix H of the collected mouth motion sequence is compared with the feature matrix H0 of a standard mouth motion sequence obtained in advance to give a similarity value for the motion features; the standard sequence is obtained by collecting mouth sequences from multiple healthy persons in advance and computing H0 in the same way; the similarity is computed as S = ||H − H0||2, and a smaller S value indicates greater similarity and a lower likelihood that the person under test has facial paralysis.
Preferably, step eight comprises the following steps: the speech clarity D of the person under test is obtained from the frequency distribution of the audio; the spectrum X[m] of each frame obtained in step six is analysed, the current frame is marked 1 when the magnitude peak lies at 3 kHz and 0 otherwise, so the audio signal of the person under test is converted into a binary sequence; the proportion of 1s in the sequence is counted and taken as the speech clarity D, i.e. D equals the number of frames marked 1 divided by the total number of frames; a larger D value indicates less clear speech and a higher possibility that the person under test has facial paralysis.
Preferably, step nine comprises the following steps: the mouth motion similarity index S and the speech clarity D are combined into a facial paralysis index Pro = ln(SD + 1); Pro is compared with an empirical threshold Pro0, and when Pro > Pro0 the person under test is judged to be a facial paralysis patient; Pro0 is taken as 1.6, and the implementer may change this threshold according to actual operating conditions.
Compared with the prior art, the facial paralysis detection method based on visual perception and audio information has the following advantages. First, the 3D landmarks of the face are obtained from the RGBD images, and the person under test is asked to read specified text. The mouth movement and audio of the person are collected, the mouth motion feature sequence is extracted, and the sequence is corrected with the audio information; compared with a single action, a motion sequence better reflects the real condition of the patient. The mouth motion sequence and audio information are analysed, the corrected mouth motion feature sequence is compared with a standard motion sequence, and, combined with the frequency distribution of the audio, the flexibility of mouth movement and the clarity of articulation are evaluated together to obtain the facial paralysis result. Considering that the person under test may re-read words, the mouth motion sequence is corrected with the audio information so that it reflects normal reading, improving detection accuracy.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention.
Detailed Description
The facial paralysis detection method based on visual perception and audio information is further explained with reference to the accompanying drawing and a specific embodiment. As shown in FIG. 1, the present embodiment includes the following steps. Step one: an RGBD camera is arranged above a display screen to collect face RGB images and depth images of the person under test; the camera views the face head-on and its field of view covers the entire facial area. At the same time, the camera collects the audio of the person under test.
Step two: the RGB image is used to obtain the face-region RGB image I1 and depth map I2.
The purpose of this step is to eliminate the influence of the background on face-region detection and to isolate irrelevant content. The specific operation is as follows: the collected RGB image is fed into a semantic segmentation network to obtain a Mask of the face region; multiplying the Mask with the collected RGB image yields the face-region image I1, and multiplying it with the depth image yields the face-region depth map I2. The semantic segmentation network is trained on images acquired by the camera; the training set is manually labeled, with face-region pixels labeled 1 and all other pixels labeled 0. The network is trained on this labeled data with a cross-entropy loss, continuously updating the model parameters; implementers may use segmentation networks such as U-Net or DeepLabv3+. At inference time, the collected RGB image is fed directly into the network; the output Mask is a binary image in which face pixels have value 1 and all other pixels have value 0. A minimal sketch of this masking step follows.
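The sketch below applies a binary face Mask to the RGB and depth images with NumPy; `segment_face` is a hypothetical callable standing in for any trained segmentation network (U-Net, DeepLabv3+ or similar) and is not part of the original disclosure.

```python
# Minimal sketch of step two, assuming a trained face-segmentation model.
import numpy as np

def extract_face_regions(rgb, depth, segment_face):
    """rgb: HxWx3 array, depth: HxW array pixel-aligned with rgb;
    segment_face: hypothetical model returning an HxW {0, 1} Mask."""
    mask = segment_face(rgb)                 # binary Mask: 1 = face pixel, 0 = background
    i1 = rgb * mask[..., np.newaxis]         # face-region RGB image I1
    i2 = depth * mask                        # face-region depth map I2
    return i1, i2
```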
Step three: for image I1Performing key point to obtain 2D-landmark of human face, and combining with depth map I2Obtaining 3D-landmark of the human face;
in the step, firstly, the image I is subjected to face key point detection to obtain the 2D-landmark, and a plurality of methods for obtaining the face 2D-landmark are available, are known technologies and are not in the protection range of the invention, and an implementer can select Op according to actual conditionsThe enFace network, the DAN network and other prior art. Due to depth map I2And face RGB image I1The pixels are aligned, so that the depth values corresponding to all points in the 2D-landmark can be further directly obtained, the 3D-landmark of the face can be obtained, and the subsequent determination of the characteristic value of the mouth movement of the person to be detected is facilitated.
Step four: displaying appointed characters on a display screen, and guiding a person to be tested to read the characters on the screen; in order to facilitate the follow-up analysis of the audio information of the person to be tested, repeated characters cannot appear in the specified read characters, the follow-up analysis difficulty of the audio is reduced, and the accuracy is improved.
Step five: collecting mouth movement conditions when a person reads characters, and determining characteristic values based on key points of a mouth in 3D-landmark to obtain a characteristic sequence of mouth movement;
according to the priori knowledge, when people read characters, the position of the key point of the nose tip is kept unchanged, and the movement condition of the key point of the mouth corner can reflect the movement characteristics of the mouth. The key point of the nose tip is taken as the root node and is marked as P0And the key point of the left and right mouth corners is marked as P1,P2The center points of the upper and lower lips are denoted as P3,P4By root node P0As a starting point, the rest points are connected to obtain four directions. Vector P0P1And P0P2The change of the included angle reflects the change condition of the mouth angle, and the included angle of the two vectors is recorded as alpha1Meanwhile, the difference value of the two vector modes can reflect the deflection degree of the mouth, and the mode length difference value of the two vectors is recorded as sigma | | | P0P1|-|P0P2L. Vector P0P3And P0P4The included angle reflects the motion condition of the upper lip and the lower lip, and the included angle of the two vectors is recorded as alpha2(ii) a To this end, a characteristic value { alpha ] of the angular movement of the mouth is determined1,α2,σ}。
To reduce the data-storage load on the hardware, the sampling rate is set to collect motion information once every two frames. The three characteristic values of mouth movement are extracted at each sample, giving a sequence of shape [3, N], where N is the number of times motion information is acquired. The sequence is written C = {c1, c2, c3, ..., cN}, where each c is a 3-row, 1-column vector containing the three mouth-movement characteristic values; a sketch of how one such c can be computed from the 3D landmarks follows.
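The sketch below computes one feature triple {α1, α2, σ} from the five mouth-related 3D landmarks; returning the angles in degrees is a choice made here for illustration, since the text does not fix a unit.

```python
import numpy as np

def mouth_feature_vector(p0, p1, p2, p3, p4):
    """p0: nose tip, p1/p2: left/right mouth corners, p3/p4: upper/lower lip
    centres, each a length-3 point taken from the 3D landmarks."""
    p0, p1, p2, p3, p4 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3, p4))

    def angle_deg(u, v):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    v1, v2 = p1 - p0, p2 - p0
    v3, v4 = p3 - p0, p4 - p0
    alpha1 = angle_deg(v1, v2)                             # mouth-corner angle α1
    alpha2 = angle_deg(v3, v4)                             # upper/lower lip angle α2
    sigma = abs(np.linalg.norm(v1) - np.linalg.norm(v2))   # skew σ = | |P0P1| - |P0P2| |
    return np.array([alpha1, alpha2, sigma])               # one column c of the sequence C
```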
Step six: the step five is carried out simultaneously, audio information is collected and is subjected to framing and windowing operation, MFCC (Mel cepstrum coefficient) of each frame is extracted, and the mouth motion sequence obtained in the step four is corrected in time; the phenomenon that facial paralysis patients have unclear mouth and teeth and language disorder during reading is considered, and repeated reading can occur due to the influence of psychological factors. When repeated reading is carried out, repeated actions can occur in mouth movement, and influence is caused on subsequent analysis, so that feature extraction is carried out on audio information to correct the mouth movement feature sequence C.
A series of operations such as framing, windowing, and FFT is performed on the acquired audio to obtain the Mel-frequency cepstral coefficients (MFCC) of each audio frame; these coefficients reflect the characteristics of the audio signal. The specific MFCC procedure is as follows:
d) Pre-emphasis is applied to the collected audio signal in order to boost the high-frequency portion, flatten the spectrum so that the whole band from low to high frequency can be analysed with a comparable signal-to-noise ratio, and highlight the high-frequency formants. The pre-emphasis filter is H(z) = 1 − μz⁻¹, where z denotes the collected speech signal and μ is typically 0.97.
e) Because speech is short-time stationary, a framing operation is applied: the period of the audio signal is determined with the short-time autocorrelation function, the length of the time segment for a single frame is determined from this period, and framing yields the audio signal of each period. The short-time autocorrelation function is well known in the art, is not within the protection scope of the invention, and is not described in detail here.
f) An FFT (fast Fourier transform) is applied to the audio signal of each period to obtain the corresponding linear spectrum X[m] = H[m]E[m], which is then processed with a Mel filter bank to obtain a nonlinear Mel spectrum reflecting human auditory characteristics.
g) For the Mel spectrum, the base-10 logarithm of the magnitudes is taken on both sides, giving the Mel log-spectrum log||X[m]|| = log||H[m]|| + log||E[m]||; the inverse Fourier transform is then applied to both sides, i.e. x[m] = h[m] + e[m]. In practice the inverse Fourier transform is realised with the DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
h) MFCC coefficients are thus obtained for each frame of the audio signal, reflecting the characteristics of that frame of speech; a hedged sketch of this extraction follows.
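The sketch below relies on librosa's standard fixed-length framing rather than the pitch-period framing described above, so it should be read as an approximation of the same idea, not as the exact procedure of the invention.

```python
import numpy as np
import librosa

def frame_mfcc(audio, sr, mu=0.97):
    """Return one 12-dimensional MFCC vector per frame (coefficients 2..13)."""
    audio = np.asarray(audio, dtype=float)
    emphasized = np.append(audio[0], audio[1:] - mu * audio[:-1])  # pre-emphasis H(z) = 1 - mu*z^-1
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13)    # framing, Mel filtering, DCT
    return mfcc[1:13].T                                            # drop the 0th coefficient
```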
Because the text on the display screen contains no repeated characters, detecting that the MFCC coefficients of the current audio frame match those of a historical frame indicates that the person under test has re-read a word, and the mouth motion sequence must therefore be corrected.
It should be noted that the mouth motion sequence and the audio are acquired simultaneously, so the two are aligned in time. The audio is used to detect re-reading together with the corresponding time period, and the mouth motion samples within that period are removed to obtain the corrected mouth motion feature sequence, written Cnew = {c1, c2, c3, ..., cM}, where M is the number of acquisitions of corrected mouth motion information.
Also note that mouth motion is sampled once every two video frames while the audio is sampled once per period; because the sampling rates differ, one audio frame may correspond to several mouth motion feature values. A sketch of the correction under these assumptions follows.
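The sketch below illustrates one way to carry out this correction: mouth samples are timestamped, audio frames carry their time spans, and a frame is treated as a re-read when its MFCC vector matches an earlier frame within a small tolerance. The matching tolerance is an illustrative choice, not taken from the text.

```python
import numpy as np

def correct_mouth_sequence(mouth_feats, mouth_times, mfccs, frame_spans, tol=1e-3):
    """mouth_feats: Nx3 array of (α1, α2, σ) samples, mouth_times: their timestamps;
    mfccs: per-frame MFCC vectors, frame_spans: list of (start, end) times per frame."""
    mfccs = np.asarray(mfccs, dtype=float)
    repeated = np.zeros(len(mfccs), dtype=bool)
    for i in range(1, len(mfccs)):
        dists = np.linalg.norm(mfccs[:i] - mfccs[i], axis=1)
        repeated[i] = bool(np.any(dists < tol))          # same as a historical frame -> re-read

    kept = [c for c, t in zip(mouth_feats, mouth_times)
            if not any(rep and start <= t <= end
                       for rep, (start, end) in zip(repeated, frame_spans))]
    return np.asarray(kept)                              # corrected sequence Cnew
```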
Step seven: the corrected mouth movement sequence CnewCarrying out feature extraction, adjusting feature values by using the speed of speech, and comparing the feature values with a standard mouth motion feature sequence to obtain a motion similarity index S;
extracting mouth motion sequence CnewThe feature matrix H is constructed by using the feature values of (a), and the specific method for constructing the feature matrix H is as follows:
e) Each value c in the sequence Cnew has three characteristic values α1, α2 and σ. The characteristic value σ reflects the degree of mouth-corner skew of the person under test: the larger σ is, the more skewed the mouth is and the higher the probability that the person is a facial paralysis patient, so the mouth motion information at that moment is treated as key information and should receive a larger weight. An empirical threshold σ0 is set, and when σi (the value of σ in the i-th acquisition) exceeds σ0, the mouth skew is considered abnormal and the corresponding motion information is assigned a larger weight wi.
f) Using the assigned weights, the weighted means of the α1 and α2 characteristic values in the sequence Cnew are computed as feature values. Taking ᾱ1 as an example, ᾱ1 = (Σi wi·α1(ci)) / (Σi wi), where α1(ci) denotes the α1 value in the i-th motion information; ᾱ2 is computed analogously.
g) The variances of the α1 and α2 characteristic values in the sequence Cnew are computed as further feature values. Taking α1 as an example, Var(α1) = (1/M)·Σi (α1(ci) − ᾱ1)²; Var(α2) is computed analogously.
h) The feature matrix H of the mouth motion sequence is thus obtained from the weighted means ᾱ1, ᾱ2 and the variances Var(α1), Var(α2).
The reason for choosing the mean and variance to build the feature matrix is that they capture the average size of the feature values and the spread of their distribution, i.e. the distribution characteristics of the data. They are largely unaffected by the acquisition duration and the number of samples, and are insensitive to speaking rate, avoiding differences caused by different speech speeds and making the subsequent similarity comparison easier.
The feature matrix H of the collected mouth motion sequence is compared with the feature matrix H0 of a standard mouth motion sequence obtained in advance to give a similarity value for the motion features. The standard mouth motion sequence is obtained by collecting mouth sequences from multiple healthy persons in advance and computing H0 in the same way. The similarity is computed as S = ||H − H0||2; a smaller S value indicates greater similarity and a lower likelihood that the person under test has facial paralysis. A sketch of steps e) through h) and this comparison follows.
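The sketch below builds H and computes S. The two-level weighting and the 2x2 layout of H are assumptions made for illustration, since the publication gives the weight formula and matrix only as embedded images.

```python
import numpy as np

def motion_similarity(c_new, h0, sigma0, w_high=2.0, w_low=1.0):
    """c_new: Mx3 array whose rows are (α1, α2, σ); h0: feature matrix of the
    standard sequence built from healthy subjects in the same way."""
    a1, a2, sig = c_new[:, 0], c_new[:, 1], c_new[:, 2]
    w = np.where(sig > sigma0, w_high, w_low)        # larger weight when σ exceeds σ0
    mean1 = np.sum(w * a1) / np.sum(w)               # weighted mean of α1
    mean2 = np.sum(w * a2) / np.sum(w)               # weighted mean of α2
    var1 = np.mean((a1 - mean1) ** 2)                # variance of α1
    var2 = np.mean((a2 - mean2) ** 2)                # variance of α2
    h = np.array([[mean1, mean2], [var1, var2]])     # feature matrix H
    return float(np.linalg.norm(h - h0))             # S = ||H - H0||2, smaller = more similar
```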
Step eight: analyzing the frequency distribution of the audio to obtain the speaking definition D of the person to be tested;
according to the priori, the fact that the recognition sound of speaking can be masked due to the fact that the amplitude of the 3KHz position in the audio information is too high, namely, the mouth and teeth are unclear, lip sounds m, b and v are difficult to distinguish, and speaking definition D of a person to be detected can be obtained according to frequency distribution in the audio information.
The spectrum X[m] of each frame obtained in step six is analysed: when the magnitude peak lies at 3 kHz the current frame is marked 1, otherwise 0, converting the audio signal of the person under test into a binary sequence of the form 001110011...; the proportion of 1s in this sequence is counted and taken as the speech clarity D, i.e. D equals the number of frames marked 1 divided by the total number of frames.
The larger the D value, the less clear the speech and the higher the possibility that the person under test has facial paralysis. A sketch of this frame-marking computation follows.
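The sketch below marks each frame by whether its spectral peak lies near 3 kHz and averages the marks into D; the width of the band around 3 kHz is an assumption, since the text names only the 3 kHz position.

```python
import numpy as np

def speaking_clarity(frames, sr, band_hz=(2800.0, 3200.0)):
    """frames: iterable of 1-D audio frames sampled at sr Hz."""
    marks = []
    for frame in frames:
        frame = np.asarray(frame, dtype=float)
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        peak_hz = freqs[int(np.argmax(spectrum))]
        marks.append(1 if band_hz[0] <= peak_hz <= band_hz[1] else 0)
    return float(np.mean(marks))                     # D = proportion of frames marked 1
```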
Step nine: and (5) integrating the motion similarity index S and the speaking definition D to obtain a facial paralysis detection result.
Considering that facial paralysis patients have stiff movements, cannot control their facial muscles normally, and exhibit stuttering and speech disorders, the invention combines the mouth motion similarity index S and the speech clarity D into the facial paralysis index Pro = ln(SD + 1). Pro is compared with an empirical threshold Pro0, and when Pro > Pro0 the person under test is judged to be a facial paralysis patient. In this invention Pro0 is taken as 1.6, and the implementer may change this threshold according to actual operating conditions. A sketch of this decision rule follows.
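A minimal sketch of the final decision rule, using the threshold value named in the text, is:

```python
import math

def facial_paralysis_decision(s, d, pro0=1.6):
    """Combine motion similarity S and speech clarity D into Pro = ln(S*D + 1)
    and compare against the empirical threshold Pro0."""
    pro = math.log(s * d + 1.0)
    return pro, pro > pro0                           # True means flagged as facial paralysis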

Claims (8)

1. A facial paralysis detection method based on visual perception and audio information, characterized by comprising the following steps:
step one: an RGBD camera is arranged above a display screen to collect an RGB image and a depth image, the camera views the face head-on, and the audio of the person under test is collected;
step two: the RGB image is used to obtain the face-region RGB image I1 and depth map I2;
step three: a key-point detection network is applied to image I1 to obtain the 2D landmarks of the face, which are combined with the depth map I2 to obtain the 3D landmarks;
step four: specified text is displayed on the screen and the person under test is guided to read it;
step five: the mouth movement of the person while reading is collected, and characteristic values reflecting mouth movement are defined to obtain a mouth motion feature sequence;
step six: simultaneously with step five, audio is collected, framed and windowed, the Mel-frequency cepstral coefficients of each frame are extracted, and the mouth motion sequence is corrected;
step seven: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with a standard mouth motion feature sequence to obtain a motion similarity index S;
step eight: the frequency distribution of the audio is analysed to obtain the speech clarity D of the person under test;
step nine: the motion similarity index S and the speech clarity D are combined to obtain the facial paralysis detection result.
2. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step two comprises the following steps:
step 2.1: the collected RGB image is fed into a semantic segmentation network to obtain a Mask of the face region; multiplying the Mask with the collected RGB image yields the face-region image I1, and multiplying it with the depth image yields the face-region depth map I2;
step 2.2: the semantic segmentation network is trained on images collected by the camera; the training set is manually labeled, with face-region pixels labeled 1 and all other pixels labeled 0; the network is trained on the labeled data with a cross-entropy loss, continuously updating the model parameters, and the implementer may use a segmentation network such as U-Net or DeepLabv3+;
step 2.3: at inference time, the collected RGB image is fed directly into the segmentation network; the output Mask is a binary image in which face-region pixels have value 1 and all other pixels have value 0.
3. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step three comprises the following steps:
face key-point detection is performed on the image I1 to obtain its 2D landmarks, with an OpenFace or DAN network selected according to the actual situation; because the depth map I2 is pixel-aligned with the face RGB image I1, the depth value corresponding to each 2D landmark point is obtained directly, giving the 3D landmarks of the face.
4. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step five comprises the following steps:
the nose-tip key point is taken as the root node and denoted P0, the left and right mouth-corner key points are denoted P1 and P2, and the centre points of the upper and lower lips are denoted P3 and P4; with the root node P0 as the starting point, the remaining points are connected to obtain vectors in four directions; the change of the angle between P0P1 and P0P2 reflects the change of the mouth corners, and this angle is denoted α1; the difference of the two vector lengths reflects the degree of mouth skew and is denoted σ = | |P0P1| − |P0P2| |; the angle between P0P3 and P0P4 reflects the motion of the upper and lower lips and is denoted α2; the characteristic values of mouth movement are thus determined as {α1, α2, σ}; the three characteristic values are extracted at each sample, giving a sequence of shape [3, N], where N is the number of times motion information is acquired, written C = {c1, c2, c3, ..., cN}, where each c is a 3-row, 1-column vector containing the three mouth-movement characteristic values.
5. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein the MFCC acquisition in step six comprises the following steps:
a) pre-emphasis is applied to the collected audio signal; the pre-emphasis filter is H(z) = 1 − μz⁻¹, where z denotes the collected speech signal and μ is usually 0.97;
b) based on the short-time stationarity of the audio, a framing operation is applied: the period of the audio signal is determined with the short-time autocorrelation function, the length of the time segment for a single frame is determined from this period, and framing yields the audio signal of each period;
c) a fast Fourier transform is applied to the audio signal of each period to obtain the corresponding linear spectrum X[m] = H[m]E[m], which is processed with a Mel filter bank to obtain a nonlinear Mel spectrum reflecting human auditory characteristics; the base-10 logarithm of the magnitudes is taken on both sides, giving the Mel log-spectrum log||X[m]|| = log||H[m]|| + log||E[m]||, and the inverse Fourier transform is then applied to both sides, i.e. x[m] = h[m] + e[m]; the inverse Fourier transform is realised with the discrete cosine transform, and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients; the MFCC coefficients of each frame reflect the features of that frame of speech, and the corrected mouth motion feature sequence is obtained as Cnew = {c1, c2, c3, ..., cM}, where M is the number of acquisitions of corrected mouth motion information.
6. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step seven comprises the following steps:
step 1: feature extraction is performed on the corrected mouth motion sequence Cnew, the feature values are adjusted for speaking rate, and they are compared with the standard mouth motion feature sequence to obtain the motion similarity index S;
step 2: the feature values of the mouth motion sequence Cnew are extracted and used to construct a feature matrix H, as follows:
a) each value c in the sequence Cnew has three characteristic values α1, α2 and σ; the characteristic value σ reflects the degree of mouth-corner skew of the person under test: the larger σ is, the more skewed the mouth is and the higher the probability that the person is a facial paralysis patient, so the mouth motion information at that moment is treated as key information and should receive a larger weight; an empirical threshold σ0 is set, and when σi (the value of σ in the i-th acquisition) exceeds σ0, the mouth skew is considered abnormal and the corresponding motion information is assigned a larger weight wi;
b) using the assigned weights, the weighted means of the α1 and α2 characteristic values in the sequence Cnew are computed as feature values; taking ᾱ1 as an example, ᾱ1 = (Σi wi·α1(ci)) / (Σi wi), where α1(ci) denotes the α1 value in the i-th motion information, and ᾱ2 is computed analogously;
c) the variances of the α1 and α2 characteristic values in the sequence Cnew are computed as further feature values; taking α1 as an example, Var(α1) = (1/M)·Σi (α1(ci) − ᾱ1)², and Var(α2) is computed analogously;
d) the feature matrix H of the mouth motion sequence is thus obtained from the weighted means ᾱ1, ᾱ2 and the variances Var(α1), Var(α2);
step 3: the feature matrix H of the collected mouth motion sequence is compared with the feature matrix H0 of a standard mouth motion sequence obtained in advance to give a similarity value for the motion features; the standard sequence is obtained by collecting mouth sequences from multiple healthy persons in advance and computing H0 in the same way; the similarity is computed as S = ||H − H0||2, and a smaller S value indicates greater similarity and a lower likelihood that the person under test has facial paralysis.
7. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step eight comprises the following steps: the speech clarity D of the person under test is obtained from the frequency distribution of the audio; the spectrum X[m] of each frame obtained in step six is analysed, the current frame is marked 1 when the magnitude peak lies at 3 kHz and 0 otherwise, so the audio signal of the person under test is converted into a binary sequence; the proportion of 1s in the sequence is counted and taken as the speech clarity D, i.e. D equals the number of frames marked 1 divided by the total number of frames; a larger D value indicates less clear speech and a higher possibility that the person under test has facial paralysis.
8. The facial paralysis detection method based on visual perception and audio information as recited in claim 1, wherein step nine comprises the following steps: the mouth motion similarity index S and the speech clarity D are combined into a facial paralysis index Pro = ln(SD + 1); Pro is compared with an empirical threshold Pro0, and when Pro > Pro0 the person under test is judged to be a facial paralysis patient; Pro0 is taken as 1.6, and the implementer may change this threshold according to actual operating conditions.
CN202011340158.5A 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information Withdrawn CN112308037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011340158.5A CN112308037A (en) 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011340158.5A CN112308037A (en) 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information

Publications (1)

Publication Number Publication Date
CN112308037A true CN112308037A (en) 2021-02-02

Family

ID=74335513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011340158.5A Withdrawn CN112308037A (en) 2020-11-25 2020-11-25 Facial paralysis detection method based on visual perception and audio information

Country Status (1)

Country Link
CN (1) CN112308037A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688701A (en) * 2021-08-10 2021-11-23 江苏仁和医疗器械有限公司 Facial paralysis detection method and system based on computer vision
CN113688701B (en) * 2021-08-10 2022-04-22 江苏仁和医疗器械有限公司 Facial paralysis detection method and system based on computer vision
CN114419716A (en) * 2022-01-26 2022-04-29 北方工业大学 Calibration method for face key point calibration of face image
CN114419716B (en) * 2022-01-26 2024-03-15 北方工业大学 Calibration method for face image face key point calibration
CN117577140A (en) * 2024-01-16 2024-02-20 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children
CN117577140B (en) * 2024-01-16 2024-03-19 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children

Similar Documents

Publication Publication Date Title
CN112308037A (en) Facial paralysis detection method based on visual perception and audio information
US11786171B2 (en) Method and system for articulation evaluation by fusing acoustic features and articulatory movement features
CN109846469B (en) Non-contact heart rate measurement method based on convolutional neural network
CN110969124A (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN101199208A (en) Method, system, and program product for measuring audio video synchronization
CN106491117A (en) A kind of signal processing method and device based on PPG heart rate measurement technology
CN112037788B (en) Voice correction fusion method
CN114241599A (en) Depression tendency evaluation system and method based on multi-modal characteristics
Ding et al. Deep connected attention (DCA) ResNet for robust voice pathology detection and classification
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
CN115101191A (en) Parkinson disease diagnosis system
CN115910097A (en) Audible signal identification method and system for latent fault of high-voltage circuit breaker
Douros et al. Towards a method of dynamic vocal tract shapes generation by combining static 3D and dynamic 2D MRI speech data
CN112716468A (en) Non-contact heart rate measuring method and device based on three-dimensional convolution network
Freitas et al. Velum movement detection based on surface electromyography for speech interface
Lee et al. An exploratory study of emotional speech production using functional data analysis techniques
CN113963427B (en) Method and system for rapid in-vivo detection
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
EP1489597A2 (en) Voice detection device
CN112507877A (en) System and method for detecting heart rate under condition of partial video information loss
CN116866783B (en) Intelligent classroom audio control system, method and storage medium
CN111260602B (en) Ultrasonic image analysis method for SSI
CN117475360B (en) Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
Rothkrantz et al. Comparison between different feature extraction techniques in lipreading applications
CN117558035B (en) Figure identity recognition system and method based on image technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210202