CN115985310A - Dysarthria voice recognition method based on multi-stage audio-visual fusion - Google Patents

Dysarthria voice recognition method based on multi-stage audio-visual fusion

Info

Publication number
CN115985310A
CN115985310A (Application CN202211536927.8A)
Authority
CN
China
Prior art keywords
fusion
visual
image
features
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211536927.8A
Other languages
Chinese (zh)
Inventor
钱兆鹏
苏小苏
于重重
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202211536927.8A priority Critical patent/CN115985310A/en
Publication of CN115985310A publication Critical patent/CN115985310A/en
Pending legal-status Critical Current


Abstract

The invention discloses a dysarthria speech recognition method based on multi-level audio-visual fusion, which introduces visual information as additional features through a two-level fusion design. In the first-level fusion, visual fusion coding is performed on each facial speech function region based on the motion visual signals of those regions; in the second-level fusion, the visual fusion coding is combined with the acoustic features to form audio-visual-fused dysarthric speech recognition, making the method better suited to dysarthric speech. The method can reduce the cost of dysarthric speech recognition and improve its accuracy.

Description

Dysarthria voice recognition method based on multi-stage audio-visual fusion
Technical Field
The invention relates to the technical field of dysarthric speech recognition, and in particular to a dysarthria speech recognition method based on multi-stage audio-visual fusion, which can be applied to recognizing the speech of people with dysarthria.
Background
Dysarthria refers to abnormalities of respiration, laryngeal phonation, resonance, articulation and prosody caused by paralysis, weakened contractility, or inaccurate and uncoordinated movement of the muscles involved in articulation, resulting from organic lesions of the nerves and muscles. Dysarthria leads to inaccurate pronunciation, slow speech, and low volume and clarity, so people with dysarthria find it difficult to communicate with others through speech; communication efficiency is low, causing considerable trouble and inconvenience in daily life.
Automatic speech recognition can significantly improve communication efficiency, and a great deal of research has therefore been conducted on automatic recognition of dysarthric speech. However, because dysarthric speech is difficult to collect, the sample sizes of dysarthric speech data sets are small and machine learning models in automatic speech recognition systems cannot be trained sufficiently. To address the scarcity of dysarthric data samples, dysarthric speech can be generated from healthy speech. Such methods supplement dysarthric data to some extent, and the generated speech is acoustically and perceptually similar to real dysarthric speech, but they are not sufficient to effectively improve model generalization; moreover, the generation rules depend on domain knowledge, which makes them difficult to reuse across data sets.
In recent years, audio-visual fusion methods have been applied to speech recognition. According to the McGurk effect, human speech perception is influenced by vision, so an audio-visual fusion model can provide effective visual information for an automatic speech recognition task and improve recognition accuracy. Human speech is the result of the coordinated movement of the articulators, among which the tongue, lips, teeth and nose contribute most. Articulator movement data have also been applied to dysarthric automatic speech recognition with good results. However, collecting articulator movement data from dysarthric speakers with sensors is expensive, and existing audio-visual fusion methods still suffer from small data volumes, so the trained models generalize poorly and the efficiency and accuracy of dysarthric speech recognition remain low.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a dysarthric speech recognition method based on multi-stage audio-visual fusion, which can reduce the cost of dysarthric speech recognition and improve its accuracy.
The invention introduces visual information as additional features through a two-level fusion architecture. In the first-level fusion, the motion visual signals of the facial speech function regions of the dysarthric speaker are used as cues, and visual fusion coding is performed on each speech function region; unlike traditional audio-visual fusion systems that rely only on lip movement as the cue, facial visual signals are collected with a camera instead of articulator movement data collected by sensors, which reduces cost. In the second-level fusion, the visual fusion coding is combined with the acoustic features to form audio-visual-fused dysarthric speech recognition, making the method better suited to dysarthric speech.
The technical scheme provided by the invention is as follows:
a dysarthria voice recognition method based on multi-level audio-visual fusion comprises the following steps:
S1, obtaining audio-visual data, comprising: facial motion video captured while the dysarthric speaker pronounces, and speech data synchronized with the video;
S2, framing the facial motion video to obtain images of the dysarthric speaker pronouncing, and then defining and dividing facial speech function regions based on the images. Next, primary visual fusion coding is performed on the facial speech function regions: the facial motion video of the dysarthric speaker (the motion visual signals of the facial speech function regions) is used as the cue, and visual fusion coding (i.e., first-level fusion coding) is performed on each facial speech function region;
In a specific implementation, a primary visual fusion coding module is constructed, comprising:
S2.1 Feature extraction module
Used to extract different features of the source image.
S2.2 Feature fusion module
The extracted image features are cascaded (concatenated) to obtain the fused features.
S2.3 Image reconstruction module
An image is reconstructed from the fused features; the extracted texture detail information is fused into the extracted spatial information using dense connections, finally yielding the fused visual feature, i.e., the visual fusion image.
S3, extracting and aligning audio-visual features, where the audio-visual features comprise the visual fusion image features of the facial speech function regions and the acoustic features of the dysarthric speech during pronunciation; the aligned audio-visual features are obtained;
S3.1, extracting visual fusion image features of the dysarthric speaker during pronunciation;
in a specific implementation of the invention, a ResNet-18 network is used to extract the histogram of oriented gradients features of the image;
S3.2, extracting acoustic features of the dysarthric speech, for which Mel spectrogram parameter vectors can be adopted;
S3.3, aligning the acoustic features of the dysarthric speech with the visual fusion image features during pronunciation, with the pronunciation phonemes corresponding simultaneously to a pronunciation video segment and a speech segment, to obtain the aligned audio-visual features.
S4, performing dysarthric speech recognition through second-level audio-visual fusion to obtain a string of phoneme characters; the method comprises the following steps:
S4.1, fusing the aligned audio-visual features, namely the visual fusion image features and the acoustic features of the dysarthric speech during pronunciation, to obtain a fused feature parameter matrix of speech and video. According to the obtained fused feature parameter matrix, a mapping from the audio-visual fusion features to phoneme characters is obtained by training a deep temporal neural network mapping model.
In one embodiment, the invention employs Transformer-CTC and Transformer-S2S deep temporal neural networks, i.e., a sequence-to-sequence (S2S) Transformer model and a Transformer model with connectionist temporal classification (Transformer-CTC, TM-CTC). In the training phase, a linear combination of the connectionist temporal classification (CTC) and sequence-to-sequence (S2S) objectives is selected as the objective function:
$L = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x)$    (1)
where $x = (x_1, \ldots, x_T)$ is the input audio-visual fusion feature parameter matrix, $y = (y_1, \ldots, y_L)$ is the output phoneme character sequence, $p_{ctc}(y|x)$ is the conditional probability of the CTC model, $p_{s2s}(y|x)$ is the conditional probability of the S2S model, $L$ is the combined objective function of CTC and S2S, and $\alpha$ is the relative weight of CTC and S2S; $T$ and $L$ are the lengths of $x$ and $y$, respectively.
S4.2, decoding the obtained phoneme characters.
In the decoding process, an RNN language model $p_{LM}(y)$ is used:
$\log p^{*}(y|x) = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x) + \theta \log p_{LM}(y)$    (2)
where $\theta$ is a parameter that controls the contribution of the language model.
Through the steps, the dysarthric voice recognition based on the multi-stage audio-visual fusion is realized.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method for making vision fusion coding on the face speech function area of a dysarthric speaker by taking the movement vision signal of the face speech function area as a clue and using a CNN network, which is different from the traditional method which simply depends on lip movement as a clue of an audio-visual fusion system; the two-stage fusion architecture is designed, firstly, the motion visual information of the facial speech functional area is subjected to first-stage fusion, and then, the second-stage fusion is carried out on the visual information and the auditory information by using the Transformer-CTC and Transformer-S2S architectures, so that the whole audio-visual fusion architecture can capture sufficient visual information and auditory information, and the audio-visual fusion architecture is more suitable for dysarthria voice.
Drawings
Fig. 1 is a flow chart of a dysarthric speech recognition method based on multi-level audio-visual fusion according to an embodiment of the present invention.
Fig. 2 is a flow chart of a primary visual fusion encoding method for a facial speech function region in an embodiment of the present invention.
FIG. 3 is a block diagram of a method for detecting and extracting the multi-part facial speech function regions using the face detector in the dlib library in accordance with one embodiment of the invention.
FIG. 4 is a diagram of the locations of the six muscles most closely associated with speech in accordance with an embodiment of the present invention.
FIG. 5 is a block diagram of a process flow of a method for performing primary visual fusion coding on each facial speech function region in accordance with an embodiment of the present invention.
FIG. 6 is a flow chart of a method for audio-visual two-level fusion dysarthria speech recognition according to an embodiment of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
The invention provides a dysarthria voice recognition method based on multi-level audio-visual fusion, which comprises the following steps:
step S1, obtaining audio-visual data, wherein the audio-visual data comprises: the face movement video and the voice data synchronous with the video are shot when the dysarthria pronounces;
s2, performing primary visual fusion coding on the face speech function area, taking a movement visual signal of the face speech function area of the dysarthric speaker as a clue, and performing visual fusion coding on each face speech function area; the method comprises the following steps:
S2.1 Defining the facial speech function regions
From an anatomical perspective, the levator labii superioris, orbicularis oris superior, depressor labii inferioris, depressor anguli oris and mentalis contribute most to speech function. Therefore, five partial regions, namely the lips, chin, left and right palate regions, and the nose, are selected as the facial speech function regions for visual fusion coding.
S2.2 Cropping the multiple facial speech function regions
The captured video is framed and the source images (facial images) are obtained frame by frame. For the multiple facial speech function regions defined in S2.1, in one embodiment the 5 region images are detected and extracted using a face detector in a machine learning library (the dlib library).
S2.3 Performing primary visual fusion coding on each facial speech function region
The multiple facial speech function regions cropped in S2.2 are fused at the image level; pixel prediction for the image must consider not only low-frequency information in the image, such as texture details and color, but also high-frequency information, such as spatial information. In a specific implementation, a primary visual fusion coding module is constructed, comprising:
S2.3.1 Constructing a feature extraction module, with image features extracted by a CNN;
it is used to extract different features of the source image, such as texture details, color and spatial information.
S2.3.2 Feature fusion module
The extracted image features are cascaded (concatenated) to obtain the fused features.
S2.3.3 Image reconstruction module
An image is reconstructed from the fused features; the extracted texture detail information is fused into the extracted spatial information using dense connections, finally yielding the fused visual feature, i.e., the visual fusion image.
S3, extracting and aligning audio-visual features, including the visual fusion image features of the facial speech function regions obtained in step S2 and the acoustic features of the dysarthric speech during pronunciation;
S3.1, extracting the visual fusion images to obtain the visual fusion image features of the dysarthric speaker during pronunciation;
In a specific implementation of the invention, the visual fusion image is obtained according to step S2 and image features are then extracted; in one embodiment, the histogram of oriented gradients features of the image are extracted using a ResNet-18 network;
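As an illustration of this step, the following minimal sketch passes a fused region image through a torchvision ResNet-18 to obtain a feature vector; the 224×224 resizing, the normalization constants and the use of the pretrained torchvision weights are assumptions made for the example, not details specified by the patent.

```python
# Sketch only: extracting a visual feature vector from a fused region image
# with a torchvision ResNet-18 (classifier head removed).
import torch
import torchvision.models as models
import torchvision.transforms as T

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()                 # keep the 512-d pooled features
resnet.eval()

preprocess = T.Compose([
    T.ToTensor(),                               # HxWx3 uint8 -> 3xHxW float in [0, 1]
    T.Resize((224, 224)),                       # assumed input size
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fused_image_features(fused_image):
    """fused_image: HxWx3 uint8 array of the visually fused region image."""
    x = preprocess(fused_image).unsqueeze(0)    # add batch dimension
    return resnet(x).squeeze(0)                 # 512-dimensional visual feature vector
```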
S3.2, extracting acoustic features of the dysarthric speech, for which Mel spectrogram parameter vectors can be adopted;
In a specific implementation of the invention, the acoustic features are Mel spectrogram parameters, calculated as follows:
the speech waveform signal is transformed with a short-time Fourier transform and then passed through a Mel filter bank to obtain the Mel spectrogram parameters;
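As a hedged illustration of this computation, the sketch below uses librosa for the short-time Fourier transform and Mel filter bank steps. The 16 kHz sample rate, 25 ms analysis window and 512-point FFT are assumptions for the example; only the 10 ms step and the 39-dimensional Mel parameters stated in the embodiment are taken from the text.

```python
# Sketch only: Mel spectrogram parameters via STFT + Mel filter bank.
import librosa
import numpy as np

def mel_spectrogram_features(wav_path: str, n_mels: int = 39) -> np.ndarray:
    """Return a (num_frames, n_mels) matrix of log-Mel parameters."""
    y, sr = librosa.load(wav_path, sr=16000)    # assumed 16 kHz sampling rate
    hop = int(0.010 * sr)                       # 10 ms frame step (as in the embodiment)
    win = int(0.025 * sr)                       # 25 ms analysis window (assumed)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=hop, win_length=win, n_mels=n_mels)
    return librosa.power_to_db(mel).T           # frames x n_mels
```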
S3.3, aligning the acoustic features of the dysarthric speech with the visual fusion image features during pronunciation, where each pronunciation phoneme corresponds simultaneously to a pronunciation video segment and a speech segment.
Step S4, performing dysarthric speech recognition through audio-visual two-level fusion to obtain a string of phoneme characters; the method comprises the following steps:
The aligned audio-visual features, namely the visual fusion image features and the acoustic features of the dysarthric speech during pronunciation, are fused to obtain a fused feature parameter matrix of speech and video. According to the obtained fused feature parameter matrix, a mapping from the audio-visual fusion features to phoneme characters is obtained by training a deep temporal neural network mapping model. In one embodiment, the invention employs Transformer-CTC and Transformer-S2S deep temporal neural networks. In the training phase, a linear combination of the connectionist temporal classification (CTC) and sequence-to-sequence (S2S) objectives is selected as the objective function:
$L = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x)$    (1)
where $x = (x_1, \ldots, x_T)$ is the input audio-visual fusion feature parameter matrix, $y = (y_1, \ldots, y_L)$ is the output phoneme character sequence, $p_{ctc}(y|x)$ and $p_{s2s}(y|x)$ are the conditional probabilities of the CTC and S2S models respectively (the vertical bar denotes conditioning), $L$ is the combined objective function of CTC and S2S, and $\alpha$ is the relative weight of CTC and S2S.
S4.2, decoding the recognized phonemes to obtain the dysarthric speech recognition result.
In the decoding process, an RNN language model $p_{LM}(y)$ is used:
$\log p^{*}(y|x) = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x) + \theta \log p_{LM}(y)$    (2)
where $\theta$ is a parameter that controls the contribution of the language model.
Through the steps, the dysarthria voice recognition method based on the multi-level audio-visual fusion is realized.
As shown in fig. 1, in an embodiment of the present invention, a dysarthric speech recognition method based on multi-stage audiovisual fusion includes four main steps:
step S1, obtaining audio-visual data, including a face movement video and voice data synchronous with the video when the dysarthria speaks. The dysarthria data set uaspech contains 102.7 hour voice recordings of 29 speakers, and the pronunciation texts are all isolated words including numbers, computer instructions, radio letters, common and uncommon words. Of the 29 speakers, 16 were dysarthric, the remaining 13 were healthy control groups, and only 8 of the 16 dysarthric had audiovisual data at the same time. The data set was collected in a laboratory setting and the subject read the words from a computer display. A7-channel microphone is used for recording audio data, an 8-channel microphone is used for recording dual-tone multi-frequency signals, and a digital camera is used for recording video data. The speech intelligibility score of UASpeech was calculated from the average score of 5 native speakers in the hearing test. Intelligibility scores were between 2% and 95%. Dysarthria patients were divided into four groups based on speech intelligibility scores, i.e. 0-25% for the very low group, 25-50% for the low group, 50-75% for the medium group, 75-100% for the high group, corresponding dysarthria levels were: extreme severe (very low), severe (low), moderate (mil), mild (high).
Step S2, performing primary fusion of the facial speech function regions: five facial speech function regions are divided according to physiological knowledge; in one embodiment the five region images are detected and extracted using the face detector in the dlib library, and visual fusion is finally performed on each speech function region using a CNN.
Step S3, extracting and aligning the audio-visual features, including the visual fusion image features of the facial speech function regions and the acoustic features of the dysarthric speech during pronunciation, and aligning them. In one embodiment, 39-dimensional Mel spectrogram feature parameters are extracted from the raw waveform with a 10-millisecond step, and images are extracted from the video at a sampling rate of 25 Hz. Four consecutive acoustic feature frames are then concatenated into one frame before being input to the model, so that the acoustic and visual feature inputs have the same length.
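A minimal sketch of this frame-level alignment is given below, assuming the acoustic features arrive at 100 frames per second (10 ms step) and the visual features at 25 frames per second; the function and variable names are illustrative only.

```python
# Sketch only: stacking 4 consecutive 10-ms acoustic frames so that each
# stacked acoustic vector corresponds to one 25-Hz video frame.
import numpy as np

def align_audio_visual(acoustic: np.ndarray, visual: np.ndarray):
    """acoustic: (Ta, Da) at 100 fps; visual: (Tv, Dv) at 25 fps."""
    n = min(acoustic.shape[0] // 4, visual.shape[0])    # usable aligned frames
    stacked = acoustic[: n * 4].reshape(n, 4 * acoustic.shape[1])
    return stacked, visual[:n]                          # both now have length n
```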
Step S4, audio-visual second-level fusion: the extracted audio-visual features, including the visual fusion image features and the acoustic features of the dysarthric speech during pronunciation, are fused, and the output is decoded.
Fig. 2 shows the flow of the primary visual fusion coding method for the facial speech function regions (step S2), including:
S2.1 defining the facial speech function regions
S2.2 cropping the multiple facial speech function regions
S2.3 performing primary visual fusion coding on each facial speech function region
Fig. 3 shows the detection and extraction of the multiple facial speech function regions using the face detector in the dlib library; the flow comprises dlib face detection, key point detection, face alignment, and extraction of the facial speech function regions.
dlib face detection: the face in each video frame is detected using the detector dlib.get_frontal_face_detector().
dlib key point detection: facial key points are detected based on the 68-point facial landmark predictor shape_predictor_68_face_landmarks.dat of the dlib library.
dlib face alignment: the face alignment operation allows the subsequent model to extract features that are independent of the positions of the facial features and related only to their shape and texture.
dlib cropping of the 5 facial speech function regions: based on the key point positions, the image is cropped into the required five partial regions: the lips, chin, left and right palate regions, and the nose.
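The following sketch illustrates the dlib pipeline above (detection, 68-point landmarks, region cropping). The landmark index ranges chosen for each region and the OpenCV-based cropping are assumptions made for the example, since the patent does not list the exact indices.

```python
# Sketch only: detect the face, locate the 68 landmarks, and crop region patches.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_regions(frame: np.ndarray) -> dict:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return {}
    pts = np.array([[p.x, p.y] for p in predictor(gray, faces[0]).parts()])
    # Hypothetical landmark groups; the remaining regions would be defined similarly.
    groups = {"lips": pts[48:68], "chin": pts[5:12], "nose": pts[27:36]}
    crops = {}
    for name, g in groups.items():
        x, y, w, h = cv2.boundingRect(g.astype(np.int32))
        crops[name] = frame[y:y + h, x:x + w]
    return crops
```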
FIG. 4 is a diagram of the locations of the six muscles most closely associated with speech in an embodiment of the present invention. From an anatomical perspective, the levator labii superioris, orbicularis oris superior, orbicularis oris inferior, depressor labii inferioris, depressor anguli oris and mentalis contribute most to speech function. LLS: levator labii superioris; OOS: orbicularis oris superior; OOI: orbicularis oris inferior; DLI: depressor labii inferioris; DAO: depressor anguli oris; M: mentalis.
According to the invention, a visual fusion convolutional neural network model is constructed to perform primary visual fusion coding on each facial speech function region. Fig. 5 shows the flow of the primary visual fusion coding method for each facial speech function region. In Fig. 5, $I_A$ and $I_B$ denote source images and $I_F$ denotes the fused image. The visual fusion neural network model constructed in the invention comprises a series of convolutional layers, where the operation of a convolutional layer is defined as $Y_i = F_i(X_i)$, with $X_i$ and $Y_i$ respectively representing the input facial speech function region image features and the output fused image features of the $i$-th convolutional layer, $F_i$ representing the convolution operation of the $i$-th layer, and $i = 1, \ldots, k$. The visual fusion convolutional neural network model is defined as:
$Y_i = F_k \odot F_{k-1} \cdots \odot F_2 \odot F_1(I) = \odot_{i=1 \ldots k} F_i(I)$    (3)
where $\odot$ denotes the convolution operation and $I$ is a source image. The convolutional layers of the fusion convolutional neural network model use 3×3 and 1×1 convolution kernels with a stride of 1. The network model uses no fully connected layer, so the input image can be of any size. The last convolutional layer is activated by the Tanh function, and all other convolutional layers use the ReLU activation function. The method for constructing the visual fusion convolutional neural network model and performing primary visual fusion coding comprises the following steps:
S2.3.1, constructing a feature extraction module for extracting different features of the source images;
The feature extraction module contains one branch per source image; that is, when the number of source images is k, the feature extraction module consists of k branches. Each feature extraction branch module takes one source image $I_k$ as input and produces the corresponding branch output features, where k is the number of source images and i is the convolutional-layer index. Each branch contains three convolutional layers that extract texture detail, color and spatial information features of its source image, and the k feature extraction branch modules produce k corresponding output image features.
S2.3.2, constructing a feature fusion module for fusing the different features extracted from the source images by the feature extraction module;
In the feature fusion module, the outputs of the k feature extraction branch modules (k ∈ [1, k], i ∈ [1, 3]) are cascaded (concatenated) to obtain the fused image feature $c_1$, which serves as the input of the image reconstruction module; the concatenation operation (Concat) performs the cascade fusion.
S2.3.3, constructing an image reconstruction module for reconstructing the fused image;
The image reconstruction part of the visual fusion convolutional neural network model comprises 8 convolutional layers, where i is the convolutional-layer index, $c_1$ is the input of the image reconstruction module, and $c_2$, $c_3$ are intermediate features of the image reconstruction module. The first and third convolutional layers of the feature extraction module are connected by dense connections to the third and fifth convolutional layers of the image reconstruction module, respectively, yielding the reconstructed fused image. In this way, information from different feature layers is fully utilized and a good fusion effect is obtained.
The fused visual feature, the Fusion Image, is finally obtained.
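A hedged PyTorch sketch of such a fusion network is given below: k feature extraction branches of three convolutional layers each, concatenation of the branch outputs into $c_1$, and an 8-layer reconstruction head with dense connections from branch layers 1 and 3 into reconstruction layers 3 and 5, with ReLU activations everywhere except a final Tanh. The channel widths, single-channel inputs, and the assumption that all region crops are resized to a common size are illustrative choices, not details fixed by the patent.

```python
# Sketch only: k-branch visual fusion CNN with dense skip connections.
import torch
import torch.nn as nn

def conv(in_c, out_c, k=3):
    return nn.Sequential(nn.Conv2d(in_c, out_c, k, stride=1, padding=k // 2),
                         nn.ReLU(inplace=True))

class FusionCNN(nn.Module):
    def __init__(self, num_sources=5, ch=16):
        super().__init__()
        # One 3-layer branch per source image (grayscale region crops assumed).
        self.branches = nn.ModuleList([
            nn.ModuleList([conv(1, ch), conv(ch, ch), conv(ch, ch)])
            for _ in range(num_sources)])
        c1 = num_sources * ch                       # concatenated branch outputs
        self.rec = nn.ModuleList([
            conv(c1, ch), conv(ch, ch),
            conv(ch + num_sources * ch, ch),        # dense link from branch layer 1
            conv(ch, ch),
            conv(ch + num_sources * ch, ch),        # dense link from branch layer 3
            conv(ch, ch), conv(ch, ch)])
        self.out = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Tanh())  # 8th layer: Tanh

    def forward(self, sources):                     # list of (B,1,H,W) region images,
        feats1, feats3 = [], []                     # all resized to a common H x W
        for img, layers in zip(sources, self.branches):
            f1 = layers[0](img)
            f3 = layers[2](layers[1](f1))
            feats1.append(f1)
            feats3.append(f3)
        x = torch.cat(feats3, dim=1)                # c1: cascade fusion of branch outputs
        skip1, skip3 = torch.cat(feats1, dim=1), torch.cat(feats3, dim=1)
        for i, layer in enumerate(self.rec):
            if i == 2:
                x = layer(torch.cat([x, skip1], dim=1))
            elif i == 4:
                x = layer(torch.cat([x, skip3], dim=1))
            else:
                x = layer(x)
        return self.out(x)                          # fused visual image
```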
Fig. 6 shows a flow of the audio-visual two-stage fusion dysarthria speech recognition method, which includes:
S41, extracting the audio-visual features and aligning the image and acoustic features, i.e., aligning the visual fusion image features of the facial speech function regions with the acoustic features of the dysarthric speech during pronunciation. For video, the histogram of oriented gradients features of the fused facial speech function region images are obtained; for audio, Mel spectrogram parameters are used as the acoustic features: the speech time-domain waveform is transformed with a short-time Fourier transform to obtain time-frequency spectrogram parameters, which are then passed through a Mel filter bank to obtain 39-dimensional Mel spectrogram parameters. The invention aligns the acoustic and visual features at the frame level: 26-dimensional log filter bank energy features are extracted from the raw waveform with a 10-millisecond step, and images are extracted from the video at a sampling rate of 25 Hz. Four consecutive acoustic feature frames are then concatenated into one frame before being input to the model, so that the acoustic and visual feature inputs have the same length.
S42, audio-visual second-level fusion. In one embodiment, the fusion method is concatenation: the visual fusion image feature parameter vector and the speech acoustic feature parameter vector are concatenated frame by frame to obtain a fused feature parameter matrix of speech and video; in this matrix, the lower dimensions hold the speech acoustic feature parameter vector and the higher dimensions hold the visual fusion image feature parameter vector.
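A minimal sketch of this frame-by-frame concatenation, with illustrative function and variable names, is shown below.

```python
# Sketch only: concatenating acoustic (low dimensions) and visual fusion image
# (high dimensions) feature vectors into one fused parameter matrix.
import numpy as np

def fuse_features(acoustic: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """acoustic: (T, Da); visual: (T, Dv) -> fused: (T, Da + Dv)."""
    assert acoustic.shape[0] == visual.shape[0], "features must be frame-aligned"
    return np.concatenate([acoustic, visual], axis=1)
```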
S5, recognizing the dysarthric speech. In one embodiment, the deep temporal neural network model adopted by the speech recognition framework is the Transformer-CTC and Transformer-S2S architecture; the model input is the fused feature parameter matrix and the model output is the recognized string of phoneme characters.
In the training phase, a linear combination of the connectionist temporal classification (CTC) and sequence-to-sequence (S2S) objectives is selected as the objective function:
$L = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x)$    (1)
where $x = (x_1, \ldots, x_T)$ is the input audio-visual fusion feature parameter matrix, $y = (y_1, \ldots, y_L)$ is the output phoneme character sequence, $p_{ctc}(y|x)$ is the conditional probability of the CTC model, $p_{s2s}(y|x)$ is the conditional probability of the S2S model, $L$ is the combined objective function of CTC and S2S, and $\alpha$ is the relative weight of CTC and S2S.
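For illustration, the combined objective of equation (1) can be realized in training as a weighted sum of a CTC loss and a sequence-to-sequence cross-entropy loss, as in the hedged PyTorch sketch below; the tensor shapes, the blank/padding index and the model producing the two output branches are assumptions, not specifics from the patent.

```python
# Sketch only: joint CTC + S2S training loss (negative of the objective in eq. (1)).
import torch
import torch.nn.functional as F

def joint_loss(ctc_log_probs, s2s_logits, targets, input_lens, target_lens, alpha=0.5):
    # ctc_log_probs: (T, B, V) log-softmax outputs of the CTC branch
    # s2s_logits:    (B, L, V) decoder outputs of the S2S branch
    # targets:       (B, L) phoneme indices, with 0 assumed as blank/padding
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    s2s = F.cross_entropy(s2s_logits.transpose(1, 2), targets, ignore_index=0)
    return alpha * ctc + (1 - alpha) * s2s
```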
S6, decoding the data (the recognized string of phoneme characters) using a language model (e.g., a 4-gram language model).
The data are decoded with a 4-gram language model trained on the text data of the LRS3 training set, in three modes: audio only (auditory), video only (visual), and audio-visual; the perplexity of the 4-gram language model is 110.5. For CTC, the invention tunes the beam width over {5, 10, 20, 50, 100, 150}, the language model weight over {0, 1, 2, 4, 8}, and the word insertion penalty over {±4, ±2, 0}.
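As an illustration of equation (2), the sketch below combines the CTC, S2S and language model log-probabilities of a hypothesis and ranks the beam accordingly; the scoring functions that produce these log-probabilities are assumed to exist elsewhere, and only the score fusion is shown.

```python
# Sketch only: fusing CTC, S2S and language-model scores during decoding.
def combined_score(log_p_ctc: float, log_p_s2s: float, log_p_lm: float,
                   alpha: float = 0.5, theta: float = 1.0) -> float:
    """log p*(y|x) = alpha*log p_ctc + (1-alpha)*log p_s2s + theta*log p_lm."""
    return alpha * log_p_ctc + (1 - alpha) * log_p_s2s + theta * log_p_lm

def rank_hypotheses(hyps):
    """hyps: list of (phoneme_seq, log_p_ctc, log_p_s2s, log_p_lm) tuples."""
    return sorted(hyps, key=lambda h: combined_score(*h[1:]), reverse=True)
```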
Table 1 shows the experimental results of the invention: dysarthric speech recognition experiments based on multi-level audio-visual fusion were performed on the UASpeech data set and compared with a DNN baseline (S. Liu et al., "Recent Progress in the CUHK Dysarthric Speech Recognition System," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2267-2281, 2021). Systems 1 and 2 are experimental results of the method of the invention; the visual input of System 1 is the visual fusion of the facial speech function regions, and the visual input of System 2 is lip movement.
TABLE 1 Audio-visual fusion dysarthria Speech recognition test results
The experimental results show that audio-visual fusion achieves higher accuracy than single-modality recognition. For mild dysarthria, the WER of System 1 in the audio-visual modality is 4.58% lower than the DNN, and the WER of System 2 in the audio-visual modality is 0.84% lower than the DNN. Similarly, the WER of System 1 is 1.63% lower than the DNN in the moderate case and 2.31% lower in the severe case, while System 2 is 0.63% lower than the DNN in the severe case. In the very severe case, the WER of System 1 is 0.42% lower than the DNN, while the WER of System 2 is 1.18% lower than the DNN.
Second, comparing Systems 1 and 2 shows that region fusion can significantly reduce the WER. In the audio-only modality, the WER of System 1 is 1.84 percentage points lower than System 2 in the mild case, 2.51 percentage points lower in the moderate case, and 5.8 percentage points lower in the severe case; in the very severe case, however, the WER of System 1 is 0.47 percentage points higher than System 2. Very severe dysarthria is often accompanied by severe illness, and the head moves considerably during speech, making it difficult to capture the facial region. With audio-visual fusion, the WER of System 1 is lower than System 2 by 3.74, 4.00 and 1.68 percentage points in the mild, moderate and severe cases respectively, while in the very severe case the WER of System 1 is 1.76 percentage points higher than System 2.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the invention and scope of the appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (10)

1. A dysarthria voice recognition method based on multi-stage audio-visual fusion comprises the following steps:
S1, obtaining audio-visual data, comprising: facial motion video and speech data synchronized with the video, captured while the dysarthric speaker pronounces;
S2, constructing a visual fusion convolutional neural network model to perform primary visual fusion coding, performing visual fusion coding on each facial speech function region according to the facial motion video and the facial speech function regions of the dysarthric speaker; the method comprises the following steps:
S2.1, defining a plurality of facial speech function regions;
S2.2, framing the collected facial motion video and obtaining facial images frame by frame; cropping the multiple facial speech function regions; specifically, a face detector in the dlib library is used to detect and extract the multiple facial speech function region images;
S2.3, constructing a primary visual fusion coding module and performing primary visual fusion coding on each facial speech function region;
the constructed visual fusion convolutional neural network model comprises a series of convolutional layers, where the operation of a convolutional layer is defined as $Y_i = F_i(X_i)$, with $X_i$ and $Y_i$ respectively representing the input facial speech function region image features and the output fused image features of the $i$-th convolutional layer; the visual fusion convolutional neural network model is defined as:
$Y_i = F_k \odot F_{k-1} \cdots \odot F_2 \odot F_1(I) = \odot_{i=1 \ldots k} F_i(I)$
where $\odot$ denotes the convolution operation, $I$ is a source image, $F_i$ represents the convolution operation of the $i$-th convolutional layer, and $i = 1, \ldots, k$;
the visual fusion convolutional neural network model uses multiple convolutional layers and no fully connected layer, so the input image can be of any size; the last convolutional layer is activated by a Tanh function, and all other convolutional layers use ReLU activation functions;
performing image fusion on the cropped facial speech function regions, and performing image reconstruction on the fused features to obtain the fused visual features, namely the visual fusion images; the method comprises the following steps:
S2.3.1, constructing a feature extraction module for extracting different features of the source images, including texture detail features, color features and spatial information features of the source images;
S2.3.2, constructing a feature fusion module, which cascades (concatenates) the extracted image features to obtain the fused features;
S2.3.3, constructing an image reconstruction module, which performs image reconstruction on the obtained fused features and fuses the extracted texture detail information into the extracted spatial information using dense connections to obtain the visual fusion image;
S3, extracting and aligning audio-visual features; the audio-visual features comprise the visual fusion image features and the acoustic features of the dysarthric speech;
S3.1, extracting the visual fusion images to obtain the visual fusion image features during pronunciation;
S3.2, extracting the acoustic features of the dysarthric speech;
S3.3, aligning the acoustic features of the dysarthric speech with the visual fusion image features during pronunciation, with the pronunciation phonemes corresponding simultaneously to a pronunciation video segment and a speech segment, to obtain the aligned audio-visual features;
S4, performing dysarthric speech recognition through audio-visual second-level fusion using the aligned audio-visual features, i.e., obtaining a fused feature parameter matrix of speech and video through the audio-visual second-level fusion; according to the obtained fused feature parameter matrix, obtaining a mapping from the audio-visual fusion features to phoneme characters by training a deep temporal neural network mapping model, to obtain a string of phoneme characters;
S4.2, decoding the obtained phoneme characters to obtain the dysarthric speech recognition result;
Through the above steps, dysarthric speech recognition based on multi-stage audio-visual fusion is realized.
2. The method of claim 1, wherein the convolutional layers of the visual fusion convolutional neural network model constructed in step S2 use 3×3 and 1×1 convolution kernels with a stride of 1.
3. The method of claim 1, wherein the plurality of speech function areas defined in step S2.1 include lip area, chin area, left palate area, right palate area and nose area.
4. A dysarthric speech recognition method based on multi-stage audio-visual fusion as claimed in claim 1, characterized in that step S2.2 frames the collected video and obtains facial images frame by frame, specifically using a face detector in the dlib library to detect and crop the multiple facial speech function regions; the process comprises: dlib face detection, key point detection, face alignment, and cropping of the facial speech function regions;
dlib face detection: the face in each video frame is detected using the detector dlib.get_frontal_face_detector();
dlib key point detection: facial key points are detected based on the 68-point facial landmark predictor shape_predictor_68_face_landmarks.dat of the dlib library;
dlib face alignment: the face alignment operation allows the subsequent model to extract features that are independent of the positions of the facial features and related only to their shape and texture;
dlib cropping of the facial speech function regions: the image is cropped into the different facial speech function regions according to the key point positions.
5. The method for speech recognition of dysarthria based on multi-stage audiovisual fusion according to claim 1, characterized in that in step S2.3, when the number of source images is k, the feature extraction module consists of k branches;
the input of each feature extraction branch module is respectively defined as
Figure FDA0003978094810000021
And
Figure FDA0003978094810000031
the output is->
Figure FDA0003978094810000032
k is a source diagramThe number of images, i, is the number of convolution layers. As shown in equation (2), where l is a convolution operation, each branch contains three convolution layers for extracting texture details, color and spatial information of the source image. The k feature extraction branch modules extract the features of the output image, and the features are respectively expressed as:
Figure FDA0003978094810000033
the feature fusion module extracts the output of the two feature extraction branch modules
Figure FDA0003978094810000034
Cascading to obtain a fused image characteristic c 1 As shown in formula (3):
Figure FDA0003978094810000035
wherein, c 1 As an input of the image reconstruction module, concate represents cascade fusion;
the image reconstruction module comprises 8 convolution layers defined as
Figure FDA0003978094810000036
i is the number of convolution layers, c 1 As input to an image reconstruction module, c 2 、c 3 As an intermediate feature of the image reconstruction module; the first layer of convolution layer and the third layer of convolution layer in the characteristic extraction module are respectively connected to the third layer of convolution layer and the fifth layer of convolution layer of the image reconstruction module through close connection, and the reconstructed fusion image is represented as follows:
Figure FDA0003978094810000037
Figure FDA0003978094810000038
Figure FDA0003978094810000039
Figure FDA00039780948100000310
Figure FDA00039780948100000311
and finally obtaining the fused visual features.
6. A dysarthric speech recognition method based on multi-level audiovisual fusion as claimed in claim 1, characterized in that step S3.1 extracts features of the visually fused image, in particular histogram of oriented gradients of the image using the ResNet-18 network.
7. The method for speech recognition based on multi-level audiovisual fusion as claimed in claim 1, wherein step S3.2 is to extract acoustic features of dysarthric speech by using mel-frequency spectral parameter vectors.
8. The dysarthric speech recognition method based on multi-stage audio-visual fusion according to claim 1, wherein the audio-visual two-stage fusion method of step S4 comprises: splicing the visual fusion image characteristic parameter vector and the voice acoustic characteristic parameter vector frame by frame to obtain a fusion characteristic parameter matrix of voice and video; a low-dimensional matrix in the fusion characteristic parameter matrix is a voice acoustic characteristic parameter vector, and a high-dimensional matrix is a visual fusion image characteristic parameter vector;
recognizing the dysarthric speech comprises: adopting a sequence-to-sequence Transformer model with connectionist temporal classification, taking the fused feature parameter matrix after the audio-visual second-level fusion as the input of the Transformer network model, and setting an objective function for model training to train the model.
9. The dysarthric speech recognition method based on multi-level audio-visual fusion according to claim 8, wherein the audio-visual two-level fusion of step S4 is specifically:
adopting the sequence-to-sequence Transformer model with connectionist temporal classification (Transformer-CTC) and the deep temporal neural network model Transformer-S2S; in the training phase, a linear combination of the connectionist temporal classification (CTC) and sequence-to-sequence (S2S) objectives is chosen as the objective function, expressed as:
$L = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x)$
where $x = (x_1, \ldots, x_T)$ is the input audio-visual fusion feature parameter matrix and $y = (y_1, \ldots, y_L)$ is the output phoneme character sequence; $p_{ctc}(y|x)$ is the CTC model conditional probability, $p_{s2s}(y|x)$ is the S2S model conditional probability, $L$ is the combined objective function of CTC and S2S, and $\alpha$ is the relative weight of CTC and S2S.
10. The method of claim 8, wherein an RNN language model $p_{LM}(y)$ is used in decoding the phoneme characters, expressed as:
$\log p^{*}(y|x) = \alpha \log p_{ctc}(y|x) + (1-\alpha) \log p_{s2s}(y|x) + \theta \log p_{LM}(y)$
where $\theta$ is a parameter that controls the contribution of the language model.
CN202211536927.8A 2022-12-02 2022-12-02 Dysarthria voice recognition method based on multi-stage audio-visual fusion Pending CN115985310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211536927.8A CN115985310A (en) 2022-12-02 2022-12-02 Dysarthria voice recognition method based on multi-stage audio-visual fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211536927.8A CN115985310A (en) 2022-12-02 2022-12-02 Dysarthria voice recognition method based on multi-stage audio-visual fusion

Publications (1)

Publication Number Publication Date
CN115985310A true CN115985310A (en) 2023-04-18

Family

ID=85965604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211536927.8A Pending CN115985310A (en) 2022-12-02 2022-12-02 Dysarthria voice recognition method based on multi-stage audio-visual fusion

Country Status (1)

Country Link
CN (1) CN115985310A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116289A (en) * 2023-10-24 2023-11-24 吉林大学 Medical intercom management system for ward and method thereof
CN117116289B (en) * 2023-10-24 2023-12-26 吉林大学 Medical intercom management system for ward and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination