CN116151226B - Machine learning-based deaf-mute sign language error correction method, equipment and medium


Info

Publication number
CN116151226B
Authority
CN
China
Prior art keywords
sign language
error
deaf
mute
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211632041.3A
Other languages
Chinese (zh)
Other versions
CN116151226A
Inventor
梁智杰
武锐霞
杨娟
吴长城
李红霞
冯朝胜
王玲
毛笋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Normal University
Sichuan Water Conservancy Vocational College
Original Assignee
Sichuan Normal University
Sichuan Water Conservancy Vocational College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Sichuan Normal University and Sichuan Water Conservancy Vocational College
Priority to CN202211632041.3A
Publication of CN116151226A
Application granted
Publication of CN116151226B
Legal status: Active (current)


Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/237 Lexical tools
    • G06F40/30 Semantic analysis
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a machine learning-based sign language error correction method, device and medium for the deaf-mute. A sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the translated text data, which improves the accuracy of the features extracted from sign language images. By building a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data, eliminating the cross-modal heterogeneous gap between video and text. A sign language video sample containing syntactic errors can thus be converted into a corrected teaching-feedback animation that guides the deaf-mute through autonomous sign language cognitive training: the sign language sample demonstrated by the deaf-mute is decomposed, syntactic errors that do not conform to sign language rules are detected, and corrective feedback is given, thereby improving the interactive learning experience of the deaf-mute.

Description

Machine learning-based deaf-mute sign language error correction method, equipment and medium
Technical Field
The invention relates to the field of data processing, in particular to a method, equipment and medium for correcting sign language of a deaf-mute based on machine learning.
Background
Sign language is the most intuitive way for the deaf-mute to express their needs and obtain information services, and it is also the first language that the deaf-mute learn in a natural state. In recent years, the rapid development of artificial intelligence (AI) has made intelligent sign language teaching possible. If a machine can emulate a sign language teacher, automatically understand the sign language expressions of the deaf-mute, and find and correct errors in those expressions through feedback, the workload of sign language teachers can be reduced, and learners can be guided by standardized actions to develop good expression habits during early sign language cognitive training, laying a solid foundation for equal participation in social life later on. At present, research on computer-aided sign language teaching is still at an early stage, and realizing intelligent sign language correction feedback still faces the following two technical difficulties.
Intelligent sign language cognition assistance centered on the deaf-mute learner first requires that the machine be able to recognize sign language accurately. An early representative international study was Kids Sign Online (KSO). Aimed at the cognitive characteristics of young children, that study first shows deaf children a demonstration sign language video, then captures the children's imitated actions, feeds them into a sign language analyzer driven by a Hidden Markov Model (HMM), and computes a matching score between the imitation and the standard sign language by comparing the two sequences. In China, Gao Wen et al. proposed deaf-mute sign language translation teaching based on a multi-modal interface; the study uses a temporal clustering algorithm to recognize sign language videos and preliminarily used two-dimensional animation for teaching demonstrations of Chinese Sign Language. However, limited by the semantic complexity of sign language video, such studies rely mainly on human experience for feature extraction from sign language images and are only suitable for analyzing and measuring small-scale sign language data. To solve this problem, the temporal and spatial features of sign language video must be extracted synchronously so that the machine can translate sign language accurately.
Neural networks, represented by deep learning, give machines a new impetus for understanding the sign language of the deaf-mute. In recent years the U.S. National Science Foundation has continuously funded SAIL (Signing Avatars and Immersive Learning), which uses artificial intelligence and deep learning to promote sign language teaching cognition for the deaf-mute, and EAGR (Efficient Avatars Generator Recognition), which studies deaf-mute sign language recognition and generation. However, these studies can only provide one-way sign language recognition or demonstration and cannot offer targeted interactive guidance based on an individual's mastery of sign language. Given the bidirectional nature of teaching interaction, the machine must first detect missing vocabulary and syntactic errors in the sign language, and then provide accurate personalized feedback according to each learner's cognitive situation. This involves sign language video understanding and high-level syntactic reasoning, and overcoming the cross-modal heterogeneous gap between video and text remains a challenging technical problem in the field of artificial intelligence.
Disclosure of Invention
The invention aims to provide a machine learning-based sign language error correction method, device and medium for the deaf-mute. Sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the translated text data, improving the accuracy of the features extracted from sign language images. By establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text.
The invention is realized by the following technical scheme:
the invention relates to a machine learning-based sign language error correction method for a deaf-mute, which comprises the following specific steps:
s1, acquiring sign language limb language video data, constructing a video space-time synchronization feature coding frame, extracting features of the video data, and translating the extracted features to obtain text data;
s2, establishing a sign language syntax corpus and a sign language syntax error detection model, pre-training the sign language syntax error detection model according to the sign language syntax corpus, detecting errors of text data by the trained sign language syntax error detection model, constructing a syntax error correction candidate set according to the sign language syntax corpus, and correcting errors of the text data;
s3, establishing cross-modal unified characterization of the text data and the sign language limb language video data, embedding different modal information into the same common vector space, constructing a sign language feedback animation generation model, and performing cross-modal countermeasure training on the corrected text data to generate feedback video data.
According to the invention, sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the text data, improving the accuracy of the features extracted from sign language images. By establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text. A unified cross-modal representation model is established among the video, the text semantics and the animation, cross-modal adversarial training is carried out as a two-player game, and the corrected sign language feedback demonstration animation is generated automatically. In this way a sign language video sample containing syntactic errors can be converted into a corrected teaching-feedback animation that guides the deaf-mute learner through autonomous sign language cognitive training: the sign language sample demonstrated by the learner is decomposed, the syntactic errors that do not conform to sign language rules are detected, and corrective feedback is given, thereby improving the learning interaction experience of the deaf-mute.
Further, the constructing the video space-time synchronization feature coding framework performs feature extraction on video data, and specifically includes:
and extracting synchronous features of the two-dimensional space domain image and the one-dimensional time domain sequence from the video data by adopting a long-short-time memory network ConvLSTM with convolution operation, wherein the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, the local time domain channel ConvLSTM carries out deep learning on optical flow features, and the global time domain channel ConvLSTM carries out deep learning on video semantic actions.
Furthermore, the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, and the method specifically comprises the following steps:
performing feature extraction by adopting two convolution time sequence layers, and encoding sign language video into a two-dimensional tensor containing space-time information;
obtaining the maximum value of all feature extraction of a two-dimensional tensor in a designated area and the window size of pooling operation in a space dimension, and determining the output of a pooling layer according to the output of a convolution time sequence layer to obtain high-level semantic features;
carrying out linear leveling operation on the space-time characteristics by adopting a normalization layer to obtain a one-dimensional linear sequence fused with the space-time characteristics;
and introducing an attention mechanism layer, extracting and screening the characteristics of the linear sequence, sending the high-level semantic characteristics into a full-connection classifier, and outputting sentences containing keywords.
Further, in the process of extracting the features by adopting the long-short-time memory network ConvLSTM, the method further comprises the following steps:
constructing a forgetting door, an input door and an output door;
the forgetting gate determines information to be reserved by the neuron at the current moment according to the current input and the output at the last moment;
the input gate controls the proportion of information added into the candidate state according to the current input, the output at the last moment and the weight of the forgetting gate so as to generate a new state;
and the output gate determines the output value at the moment according to the new state, and controls the information residual proportion of the video data according to the normalized range of the output value.
Further, the error detection of text data error information by the trained sign language syntax error detection model specifically includes:
shielding the isolated words in the trained sign language syntax error detection model, so that the model does not know whether the isolated words at the current position conform to the whole semantics, the trained sign language syntax error detection model is guided to predict the isolated words depending on the context information, and error information in text data is detected;
determining the sentence embedding vector to be corrected and the sentence output after the sign language syntax error detection model correction according to the text data error information, and uniformly segmenting the video frame by frame in the space dimension;
and converting the two-dimensional image into a one-dimensional linear sequence containing fine-granularity region codes according to slice and position codes, generating an embedded vector of a statement to be corrected and the probability of error of each position in advance, and if the value is larger than a set threshold value, indicating that the corresponding position is in error.
Further, the constructing a syntactic error correction candidate set according to the sign language syntactic corpus, and correcting the text data error information specifically includes:
feature fusion was performed using Soft-mask:
the error probability of each position is multiplied by the characteristic of the mask character to be input as a first part of an error corrector, the non-error probability is multiplied by the original input characteristic to be input as a second part, and the two parts are added to be the characteristic of each character;
training the fused features:
the method comprises the steps of randomly shielding an input original one-dimensional sequence according to set probability, enabling a model not to know whether shielding words at the current position are correct sign language isolated words, predicting original values of the shielding words according to context associated words, comparing and sequencing all candidate sets through error probability, and outputting candidate words with highest probability to fill and replace error correction.
Further, the step S3 specifically includes:
constructing a discrimination network, adopting sign language limb language video data as training data, and generating a true/false classifier by extracting the characteristic training of a real sample;
constructing a generating network, acquiring a feedback statement sequence, performing distribution transformation by using random noise, and mapping an input space to a sample space to obtain a generated sample;
training the sign language feedback animation generation model, sending the generated sample into a discrimination network for true/false discrimination and classification, and continuously iterating until the difference between the generated sample and the true sample is smaller than a threshold value, and generating feedback video data through the sign language feedback animation generation model.
Further, the generating of the feedback video data further comprises data processing of sign language limb language video data, extracting of three-dimensional skeleton coordinates and two-dimensional image limb information, and sending the three-dimensional skeleton coordinates and the two-dimensional image limb information to a generating network;
and calculating the similarity of the limb track frame by frame through a discrimination network, and generating feedback video data demonstration animation in real time.
The second aspect of the invention provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes a deaf-mute sign language error correction method based on machine learning when executing the program.
A third aspect of the present invention is a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a machine learning-based method of sign language correction for a deaf-mute.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the text data, improving the accuracy of the features extracted from sign language images; by establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text;
2. According to the characteristics of sign language expression of the deaf-mute, an end-to-end feature extraction framework is constructed, cross-modal semantic association is established between sign language limb information and text instructions, and automatic machine understanding of the sign language of the deaf-mute is realized;
3. By drawing on the way a teacher of the deaf understands and corrects sign language, and by representing sign language syntactic knowledge as a graph structure, prior knowledge support is provided for sign language syntactic reasoning, realizing intelligent correction of the sign language of the deaf-mute;
4. An interactive sign language training mode between the deaf-mute learner and a virtual teacher is constructed, and the sign language feedback animation is generated automatically through collaborative training and optimization of an adversarial generation model.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 is a general flow of a method for intelligent correction of sign language of a deaf-mute in an embodiment of the present invention;
FIG. 2 is a machine understanding model of a deaf-mute sign in an embodiment of the invention;
FIG. 3 is a sign language syntax detection and correction model in an embodiment of the invention;
FIG. 4 is a sign language feedback animation generation model in an embodiment of the invention;
fig. 5 is an overall framework of the intelligent error correction system for sign language of the deaf-mute in the embodiment of the invention.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Example 1
As shown in fig. 1, the method for correcting sign language of deaf-mute based on machine learning according to the first aspect of the present embodiment includes the following specific steps:
s1, acquiring sign language limb language video data, constructing a video space-time synchronization feature coding frame, extracting features of the video data, and translating the extracted features to obtain text data;
s2, establishing a sign language syntax corpus and a sign language syntax error detection model, pre-training the sign language syntax error detection model according to the sign language syntax corpus, detecting errors of text data by the trained sign language syntax error detection model, constructing a syntax error correction candidate set according to the sign language syntax corpus, and correcting errors of the text data;
s3, establishing cross-modal unified characterization of the text data and the sign language limb language video data, embedding different modal information into the same common vector space, constructing a sign language feedback animation generation model, and performing cross-modal countermeasure training on the corrected text data to generate feedback video data.
In the method, sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the text data, which improves the accuracy of the features extracted from sign language images. By establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text. A unified cross-modal representation model is established among the video, the text semantics and the animation, cross-modal adversarial training is carried out as a two-player game, and the corrected sign language feedback demonstration animation is generated automatically. In this way a sign language video sample containing syntactic errors can be converted into a corrected teaching-feedback animation that guides the deaf-mute learner through autonomous sign language cognitive training: the sign language sample demonstrated by the learner is decomposed, the syntactic errors in it that do not conform to sign language rules are detected, and corrective feedback is given, thereby improving the learning interaction experience of the deaf-mute.
As shown in fig. 2, constructing a video space-time synchronization feature coding framework to perform feature extraction on video data specifically includes:
and extracting synchronous features of the two-dimensional space domain image and the one-dimensional time domain sequence from the video data by adopting a long-short-time memory network ConvLSTM with convolution operation, wherein the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, the local time domain channel ConvLSTM carries out deep learning on optical flow features, and the global time domain channel ConvLSTM carries out deep learning on video semantic actions.
Specifically, the ConvLSTM unit controls the cyclic transmission of feature information by introducing gate structures, and selectively outputs it to the external hidden-layer state h_t:

h_t = o_t ⊙ tanh(c_t)

In the above formula, c_{t-1} denotes the output of the memory cell at the previous time step, controlled by the forgetting gate f_t at the corresponding time, and ⊙ denotes the Hadamard product of the vector elements. The candidate state output by the nonlinear activation function is controlled by the input gate i_t, and the calculation logic is as follows:

c̃_t = tanh(W_c * x_t + U_c * h_{t-1} + b_c)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

In the above formulas, x_t denotes the input information in the current state, stored as a two-dimensional tensor; h_{t-1} denotes the output at the previous time step; * denotes the convolution operation; W_c and U_c denote the weights of the input information at the current and previous time steps respectively, and b_c denotes a bias term. At time t, the state unit c_t contains all the historical information carried forward to the current moment, and the proportion of information retained is controlled by three gate structures:

f_t = σ(W_i * x_t + U_i * h_{t-1} + b_i)

i_t = σ(W_f * x_t + U_f * h_{t-1} + b_f)

o_t = σ(W_o * x_t + U_o * h_{t-1} + b_o)

In the above formulas, f_t denotes the forgetting gate, i_t the input gate and o_t the output gate. The forgetting gate f_t determines, from the current input x_t and the previous output h_{t-1}, which information the neuron keeps at the current moment; the input gate i_t controls, from x_t, h_{t-1} and the forgetting-gate weight U_f, how much information is added to the candidate state c̃_t in order to generate the new state c_t; the output gate o_t determines the output value h_t at this moment from the updated state c_t. σ(·) denotes the sigmoid nonlinear activation function:

σ(x) = 1 / (1 + e^(-x))

Because the output values of the three gates are normalized to the interval [0, 1], they control the proportion of information that is retained.
After the spatio-temporal information of the two-dimensional image sequence has been propagated through the ConvLSTM units, the spatio-temporal information of the video can be fused effectively over the global range, and, compared with a traditional one-dimensional long short-term memory network, richer spatio-temporal information of the video can be captured.
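For illustration, the following is a minimal sketch of a ConvLSTM cell in PyTorch (the framework is an assumption of this example; the patent does not prescribe one). It uses the common formulation in which a single convolution over the concatenation [x_t, h_{t-1}] produces all four gate pre-activations, so the per-gate weight pairing is simplified relative to the equations above; tensor sizes are illustrative only.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: the gates are computed by convolutions instead of the
    matrix products of a 1-D LSTM, so spatial structure is preserved."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # A single convolution over [x_t, h_{t-1}] produces the input, forget and
        # output gates plus the candidate state.
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # values in [0, 1]
        g = torch.tanh(g)                  # candidate state
        c_t = f * c_prev + i * g           # keep part of the history, add new information
        h_t = o * torch.tanh(c_t)          # h_t = o_t ⊙ tanh(c_t)
        return h_t, c_t

# Usage on a dummy 64x64 sign-language frame with 3 channels and 32 hidden maps
cell = ConvLSTMCell(3, 32)
x = torch.randn(1, 3, 64, 64)
h = c = torch.zeros(1, 32, 64, 64)
h, c = cell(x, h, c)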
The sign language video is encoded into a two-dimensional tensor containing rich spatio-temporal information through feature extraction by two convolutional temporal layers (ConvLSTM1-2), which are then connected to two pooling layers (Pool1-2) that dynamically discard features. The pooling layer is defined as:

Pool_max = max_{n×n}(x_t)

In the above formula, x_t is the two-dimensional tensor output by the convolutional temporal layer (ConvLSTM); Pool_max denotes the output obtained after pooling; and n denotes the window size of the pooling operation in the spatial dimension, so that the maximum value is extracted over all features of the two-dimensional tensor within the n×n region. After two levels of pooling, the representation is gradually distilled into high-level semantic features. A normalization layer (Norm) then performs a linear flattening (Flat) operation on the spatio-temporal features, giving a one-dimensional linear sequence that fuses the spatio-temporal features. Before the classification layer, an attention mechanism layer is introduced to perform feature selection on the linear sequence; the process can be expressed as:

h'_i = Σ_j w_ij · h_j

where the weight w_ij is computed from h'_{i-1} and h_j through an intermediate score v_ij and a softmax normalization:

w_ij = exp(v_ij) / Σ_k exp(v_ik)

so that v_ij denotes an intermediate variable and the final weights w_ij satisfy Σ_j w_ij = 1.

By applying this weighted transformation to the original input sequence, the attention mechanism lets the model, according to the training error of sign language recognition, concentrate its weight optimization on the key information of the encoded output, strengthening the association between the video spatio-temporal features and the sign language isolated words. After all feature extraction and selection are completed, the high-level semantic features are fed into a fully connected classifier (a multilayer perceptron, MLP), and the final output is a sentence containing the keywords.
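A sketch of the spatial-channel pipeline described above (two spatio-temporal encoding layers, two pooling layers, flattening with normalization, an attention layer and a fully connected classifier), again assuming PyTorch. For brevity the ConvLSTM layers are stood in for by 3-D convolutions, and the channel counts, frame count and vocabulary size are invented placeholders rather than values from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialChannel(nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        # Two spatio-temporal encoding layers (stand-ins for ConvLSTM1-2)
        self.enc1 = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.enc2 = nn.Conv3d(16, 32, kernel_size=3, padding=1)
        # Two pooling layers (Pool1-2): max over a 2x2 spatial window
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.norm = nn.LayerNorm(32)
        # Attention layer that re-weights the flattened sequence
        self.attn_score = nn.Linear(32, 1)
        # Fully connected classifier (MLP) over the attended features
        self.mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                                 nn.Linear(64, vocab_size))

    def forward(self, video):                      # video: (B, 3, T, H, W)
        x = self.pool(F.relu(self.enc1(video)))
        x = self.pool(F.relu(self.enc2(x)))        # (B, 32, T, H/4, W/4)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 2, 3, 4, 1).reshape(b, t * h * w, c)
        seq = self.norm(seq)                       # flattened one-dimensional linear sequence
        w_attn = torch.softmax(self.attn_score(seq), dim=1)   # attention weights sum to 1
        context = (w_attn * seq).sum(dim=1)        # weighted fusion of the key information
        return self.mlp(context)                   # keyword logits for the clip

logits = SpatialChannel()(torch.randn(1, 3, 16, 64, 64))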
As shown in fig. 3 and fig. 5, detecting error information in the text data with the trained sign language syntax error detection model specifically includes:
masking 10% of the isolated words in the trained sign language syntax error detection model, so that the model does not know whether the isolated word at the current position fits the overall semantics; the trained sign language syntax error detection model is thereby guided to predict the masked isolated words from the context information, and error information in the text data is detected;
determining, according to the detected text data error information, the embedding vector of the sentence to be corrected and the sentence output after correction by the sign language syntax error detection model, and uniformly segmenting the video frame by frame in the spatial dimension;
and converting the two-dimensional images, according to the slice and position codes, into a one-dimensional linear sequence containing fine-grained region codes, generating in advance the embedding vector of the sentence to be corrected and the error probability of each position; if the probability is larger than a set threshold, the corresponding position is marked as erroneous.
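A minimal sketch of the detection step, again assuming PyTorch: a small bidirectional encoder (here a GRU, standing in for the masked-prediction model) outputs a per-position error probability, and positions whose probability exceeds the configured threshold are flagged. The vocabulary size, token ids and threshold value are illustrative assumptions, not the patent's.

import torch
import torch.nn as nn

class ErrorDetector(nn.Module):
    """Predicts, for every position of a sign-language gloss sentence,
    the probability p(i) that the word at that position is wrong."""

    def __init__(self, vocab_size=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Bidirectional encoder so each position sees its context on both sides,
        # mimicking prediction that relies on the surrounding words.
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.error_head = nn.Linear(2 * dim, 1)

    def forward(self, token_ids):                      # (B, L) word indices
        h, _ = self.encoder(self.embed(token_ids))
        return torch.sigmoid(self.error_head(h)).squeeze(-1)   # (B, L) in [0, 1]

detector = ErrorDetector()
tokens = torch.tensor([[12, 87, 3, 451, 9]])           # one gloss sentence
p_err = detector(tokens)
threshold = 0.5                                         # illustrative threshold
error_positions = (p_err > threshold).nonzero()         # positions flagged as wrong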
In the error-correction stage, prior knowledge of sign language syntax is used to guide the error-correction reasoning process: a syntactic error-correction candidate set is constructed from the sign language knowledge graph, and the association information among isolated-word entities is used to fill in or replace the isolated words, thereby completing the correction of the sentence. Constructing the syntactic error-correction candidate set from the sign language syntax corpus and correcting the text data error information specifically includes:
feature fusion was performed using Soft-mask:
the error probability of each position is multiplied by the characteristic of the mask character to be input as a first part of an error corrector, the non-error probability is multiplied by the original input characteristic to be input as a second part, and the two parts are added to be the characteristic of each character;
training the fused features:
the input original one-dimensional sequence is randomly masked according to a set probability, so that the model does not know whether the masked word at the current position is a correct sign language isolated word; the original value of the masked word is predicted from the context-associated vocabulary, all candidates in the set are compared and ranked by error probability, and the candidate word with the highest probability is output to fill in or replace the error. The calculation logic is as follows:
Attention(Q, K, V) = softmax((Q · K^T) / √d + B) · V

Multi-Head = [Attention_1, Attention_2, ..., Attention_n]

In the above formulas, Q, K and V are obtained by linear transformations of the input vector x: Q denotes the query matrix (Query), K the key matrix (Key) and V the value matrix (Value); B denotes a position-offset matrix, and d denotes the dimension of the query vector Q and the key vector K. Scaling the inner product of Q and the transpose of K prevents the vector distribution from becoming uneven when an excessively large inner product passes through the classifier. Multi-Head is the concatenation of multiple attention heads. After the high-dimensional spatial features are extracted, each sign language frame is treated as one slice in the time dimension, and the temporal correlation among the frames is obtained after the temporal information is encoded.
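The Soft-Mask fusion and correction described above can be sketched as follows, assuming PyTorch; the corrector here is a small Transformer encoder with a linear head over the candidate vocabulary, and all sizes, the mask id and the example error probabilities are invented for illustration.

import torch
import torch.nn as nn

class SoftMaskedCorrector(nn.Module):
    """Fuses the detector's per-position error probability with a mask
    embedding, then predicts a replacement word for every position."""

    def __init__(self, vocab_size=5000, dim=128, mask_id=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mask_id = mask_id
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.out = nn.Linear(dim, vocab_size)           # scores over the candidate set

    def forward(self, token_ids, p_err):                # p_err: (B, L) from the detector
        e_in = self.embed(token_ids)                     # original input features
        e_mask = self.embed(torch.full_like(token_ids, self.mask_id))
        p = p_err.unsqueeze(-1)
        # Soft-Mask fusion: error part plus non-error part, added per character
        fused = p * e_mask + (1.0 - p) * e_in
        h = self.encoder(fused)
        return self.out(h)                               # (B, L, vocab)

corrector = SoftMaskedCorrector()
tokens = torch.tensor([[12, 87, 3, 451, 9]])
p_err = torch.tensor([[0.05, 0.92, 0.10, 0.03, 0.07]])  # position 1 looks wrong
logits = corrector(tokens, p_err)
best = logits.argmax(-1)          # highest-probability candidate fills each position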
To ensure the error-correction accuracy of the language model, prior knowledge is introduced in the form of a sign language knowledge graph to assist the syntactic error-correction reasoning. Nodes in the graph represent isolated-word entities, and edges represent word-to-word relationships. During error correction, the masked-word label is fed into the graph according to its position code i. The computational logic of the module that feeds the graph structure into the network is as follows:
H^(l) = f(H^(l-1), A)

In the above formula, H^(l) denotes the output of the l-th layer, l denotes the current layer number, and A denotes the adjacency matrix of the graph (with self-connections added); H^(l=0) = X denotes the input layer. f(·) denotes the layer-to-layer propagation, whose computational logic is:

H^(l) = σ(D^(-1/2) · A · D^(-1/2) · H^(l-1) · W)

In the above formula, H^(l-1) denotes the features of the current layer, W is a weight matrix, and D is the degree matrix, whose computational logic is:

D_ii = Σ_j A_ij

σ(·) denotes a nonlinear activation function, preferably the rectified linear unit (ReLU):

ReLU(X) = max(0, X)

In the above formula, if the input X of the activation function is less than or equal to 0, X is forced to 0; if the input X > 0, the original value is kept. This activation function guarantees a certain sparsity of the output, which accelerates the convergence of the model during training.
Following the connecting edges between entities, the model searches for error-prone words closely related to the masked word, establishes associations one by one according to the error probability p(i), and builds a graph data structure composed of the masked word and its error-prone words. A graph convolutional neural network then performs an aggregation operation on the graph nodes, so that the resulting word embedding contains not only the vector representation of the isolated word but also the contextual semantic-relation features between the masked word and its related words. Finally, the aggregated graph-structure information is injected into the word embeddings of the model, helping the model restore the correct sentence.
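A compact sketch of the graph-convolution propagation rule used to aggregate the masked word and its error-prone neighbours, assuming PyTorch; the toy adjacency matrix and embedding dimension are invented for illustration.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One propagation step H_next = ReLU(D^-1/2 · A_hat · D^-1/2 · H · W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))                # adjacency with self-connections
        deg = A_hat.sum(dim=1)                          # diagonal of the degree matrix
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt        # symmetric normalisation
        return torch.relu(A_norm @ self.weight(H))

# Toy graph: 4 isolated-word entities, edges link the masked word to error-prone words
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 0., 1.],
                  [1., 0., 0., 0.],
                  [0., 1., 0., 0.]])
H = torch.randn(4, 16)                                  # node (word) embeddings
H_next = GraphConvLayer(16, 16)(H, A)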
As shown in fig. 4, S3 specifically includes:
constructing a discrimination network, adopting sign language limb language video data as training data, and generating a true/false classifier by extracting the characteristic training of a real sample;
constructing a generating network, acquiring a feedback statement sequence, performing distribution transformation by using random noise, and mapping an input space to a sample space to obtain a generated sample;
training the sign language feedback animation generation model, sending the generated sample into a discrimination network for true/false discrimination and classification, and continuously iterating until the difference between the generated sample and the true sample is smaller than a threshold value, and generating feedback video data through the sign language feedback animation generation model.
First, a unified cross-modal representation between the text and the animation is established, so that information from different modalities is embedded into the same common vector space and the distance between samples can be measured; then a generative adversarial network is used to carry out cross-modal adversarial training in a game-playing manner and generate the demonstration animation of the sign language feedback.
The computational logic for the sign language feedback animation generation model is as follows:
in the above expression, D represents a discrimination network, and G represents a generation network. The discrimination network D of the model uses the real sign language sample X as training data, and a true/false two-classifier is trained by extracting the characteristics of the real sample. The generating network device G uses random noise Z to perform distribution transformation according to the feedback statement sequence, and maps the input space to the sample space to obtain a generating sampleModel training phase, will->Sending into discrimination network D for true/false discrimination classification to make sample +.>The gap from the real sample X is as small as possible. With training iteration, the similarity between the generated sample and the target real sample gradually increases until the generated sampleCan be identified as a real sample by the discriminator D, and represents that the output animation features are completeRestoring limb semantics in the video sample.
The overall optimization objective function of the sign language feedback animation generation model is as follows:
min_G max_D V(D, G) = E_{x~μ}[log D(x)] + E_{z~γ}[log(1 - D(G(z)))]

In the above formula, the two parts correspond to the optimization targets of the discrimination network and the generation network; x~μ and z~γ denote expectations over samples drawn from the specified real-data and noise distributions. The adversarial generation model converts the random sample distribution into generated samples through the minimax game between the discriminant function and the generating function. Specifically, for a given generator G, the optimized discriminator D constantly tries to assign high values to samples from the real distribution while assigning low values to generated samples. The generator converts the random sample distribution into generated samples, and the discriminator tries to distinguish them from training samples drawn from the real sign language data set, so that the generation of the sign language animation is converted into a two-player, non-convex optimization problem: the generated sign language animation samples become similar in distribution to the real sign language training samples, satisfying the task of generating the feedback animation.
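The adversarial training loop implied by the objective above can be sketched as follows, assuming PyTorch; the generator and discriminator are stand-in multilayer perceptrons, and the random tensors stand in for the real motion features and the noise-plus-sentence code, so this shows only the alternating optimization, not the patent's actual architectures.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))   # noise -> motion features
D = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 256)           # placeholder for real sign-language motion features
    z = torch.randn(32, 64)               # random noise (standing in for noise + sentence code)
    fake = G(z)

    # Discriminator: assign high values to real samples, low values to generated ones
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator label generated samples as real
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    # Training stops once the generated samples are close enough to the real distribution.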
In some possible embodiments, generating the feedback video data further includes data processing the sign language limb language video data, extracting three-dimensional skeletal coordinates and two-dimensional image limb information, and sending the three-dimensional skeletal coordinates and the two-dimensional image limb information to a generation network;
The similarity of the limb trajectories is then calculated frame by frame through the discrimination network, and the feedback video demonstration animation is generated in real time.
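A small sketch of the frame-by-frame limb-trajectory similarity check, assuming the generated and reference poses are available as per-frame arrays of 3-D joint coordinates; the joint count and frame count are illustrative.

import numpy as np

def trajectory_similarity(generated, reference):
    """Frame-by-frame cosine similarity between two pose sequences,
    each of shape (num_frames, num_joints * 3) of 3-D skeleton coordinates."""
    sims = []
    for g, r in zip(generated, reference):
        sims.append(float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r) + 1e-8)))
    return float(np.mean(sims))

gen = np.random.rand(30, 21 * 3)     # 30 frames, 21 joints (illustrative)
ref = np.random.rand(30, 21 * 3)
score = trajectory_similarity(gen, ref)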
The second aspect of the present embodiment is an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements a method for correcting sign language of a deaf-mute based on machine learning when executing the program.
A third aspect of the present embodiment is a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements a machine learning-based method for error correction of sign language for a deaf-mute.
The foregoing detailed description further explains the objects, technical solutions and advantageous effects of the invention. It should be understood that the foregoing describes only specific embodiments and is not intended to limit the scope of protection of the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (8)

1. The deaf-mute sign language error correction method based on machine learning is characterized by comprising the following specific steps of:
s1, acquiring sign language limb language video data, constructing a video space-time synchronization feature coding frame, extracting features of the video data, and translating the extracted features to obtain text data;
s2, establishing a sign language syntax corpus and a sign language syntax error detection model, pre-training the sign language sentence method error detection model according to the sign language syntax corpus, detecting errors of text data by the trained sign language syntax error detection model, constructing a syntax error correction candidate set according to the sign language syntax corpus, and correcting errors of the text data;
the method for detecting text data error information through the trained sign language syntax error detection model specifically comprises the following steps:
shielding the isolated words in the trained sign language syntax error detection model, so that the model does not know whether the isolated words at the current position conform to the whole semantics, the trained sign language syntax error detection model is guided to predict the isolated words depending on the context information, and error information in text data is detected;
determining the sentence embedding vector to be corrected and the sentence output after the sign language syntax error detection model correction according to the text data error information, and uniformly segmenting the video frame by frame in the space dimension;
converting the two-dimensional image into a one-dimensional linear sequence containing fine-granularity region codes according to slice and position codes, generating an embedded vector of a statement to be corrected and the error probability of each position in advance, and if the error probability is greater than a set threshold value, indicating that the corresponding position is in error;
the method for constructing the syntactic error correction candidate set according to the sign language syntactic corpus, and correcting the text data error information specifically comprises the following steps:
feature fusion was performed using Soft-mask:
the error probability of each position is multiplied by the characteristic of the mask character to be input as a first part of an error corrector, the non-error probability is multiplied by the original input characteristic to be input as a second part, and the two parts are added to be the characteristic of each character;
training the fused features:
randomly shielding an input original one-dimensional sequence according to a set probability, so that a model does not know whether a shielding word at the current position is a correct sign language isolated word, predicting the original value of the shielding word by depending on a context associated vocabulary, comparing and sequencing all candidate sets by error probability, and outputting a candidate word with the highest probability for filling and replacing error correction;
s3, establishing cross-modal unified characterization of the text data and the sign language limb language video data, embedding different modal information into the same common vector space, constructing a sign language feedback animation generation model, and performing cross-modal countermeasure training on the corrected text data to generate feedback video data.
2. The machine learning-based sign language error correction method for the deaf-mute is characterized by comprising the steps of constructing a video space-time synchronous feature coding framework to extract features of video data, and specifically comprising the following steps:
and extracting synchronous features of the two-dimensional space domain image and the one-dimensional time domain sequence from the video data by adopting a long-short-time memory network ConvLSTM with convolution operation, wherein the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, the local time domain channel ConvLSTM carries out deep learning on optical flow features, and the global time domain channel ConvLSTM carries out deep learning on video semantic actions.
3. The machine learning-based sign language error correction method for the deaf-mute according to claim 2, wherein the space channel adopts ConvLSTM to perform deep learning on a sign language two-dimensional image sequence, and specifically comprises the following steps:
performing feature extraction by adopting two convolution time sequence layers, and encoding sign language video into a two-dimensional tensor containing space-time information;
obtaining the maximum value of all feature extraction of a two-dimensional tensor in a designated area and the window size of pooling operation in a space dimension, and determining the output of a pooling layer according to the output of a convolution time sequence layer to obtain high-level semantic features;
carrying out linear leveling operation on the space-time characteristics by adopting a normalization layer to obtain a one-dimensional linear sequence fused with the space-time characteristics;
and introducing an attention mechanism layer, extracting and screening the characteristics of the linear sequence, sending the high-level semantic characteristics into a full-connection classifier, and outputting sentences containing keywords.
4. The machine learning-based sign language error correction method for the deaf-mute is characterized in that the method further comprises the following steps in the feature extraction process by adopting the long-short-time memory network ConvLSTM:
constructing a forgetting door, an input door and an output door;
the forgetting gate determines information to be reserved by the neuron at the current moment according to the current input and the output at the last moment;
the input gate controls the proportion of information added into the candidate state according to the current input, the output at the last moment and the weight of the forgetting gate so as to generate a new state;
and the output gate determines the output value at the moment according to the new state, and controls the information residual proportion of the video data according to the normalized range of the output value.
5. The machine learning-based sign language error correction method for the deaf-mute according to claim 1, wherein the step S3 specifically comprises:
constructing a discrimination network, adopting sign language limb language video data as training data, and generating a true/false classifier by extracting the characteristic training of a real sample;
constructing a generating network, acquiring a feedback statement sequence, performing distribution transformation by using random noise, and mapping an input space to a sample space to obtain a generated sample;
training the sign language feedback animation generation model, sending the generated sample into a discrimination network for true/false discrimination and classification, and continuously iterating until the difference between the generated sample and the true sample is smaller than a threshold value, and generating feedback video data through the sign language feedback animation generation model.
6. The machine learning-based sign language error correction method for the deaf-mute according to claim 1, wherein the generating feedback video data further comprises data processing of sign language limb language video data, extracting three-dimensional skeleton coordinates and two-dimensional image limb information, and sending the three-dimensional skeleton coordinates and the two-dimensional image limb information to a generating network;
and calculating the similarity of the limb track frame by frame through a discrimination network, and generating feedback video data demonstration animation in real time.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a machine learning based sign language error correction method as claimed in any one of claims 1 to 6 when the program is executed by the processor.
8. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a machine learning based sign language error correction method for a deaf-mute as claimed in any one of claims 1 to 6.
CN202211632041.3A 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium Active CN116151226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211632041.3A CN116151226B (en) 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211632041.3A CN116151226B (en) 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium

Publications (2)

Publication Number Publication Date
CN116151226A CN116151226A (en) 2023-05-23
CN116151226B true CN116151226B (en) 2024-02-23

Family

ID=86339973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211632041.3A Active CN116151226B (en) 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium

Country Status (1)

Country Link
CN (1) CN116151226B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463250A (en) * 2014-12-12 2015-03-25 广东工业大学 Sign language recognition translation method based on Davinci technology
CN108256458A (en) * 2018-01-04 2018-07-06 东北大学 A kind of two-way real-time translation system and method for deaf person's nature sign language
CN109922371A (en) * 2019-03-11 2019-06-21 青岛海信电器股份有限公司 Natural language processing method, equipment and storage medium
CN110070065A (en) * 2019-04-30 2019-07-30 李冠津 The sign language systems and the means of communication of view-based access control model and speech-sound intelligent
WO2019226051A1 (en) * 2018-05-25 2019-11-28 Kepler Vision Technologies B.V. Monitoring and analyzing body language with machine learning, using artificial intelligence systems for improving interaction between humans, and humans and robots
KR102081854B1 (en) * 2019-08-01 2020-02-26 전자부품연구원 Method and apparatus for sign language or gesture recognition using 3D EDM
CN113609922A (en) * 2021-07-13 2021-11-05 中国矿业大学 Continuous sign language sentence recognition method based on mode matching
CN113780059A (en) * 2021-07-24 2021-12-10 上海大学 Continuous sign language identification method based on multiple feature points
CN113822187A (en) * 2021-09-10 2021-12-21 阿里巴巴达摩院(杭州)科技有限公司 Sign language translation, customer service, communication method, device and readable medium
CN114283493A (en) * 2021-12-09 2022-04-05 深圳市尚影视界科技有限公司 Artificial intelligence-based identification system
CN114817465A (en) * 2022-04-14 2022-07-29 海信电子科技(武汉)有限公司 Entity error correction method and intelligent device for multi-language semantic understanding
CN114842547A (en) * 2022-01-11 2022-08-02 南京工业大学 Sign language teaching method, device and system based on gesture action generation and recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US11263409B2 (en) * 2017-11-03 2022-03-01 Board Of Trustees Of Michigan State University System and apparatus for non-intrusive word and sentence level sign language translation
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences


Also Published As

Publication number Publication date
CN116151226A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN109670576B (en) Multi-scale visual attention image description method
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN111444968A (en) Image description generation method based on attention fusion
CN111858882A (en) Text visual question-answering system and method based on concept interaction and associated semantics
CN112905795A (en) Text intention classification method, device and readable medium
CN112699682A (en) Named entity identification method and device based on combinable weak authenticator
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113779310B (en) Video understanding text generation method based on hierarchical representation network
CN112905762A (en) Visual question-answering method based on equal attention-deficit-diagram network
Pezzelle et al. Is the red square big? MALeViC: Modeling adjectives leveraging visual contexts
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116402066A (en) Attribute-level text emotion joint extraction method and system for multi-network feature fusion
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN111046655A (en) Data processing method and device and computer readable storage medium
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN114048290A (en) Text classification method and device
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN116737897A (en) Intelligent building knowledge extraction model and method based on multiple modes
CN116151226B (en) Machine learning-based deaf-mute sign language error correction method, equipment and medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant