CN116151226B - Machine learning-based deaf-mute sign language error correction method, equipment and medium


Info

Publication number
CN116151226B
Authority
CN
China
Prior art keywords
sign language
error
deaf
mute
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211632041.3A
Other languages
Chinese (zh)
Other versions
CN116151226A
Inventor
梁智杰
武锐霞
杨娟
吴长城
李红霞
冯朝胜
王玲
毛笋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Normal University
Sichuan Water Conservancy Vocational College
Original Assignee
Sichuan Normal University
Sichuan Water Conservancy Vocational College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Sichuan Normal University and Sichuan Water Conservancy Vocational College
Priority to CN202211632041.3A
Publication of CN116151226A
Application granted
Publication of CN116151226B
Legal status: Active (current)


Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/237 Lexical tools
    • G06F40/30 Semantic analysis
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a machine learning-based sign language error correction method, device and medium for the deaf-mute. A sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the translated text data, which improves the accuracy of the features extracted from sign language images. By building a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data, eliminating the cross-modal heterogeneous gap between video and text. A sign language video sample containing syntactic errors can thus be converted into a corrected teaching-feedback animation that guides the deaf-mute through autonomous sign language cognitive training: the sign language sample demonstrated by the deaf-mute is decomposed, syntactic errors that do not conform to sign language rules are detected, and corrective feedback is given, thereby improving the interactive learning experience of the deaf-mute.

Description

Machine learning-based deaf-mute sign language error correction method, equipment and medium
Technical Field
The invention relates to the field of data processing, in particular to a method, equipment and medium for correcting sign language of a deaf-mute based on machine learning.
Background
Sign language is the most intuitive way for the deaf-mute to express their needs and obtain information services, and it is also the first language that the deaf-mute learn in a natural state. In recent years, the rapid development of artificial intelligence (AI) has made intelligent sign language teaching possible. If a machine can emulate a sign language teacher, automatically understand the sign language expressions of the deaf-mute, and find and correct errors in those expressions through feedback, the workload of sign language teachers can be reduced, and learners can be guided by standardized actions to develop good expression habits during early sign language cognitive training, laying a solid foundation for equal participation in social life later on. At present, research on computer-aided sign language teaching is still at an early stage, and realizing intelligent sign language correction feedback still faces the following two technical difficulties.
Intelligent sign language cognition assistance centered on the deaf-mute learner first requires that the machine be able to recognize sign language accurately. An early representative international study was Kids Sign Online (KSO). Aimed at the cognitive characteristics of young children, that study first shows deaf children a demonstration sign language video, then captures the children's imitated actions, feeds them into a sign language analyzer driven by a Hidden Markov Model (HMM), and computes a matching score between the imitation and the standard sign language by comparing the two sequences. In China, Gao Wen et al. proposed deaf-mute sign language translation teaching based on a multi-modal interface; the study uses a temporal clustering algorithm to recognize sign language videos and preliminarily used two-dimensional animation for teaching demonstrations of Chinese Sign Language. However, limited by the semantic complexity of sign language video, such studies rely mainly on human experience for feature extraction from sign language images and are only suitable for analyzing and measuring small-scale sign language data. To solve this problem, the temporal and spatial features of sign language video must be extracted synchronously so that the machine can translate sign language accurately.
Neural networks, represented by deep learning, give machines a new impetus for understanding the sign language of the deaf-mute. In recent years the U.S. National Science Foundation has continuously funded SAIL (Signing Avatars and Immersive Learning), which uses artificial intelligence and deep learning to promote sign language teaching cognition for the deaf-mute, and EAGR (Efficient Avatars Generator Recognition), which studies deaf-mute sign language recognition and generation. However, these studies can only provide one-way sign language recognition or demonstration and cannot offer targeted interactive guidance based on an individual's mastery of sign language. Given the bidirectional nature of teaching interaction, the machine must first detect missing vocabulary and syntactic errors in the sign language, and then provide accurate personalized feedback according to each learner's cognitive situation. This involves sign language video understanding and high-level syntactic reasoning, and overcoming the cross-modal heterogeneous gap between video and text remains a challenging technical problem in the field of artificial intelligence.
Disclosure of Invention
The invention aims to provide a machine learning-based sign language error correction method, device and medium for the deaf-mute. Sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the translated text data, improving the accuracy of the features extracted from sign language images. By establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text.
The invention is realized by the following technical scheme:
the invention relates to a machine learning-based sign language error correction method for a deaf-mute, which comprises the following specific steps:
s1, acquiring sign language limb language video data, constructing a video space-time synchronization feature coding frame, extracting features of the video data, and translating the extracted features to obtain text data;
s2, establishing a sign language syntax corpus and a sign language syntax error detection model, pre-training the sign language syntax error detection model according to the sign language syntax corpus, detecting errors of text data by the trained sign language syntax error detection model, constructing a syntax error correction candidate set according to the sign language syntax corpus, and correcting errors of the text data;
s3, establishing cross-modal unified characterization of the text data and the sign language limb language video data, embedding different modal information into the same common vector space, constructing a sign language feedback animation generation model, and performing cross-modal countermeasure training on the corrected text data to generate feedback video data.
According to the invention, sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the text data, improving the accuracy of the features extracted from sign language images. By establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text. A unified cross-modal representation model is established among the video, the text semantics and the animation, cross-modal adversarial training is carried out as a two-player game, and the corrected sign language feedback demonstration animation is generated automatically. In this way a sign language video sample containing syntactic errors can be converted into a corrected teaching-feedback animation that guides the deaf-mute learner through autonomous sign language cognitive training: the sign language sample demonstrated by the learner is decomposed, the syntactic errors that do not conform to sign language rules are detected, and corrective feedback is given, thereby improving the learning interaction experience of the deaf-mute.
Further, the constructing the video space-time synchronization feature coding framework performs feature extraction on video data, and specifically includes:
and extracting synchronous features of the two-dimensional space domain image and the one-dimensional time domain sequence from the video data by adopting a long-short-time memory network ConvLSTM with convolution operation, wherein the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, the local time domain channel ConvLSTM carries out deep learning on optical flow features, and the global time domain channel ConvLSTM carries out deep learning on video semantic actions.
Furthermore, the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, and the method specifically comprises the following steps:
performing feature extraction by adopting two convolution time sequence layers, and encoding sign language video into a two-dimensional tensor containing space-time information;
obtaining the maximum value of all feature extraction of a two-dimensional tensor in a designated area and the window size of pooling operation in a space dimension, and determining the output of a pooling layer according to the output of a convolution time sequence layer to obtain high-level semantic features;
carrying out linear leveling operation on the space-time characteristics by adopting a normalization layer to obtain a one-dimensional linear sequence fused with the space-time characteristics;
and introducing an attention mechanism layer, extracting and screening the characteristics of the linear sequence, sending the high-level semantic characteristics into a full-connection classifier, and outputting sentences containing keywords.
Further, in the process of extracting the features by adopting the long-short-time memory network ConvLSTM, the method further comprises the following steps:
constructing a forgetting door, an input door and an output door;
the forgetting gate determines information to be reserved by the neuron at the current moment according to the current input and the output at the last moment;
the input gate controls the proportion of information added into the candidate state according to the current input, the output at the last moment and the weight of the forgetting gate so as to generate a new state;
and the output gate determines the output value at the moment according to the new state, and controls the information residual proportion of the video data according to the normalized range of the output value.
Further, the error detection of text data error information by the trained sign language syntax error detection model specifically includes:
shielding the isolated words in the trained sign language syntax error detection model, so that the model does not know whether the isolated words at the current position conform to the whole semantics, the trained sign language syntax error detection model is guided to predict the isolated words depending on the context information, and error information in text data is detected;
determining the sentence embedding vector to be corrected and the sentence output after the sign language syntax error detection model correction according to the text data error information, and uniformly segmenting the video frame by frame in the space dimension;
and converting the two-dimensional image into a one-dimensional linear sequence containing fine-granularity region codes according to slice and position codes, generating an embedded vector of a statement to be corrected and the probability of error of each position in advance, and if the value is larger than a set threshold value, indicating that the corresponding position is in error.
Further, the constructing a syntactic error correction candidate set according to the sign language syntactic corpus, and correcting the text data error information specifically includes:
feature fusion was performed using Soft-mask:
the error probability of each position is multiplied by the characteristic of the mask character to be input as a first part of an error corrector, the non-error probability is multiplied by the original input characteristic to be input as a second part, and the two parts are added to be the characteristic of each character;
training the fused features:
the method comprises the steps of randomly shielding an input original one-dimensional sequence according to set probability, enabling a model not to know whether shielding words at the current position are correct sign language isolated words, predicting original values of the shielding words according to context associated words, comparing and sequencing all candidate sets through error probability, and outputting candidate words with highest probability to fill and replace error correction.
Further, the step S3 specifically includes:
constructing a discrimination network, adopting sign language limb language video data as training data, and generating a true/false classifier by extracting the characteristic training of a real sample;
constructing a generating network, acquiring a feedback statement sequence, performing distribution transformation by using random noise, and mapping an input space to a sample space to obtain a generated sample;
training the sign language feedback animation generation model, sending the generated sample into a discrimination network for true/false discrimination and classification, and continuously iterating until the difference between the generated sample and the true sample is smaller than a threshold value, and generating feedback video data through the sign language feedback animation generation model.
Further, the generating of the feedback video data further comprises data processing of sign language limb language video data, extracting of three-dimensional skeleton coordinates and two-dimensional image limb information, and sending the three-dimensional skeleton coordinates and the two-dimensional image limb information to a generating network;
and calculating the similarity of the limb track frame by frame through a discrimination network, and generating feedback video data demonstration animation in real time.
The second aspect of the invention provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes a deaf-mute sign language error correction method based on machine learning when executing the program.
A third aspect of the present invention is a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a machine learning-based method of sign language correction for a deaf-mute.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the text data, improving the accuracy of the features extracted from sign language images; by establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text;
2. According to the characteristics of sign language expression of the deaf-mute, an end-to-end feature extraction framework is constructed, cross-modal semantic association is established between sign language limb information and text instructions, and automatic machine understanding of the sign language of the deaf-mute is realized;
3. By drawing on the way a teacher of the deaf understands and corrects sign language, and by representing sign language syntactic knowledge as a graph structure, prior knowledge support is provided for sign language syntactic reasoning, realizing intelligent correction of the sign language of the deaf-mute;
4. An interactive sign language training mode between the deaf-mute learner and a virtual teacher is constructed, and the sign language feedback animation is generated automatically through collaborative training and optimization of an adversarial generation model.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 is a general flow of a method for intelligent correction of sign language of a deaf-mute in an embodiment of the present invention;
FIG. 2 is a machine understanding model of a deaf-mute sign in an embodiment of the invention;
FIG. 3 is a sign language syntax detection and correction model in an embodiment of the invention;
FIG. 4 is a sign language feedback animation generation model in an embodiment of the invention;
fig. 5 is an overall framework of the intelligent error correction system for sign language of the deaf-mute in the embodiment of the invention.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
Example 1
As shown in fig. 1, the method for correcting sign language of deaf-mute based on machine learning according to the first aspect of the present embodiment includes the following specific steps:
s1, acquiring sign language limb language video data, constructing a video space-time synchronization feature coding frame, extracting features of the video data, and translating the extracted features to obtain text data;
s2, establishing a sign language syntax corpus and a sign language syntax error detection model, pre-training the sign language syntax error detection model according to the sign language syntax corpus, detecting errors of text data by the trained sign language syntax error detection model, constructing a syntax error correction candidate set according to the sign language syntax corpus, and correcting errors of the text data;
s3, establishing cross-modal unified characterization of the text data and the sign language limb language video data, embedding different modal information into the same common vector space, constructing a sign language feedback animation generation model, and performing cross-modal countermeasure training on the corrected text data to generate feedback video data.
In the method, sign language limb language video data are acquired, and a sign language syntax corpus and a sign language syntax error detection model are established to detect and correct errors in the text data, which improves the accuracy of the features extracted from sign language images. By establishing a unified cross-modal representation of the text data and the sign language limb language video data, cross-modal adversarial training is performed on the corrected text data to generate feedback video data, eliminating the cross-modal heterogeneous gap between video and text. A unified cross-modal representation model is established among the video, the text semantics and the animation, cross-modal adversarial training is carried out as a two-player game, and the corrected sign language feedback demonstration animation is generated automatically. In this way a sign language video sample containing syntactic errors can be converted into a corrected teaching-feedback animation that guides the deaf-mute learner through autonomous sign language cognitive training: the sign language sample demonstrated by the learner is decomposed, the syntactic errors in it that do not conform to sign language rules are detected, and corrective feedback is given, thereby improving the learning interaction experience of the deaf-mute.
As shown in fig. 2, constructing a video space-time synchronization feature coding framework to perform feature extraction on video data specifically includes:
and extracting synchronous features of the two-dimensional space domain image and the one-dimensional time domain sequence from the video data by adopting a long-short-time memory network ConvLSTM with convolution operation, wherein the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, the local time domain channel ConvLSTM carries out deep learning on optical flow features, and the global time domain channel ConvLSTM carries out deep learning on video semantic actions.
Specifically, the ConvLSTM unit controls the cyclic transmission of feature information by introducing gate structures, and selectively outputs it to the external hidden-layer state h_t:

h_t = o_t ⊙ tanh(c_t)

In the above formula, c_{t-1} denotes the output of the memory cell at the previous time step, controlled by the forgetting gate f_t at the corresponding time, and ⊙ denotes the Hadamard product of the vector elements. The candidate state output by the nonlinear activation function is controlled by the input gate i_t, and the calculation logic is as follows:

c̃_t = tanh(W_c * x_t + U_c * h_{t-1} + b_c)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

In the above formulas, x_t denotes the input information in the current state, stored as a two-dimensional tensor; h_{t-1} denotes the output at the previous time step; * denotes the convolution operation; W_c and U_c denote the weights of the input information at the current and previous time steps respectively, and b_c denotes a bias term. At time t, the state unit c_t contains all the historical information carried forward to the current moment, and the proportion of information retained is controlled by three gate structures:

f_t = σ(W_i * x_t + U_i * h_{t-1} + b_i)

i_t = σ(W_f * x_t + U_f * h_{t-1} + b_f)

o_t = σ(W_o * x_t + U_o * h_{t-1} + b_o)

In the above formulas, f_t denotes the forgetting gate, i_t the input gate and o_t the output gate. The forgetting gate f_t determines, from the current input x_t and the previous output h_{t-1}, which information the neuron keeps at the current moment; the input gate i_t controls, from x_t, h_{t-1} and the forgetting-gate weight U_f, how much information is added to the candidate state c̃_t in order to generate the new state c_t; the output gate o_t determines the output value h_t at this moment from the updated state c_t. σ(·) denotes the sigmoid nonlinear activation function:

σ(x) = 1 / (1 + e^(-x))

Because the output values of the three gates are normalized to the interval [0, 1], they control the proportion of information that is retained.
After the spatio-temporal information of the two-dimensional image sequence has been propagated through the ConvLSTM units, the spatio-temporal information of the video can be fused effectively over the global range, and, compared with a traditional one-dimensional long short-term memory network, richer spatio-temporal information of the video can be captured.
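For illustration, the following is a minimal sketch of a ConvLSTM cell in PyTorch (the framework is an assumption of this example; the patent does not prescribe one). It uses the common formulation in which a single convolution over the concatenation [x_t, h_{t-1}] produces all four gate pre-activations, so the per-gate weight pairing is simplified relative to the equations above; tensor sizes are illustrative only.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step: the gates are computed by convolutions instead of the
    matrix products of a 1-D LSTM, so spatial structure is preserved."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # A single convolution over [x_t, h_{t-1}] produces the input, forget and
        # output gates plus the candidate state.
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # values in [0, 1]
        g = torch.tanh(g)                  # candidate state
        c_t = f * c_prev + i * g           # keep part of the history, add new information
        h_t = o * torch.tanh(c_t)          # h_t = o_t ⊙ tanh(c_t)
        return h_t, c_t

# Usage on a dummy 64x64 sign-language frame with 3 channels and 32 hidden maps
cell = ConvLSTMCell(3, 32)
x = torch.randn(1, 3, 64, 64)
h = c = torch.zeros(1, 32, 64, 64)
h, c = cell(x, h, c)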
The sign language video is encoded into a two-dimensional tensor containing rich spatio-temporal information through feature extraction by two convolutional temporal layers (ConvLSTM1-2), which are then connected to two pooling layers (Pool1-2) that dynamically discard features. The pooling layer is defined as:

Pool_max = max_{n×n}(x_t)

In the above formula, x_t is the two-dimensional tensor output by the convolutional temporal layer (ConvLSTM); Pool_max denotes the output obtained after pooling; and n denotes the window size of the pooling operation in the spatial dimension, so that the maximum value is extracted over all features of the two-dimensional tensor within the n×n region. After two levels of pooling, the representation is gradually distilled into high-level semantic features. A normalization layer (Norm) then performs a linear flattening (Flat) operation on the spatio-temporal features, giving a one-dimensional linear sequence that fuses the spatio-temporal features. Before the classification layer, an attention mechanism layer is introduced to perform feature selection on the linear sequence; the process can be expressed as:

h'_i = Σ_j w_ij · h_j

where the weight w_ij is computed from h'_{i-1} and h_j through an intermediate score v_ij and a softmax normalization:

w_ij = exp(v_ij) / Σ_k exp(v_ik)

so that v_ij denotes an intermediate variable and the final weights w_ij satisfy Σ_j w_ij = 1.

By applying this weighted transformation to the original input sequence, the attention mechanism lets the model, according to the training error of sign language recognition, concentrate its weight optimization on the key information of the encoded output, strengthening the association between the video spatio-temporal features and the sign language isolated words. After all feature extraction and selection are completed, the high-level semantic features are fed into a fully connected classifier (a multilayer perceptron, MLP), and the final output is a sentence containing the keywords.
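A sketch of the spatial-channel pipeline described above (two spatio-temporal encoding layers, two pooling layers, flattening with normalization, an attention layer and a fully connected classifier), again assuming PyTorch. For brevity the ConvLSTM layers are stood in for by 3-D convolutions, and the channel counts, frame count and vocabulary size are invented placeholders rather than values from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialChannel(nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        # Two spatio-temporal encoding layers (stand-ins for ConvLSTM1-2)
        self.enc1 = nn.Conv3d(3, 16, kernel_size=3, padding=1)
        self.enc2 = nn.Conv3d(16, 32, kernel_size=3, padding=1)
        # Two pooling layers (Pool1-2): max over a 2x2 spatial window
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.norm = nn.LayerNorm(32)
        # Attention layer that re-weights the flattened sequence
        self.attn_score = nn.Linear(32, 1)
        # Fully connected classifier (MLP) over the attended features
        self.mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                                 nn.Linear(64, vocab_size))

    def forward(self, video):                      # video: (B, 3, T, H, W)
        x = self.pool(F.relu(self.enc1(video)))
        x = self.pool(F.relu(self.enc2(x)))        # (B, 32, T, H/4, W/4)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 2, 3, 4, 1).reshape(b, t * h * w, c)
        seq = self.norm(seq)                       # flattened one-dimensional linear sequence
        w_attn = torch.softmax(self.attn_score(seq), dim=1)   # attention weights sum to 1
        context = (w_attn * seq).sum(dim=1)        # weighted fusion of the key information
        return self.mlp(context)                   # keyword logits for the clip

logits = SpatialChannel()(torch.randn(1, 3, 16, 64, 64))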
As shown in fig. 3 and fig. 5, detecting error information in the text data with the trained sign language syntax error detection model specifically includes:
masking 10% of the isolated words in the trained sign language syntax error detection model, so that the model does not know whether the isolated word at the current position fits the overall semantics; the trained sign language syntax error detection model is thereby guided to predict the masked isolated words from the context information, and error information in the text data is detected;
determining, according to the detected text data error information, the embedding vector of the sentence to be corrected and the sentence output after correction by the sign language syntax error detection model, and uniformly segmenting the video frame by frame in the spatial dimension;
and converting the two-dimensional images, according to the slice and position codes, into a one-dimensional linear sequence containing fine-grained region codes, generating in advance the embedding vector of the sentence to be corrected and the error probability of each position; if the probability is larger than a set threshold, the corresponding position is marked as erroneous.
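A minimal sketch of the detection step, again assuming PyTorch: a small bidirectional encoder (here a GRU, standing in for the masked-prediction model) outputs a per-position error probability, and positions whose probability exceeds the configured threshold are flagged. The vocabulary size, token ids and threshold value are illustrative assumptions, not the patent's.

import torch
import torch.nn as nn

class ErrorDetector(nn.Module):
    """Predicts, for every position of a sign-language gloss sentence,
    the probability p(i) that the word at that position is wrong."""

    def __init__(self, vocab_size=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Bidirectional encoder so each position sees its context on both sides,
        # mimicking prediction that relies on the surrounding words.
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.error_head = nn.Linear(2 * dim, 1)

    def forward(self, token_ids):                      # (B, L) word indices
        h, _ = self.encoder(self.embed(token_ids))
        return torch.sigmoid(self.error_head(h)).squeeze(-1)   # (B, L) in [0, 1]

detector = ErrorDetector()
tokens = torch.tensor([[12, 87, 3, 451, 9]])           # one gloss sentence
p_err = detector(tokens)
threshold = 0.5                                         # illustrative threshold
error_positions = (p_err > threshold).nonzero()         # positions flagged as wrong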
In the error-correction stage, prior knowledge of sign language syntax is used to guide the error-correction reasoning process: a syntactic error-correction candidate set is constructed from the sign language knowledge graph, and the association information among isolated-word entities is used to fill in or replace the isolated words, thereby completing the correction of the sentence. Constructing the syntactic error-correction candidate set from the sign language syntax corpus and correcting the text data error information specifically includes:
feature fusion was performed using Soft-mask:
the error probability of each position is multiplied by the characteristic of the mask character to be input as a first part of an error corrector, the non-error probability is multiplied by the original input characteristic to be input as a second part, and the two parts are added to be the characteristic of each character;
training the fused features:
the input original one-dimensional sequence is randomly masked according to a set probability, so that the model does not know whether the masked word at the current position is a correct sign language isolated word; the original value of the masked word is predicted from the context-associated vocabulary, all candidates in the set are compared and ranked by error probability, and the candidate word with the highest probability is output to fill in or replace the error. The calculation logic is as follows:
Attention(Q, K, V) = softmax((Q · K^T) / √d + B) · V

Multi-Head = [Attention_1, Attention_2, ..., Attention_n]

In the above formulas, Q, K and V are obtained by linear transformations of the input vector x: Q denotes the query matrix (Query), K the key matrix (Key) and V the value matrix (Value); B denotes a position-offset matrix, and d denotes the dimension of the query vector Q and the key vector K. Scaling the inner product of Q and the transpose of K prevents the vector distribution from becoming uneven when an excessively large inner product passes through the classifier. Multi-Head is the concatenation of multiple attention heads. After the high-dimensional spatial features are extracted, each sign language frame is treated as one slice in the time dimension, and the temporal correlation among the frames is obtained after the temporal information is encoded.
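The Soft-Mask fusion and correction described above can be sketched as follows, assuming PyTorch; the corrector here is a small Transformer encoder with a linear head over the candidate vocabulary, and all sizes, the mask id and the example error probabilities are invented for illustration.

import torch
import torch.nn as nn

class SoftMaskedCorrector(nn.Module):
    """Fuses the detector's per-position error probability with a mask
    embedding, then predicts a replacement word for every position."""

    def __init__(self, vocab_size=5000, dim=128, mask_id=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mask_id = mask_id
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.out = nn.Linear(dim, vocab_size)           # scores over the candidate set

    def forward(self, token_ids, p_err):                # p_err: (B, L) from the detector
        e_in = self.embed(token_ids)                     # original input features
        e_mask = self.embed(torch.full_like(token_ids, self.mask_id))
        p = p_err.unsqueeze(-1)
        # Soft-Mask fusion: error part plus non-error part, added per character
        fused = p * e_mask + (1.0 - p) * e_in
        h = self.encoder(fused)
        return self.out(h)                               # (B, L, vocab)

corrector = SoftMaskedCorrector()
tokens = torch.tensor([[12, 87, 3, 451, 9]])
p_err = torch.tensor([[0.05, 0.92, 0.10, 0.03, 0.07]])  # position 1 looks wrong
logits = corrector(tokens, p_err)
best = logits.argmax(-1)          # highest-probability candidate fills each position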
To ensure the error-correction accuracy of the language model, prior knowledge is introduced in the form of a sign language knowledge graph to assist the syntactic error-correction reasoning. Nodes in the graph represent isolated-word entities, and edges represent word-to-word relationships. During error correction, the masked-word label is fed into the graph according to its position code i. The computational logic of the module that feeds the graph structure into the network is as follows:
H^(l) = f(H^(l-1), A)

In the above formula, H^(l) denotes the output of the l-th layer, l denotes the current layer number, and A denotes the adjacency matrix of the graph (with self-connections added); H^(l=0) = X denotes the input layer. f(·) denotes the layer-to-layer propagation, whose computational logic is:

H^(l) = σ(D^(-1/2) · A · D^(-1/2) · H^(l-1) · W)

In the above formula, H^(l-1) denotes the features of the current layer, W is a weight matrix, and D is the degree matrix, whose computational logic is:

D_ii = Σ_j A_ij

σ(·) denotes a nonlinear activation function, preferably the rectified linear unit (ReLU):

ReLU(X) = max(0, X)

In the above formula, if the input X of the activation function is less than or equal to 0, X is forced to 0; if the input X > 0, the original value is kept. This activation function guarantees a certain sparsity of the output, which accelerates the convergence of the model during training.
Following the connecting edges between entities, the model searches for error-prone words closely related to the masked word, establishes associations one by one according to the error probability p(i), and builds a graph data structure composed of the masked word and its error-prone words. A graph convolutional neural network then performs an aggregation operation on the graph nodes, so that the resulting word embedding contains not only the vector representation of the isolated word but also the contextual semantic-relation features between the masked word and its related words. Finally, the aggregated graph-structure information is injected into the word embeddings of the model, helping the model restore the correct sentence.
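A compact sketch of the graph-convolution propagation rule used to aggregate the masked word and its error-prone neighbours, assuming PyTorch; the toy adjacency matrix and embedding dimension are invented for illustration.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One propagation step H_next = ReLU(D^-1/2 · A_hat · D^-1/2 · H · W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))                # adjacency with self-connections
        deg = A_hat.sum(dim=1)                          # diagonal of the degree matrix
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt        # symmetric normalisation
        return torch.relu(A_norm @ self.weight(H))

# Toy graph: 4 isolated-word entities, edges link the masked word to error-prone words
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 0., 1.],
                  [1., 0., 0., 0.],
                  [0., 1., 0., 0.]])
H = torch.randn(4, 16)                                  # node (word) embeddings
H_next = GraphConvLayer(16, 16)(H, A)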
As shown in fig. 4, S3 specifically includes:
constructing a discrimination network, adopting sign language limb language video data as training data, and generating a true/false classifier by extracting the characteristic training of a real sample;
constructing a generating network, acquiring a feedback statement sequence, performing distribution transformation by using random noise, and mapping an input space to a sample space to obtain a generated sample;
training the sign language feedback animation generation model, sending the generated sample into a discrimination network for true/false discrimination and classification, and continuously iterating until the difference between the generated sample and the true sample is smaller than a threshold value, and generating feedback video data through the sign language feedback animation generation model.
First, a unified cross-modal representation between the text and the animation is established, so that information from different modalities is embedded into the same common vector space and the distance between samples can be measured; then a generative adversarial network is used to carry out cross-modal adversarial training in a game-playing manner and generate the demonstration animation of the sign language feedback.
The computational logic for the sign language feedback animation generation model is as follows:
in the above expression, D represents a discrimination network, and G represents a generation network. The discrimination network D of the model uses the real sign language sample X as training data, and a true/false two-classifier is trained by extracting the characteristics of the real sample. The generating network device G uses random noise Z to perform distribution transformation according to the feedback statement sequence, and maps the input space to the sample space to obtain a generating sampleModel training phase, will->Sending into discrimination network D for true/false discrimination classification to make sample +.>The gap from the real sample X is as small as possible. With training iteration, the similarity between the generated sample and the target real sample gradually increases until the generated sampleCan be identified as a real sample by the discriminator D, and represents that the output animation features are completeRestoring limb semantics in the video sample.
The overall optimization objective function of the sign language feedback animation generation model is as follows:
min_G max_D V(D, G) = E_{x~μ}[log D(x)] + E_{z~γ}[log(1 - D(G(z)))]

In the above formula, the two parts correspond to the optimization targets of the discrimination network and the generation network; x~μ and z~γ denote expectations over samples drawn from the specified real-data and noise distributions. The adversarial generation model converts the random sample distribution into generated samples through the minimax game between the discriminant function and the generating function. Specifically, for a given generator G, the optimized discriminator D constantly tries to assign high values to samples from the real distribution while assigning low values to generated samples. The generator converts the random sample distribution into generated samples, and the discriminator tries to distinguish them from training samples drawn from the real sign language data set, so that the generation of the sign language animation is converted into a two-player, non-convex optimization problem: the generated sign language animation samples become similar in distribution to the real sign language training samples, satisfying the task of generating the feedback animation.
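The adversarial training loop implied by the objective above can be sketched as follows, assuming PyTorch; the generator and discriminator are stand-in multilayer perceptrons, and the random tensors stand in for the real motion features and the noise-plus-sentence code, so this shows only the alternating optimization, not the patent's actual architectures.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 256))   # noise -> motion features
D = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 256)           # placeholder for real sign-language motion features
    z = torch.randn(32, 64)               # random noise (standing in for noise + sentence code)
    fake = G(z)

    # Discriminator: assign high values to real samples, low values to generated ones
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator label generated samples as real
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    # Training stops once the generated samples are close enough to the real distribution.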
In some possible embodiments, generating the feedback video data further includes data processing the sign language limb language video data, extracting three-dimensional skeletal coordinates and two-dimensional image limb information, and sending the three-dimensional skeletal coordinates and the two-dimensional image limb information to a generation network;
The similarity of the limb trajectories is then calculated frame by frame through the discrimination network, and the feedback video demonstration animation is generated in real time.
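A small sketch of the frame-by-frame limb-trajectory similarity check, assuming the generated and reference poses are available as per-frame arrays of 3-D joint coordinates; the joint count and frame count are illustrative.

import numpy as np

def trajectory_similarity(generated, reference):
    """Frame-by-frame cosine similarity between two pose sequences,
    each of shape (num_frames, num_joints * 3) of 3-D skeleton coordinates."""
    sims = []
    for g, r in zip(generated, reference):
        sims.append(float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r) + 1e-8)))
    return float(np.mean(sims))

gen = np.random.rand(30, 21 * 3)     # 30 frames, 21 joints (illustrative)
ref = np.random.rand(30, 21 * 3)
score = trajectory_similarity(gen, ref)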
The second aspect of the present embodiment is an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements a method for correcting sign language of a deaf-mute based on machine learning when executing the program.
A third aspect of the present embodiment is a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements a machine learning-based method for error correction of sign language for a deaf-mute.
The foregoing detailed description further explains the objects, technical solutions and advantageous effects of the invention. It should be understood that the foregoing describes only specific embodiments and is not intended to limit the scope of protection of the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (8)

1. The deaf-mute sign language error correction method based on machine learning is characterized by comprising the following specific steps of:
s1, acquiring sign language limb language video data, constructing a video space-time synchronization feature coding frame, extracting features of the video data, and translating the extracted features to obtain text data;
s2, establishing a sign language syntax corpus and a sign language syntax error detection model, pre-training the sign language sentence method error detection model according to the sign language syntax corpus, detecting errors of text data by the trained sign language syntax error detection model, constructing a syntax error correction candidate set according to the sign language syntax corpus, and correcting errors of the text data;
the method for detecting text data error information through the trained sign language syntax error detection model specifically comprises the following steps:
shielding the isolated words in the trained sign language syntax error detection model, so that the model does not know whether the isolated words at the current position conform to the whole semantics, the trained sign language syntax error detection model is guided to predict the isolated words depending on the context information, and error information in text data is detected;
determining the sentence embedding vector to be corrected and the sentence output after the sign language syntax error detection model correction according to the text data error information, and uniformly segmenting the video frame by frame in the space dimension;
converting the two-dimensional image into a one-dimensional linear sequence containing fine-granularity region codes according to slice and position codes, generating an embedded vector of a statement to be corrected and the error probability of each position in advance, and if the error probability is greater than a set threshold value, indicating that the corresponding position is in error;
the method for constructing the syntactic error correction candidate set according to the sign language syntactic corpus, and correcting the text data error information specifically comprises the following steps:
feature fusion was performed using Soft-mask:
the error probability of each position is multiplied by the characteristic of the mask character to be input as a first part of an error corrector, the non-error probability is multiplied by the original input characteristic to be input as a second part, and the two parts are added to be the characteristic of each character;
training the fused features:
randomly shielding an input original one-dimensional sequence according to a set probability, so that a model does not know whether a shielding word at the current position is a correct sign language isolated word, predicting the original value of the shielding word by depending on a context associated vocabulary, comparing and sequencing all candidate sets by error probability, and outputting a candidate word with the highest probability for filling and replacing error correction;
s3, establishing cross-modal unified characterization of the text data and the sign language limb language video data, embedding different modal information into the same common vector space, constructing a sign language feedback animation generation model, and performing cross-modal countermeasure training on the corrected text data to generate feedback video data.
2. The machine learning-based sign language error correction method for the deaf-mute is characterized by comprising the steps of constructing a video space-time synchronous feature coding framework to extract features of video data, and specifically comprising the following steps:
and extracting synchronous features of the two-dimensional space domain image and the one-dimensional time domain sequence from the video data by adopting a long-short-time memory network ConvLSTM with convolution operation, wherein the space channel adopts ConvLSTM to carry out deep learning on the sign language two-dimensional image sequence, the local time domain channel ConvLSTM carries out deep learning on optical flow features, and the global time domain channel ConvLSTM carries out deep learning on video semantic actions.
3. The machine learning-based sign language error correction method for the deaf-mute according to claim 2, wherein the space channel adopts ConvLSTM to perform deep learning on a sign language two-dimensional image sequence, and specifically comprises the following steps:
performing feature extraction by adopting two convolution time sequence layers, and encoding sign language video into a two-dimensional tensor containing space-time information;
obtaining the maximum value of all feature extraction of a two-dimensional tensor in a designated area and the window size of pooling operation in a space dimension, and determining the output of a pooling layer according to the output of a convolution time sequence layer to obtain high-level semantic features;
carrying out linear leveling operation on the space-time characteristics by adopting a normalization layer to obtain a one-dimensional linear sequence fused with the space-time characteristics;
and introducing an attention mechanism layer, extracting and screening the characteristics of the linear sequence, sending the high-level semantic characteristics into a full-connection classifier, and outputting sentences containing keywords.
4. The machine learning-based sign language error correction method for the deaf-mute is characterized in that the method further comprises the following steps in the feature extraction process by adopting the long-short-time memory network ConvLSTM:
constructing a forgetting door, an input door and an output door;
the forgetting gate determines information to be reserved by the neuron at the current moment according to the current input and the output at the last moment;
the input gate controls the proportion of information added into the candidate state according to the current input, the output at the last moment and the weight of the forgetting gate so as to generate a new state;
and the output gate determines the output value at the moment according to the new state, and controls the information residual proportion of the video data according to the normalized range of the output value.
5. The machine learning-based sign language error correction method for the deaf-mute according to claim 1, wherein the step S3 specifically comprises:
constructing a discrimination network, adopting sign language limb language video data as training data, and generating a true/false classifier by extracting the characteristic training of a real sample;
constructing a generating network, acquiring a feedback statement sequence, performing distribution transformation by using random noise, and mapping an input space to a sample space to obtain a generated sample;
training the sign language feedback animation generation model, sending the generated sample into a discrimination network for true/false discrimination and classification, and continuously iterating until the difference between the generated sample and the true sample is smaller than a threshold value, and generating feedback video data through the sign language feedback animation generation model.
6. The machine learning-based sign language error correction method for the deaf-mute according to claim 1, wherein the generating feedback video data further comprises data processing of sign language limb language video data, extracting three-dimensional skeleton coordinates and two-dimensional image limb information, and sending the three-dimensional skeleton coordinates and the two-dimensional image limb information to a generating network;
and calculating the similarity of the limb track frame by frame through a discrimination network, and generating feedback video data demonstration animation in real time.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a machine learning based sign language error correction method as claimed in any one of claims 1 to 6 when the program is executed by the processor.
8. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a machine learning based sign language error correction method for a deaf-mute as claimed in any one of claims 1 to 6.
CN202211632041.3A 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium Active CN116151226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211632041.3A CN116151226B (en) 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211632041.3A CN116151226B (en) 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium

Publications (2)

Publication Number Publication Date
CN116151226A CN116151226A (en) 2023-05-23
CN116151226B true CN116151226B (en) 2024-02-23

Family

ID=86339973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211632041.3A Active CN116151226B (en) 2022-12-19 2022-12-19 Machine learning-based deaf-mute sign language error correction method, equipment and medium

Country Status (1)

Country Link
CN (1) CN116151226B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463250A (en) * 2014-12-12 2015-03-25 广东工业大学 Sign language recognition translation method based on Davinci technology
CN108256458A (en) * 2018-01-04 2018-07-06 东北大学 A kind of two-way real-time translation system and method for deaf person's nature sign language
CN109922371A (en) * 2019-03-11 2019-06-21 青岛海信电器股份有限公司 Natural language processing method, equipment and storage medium
CN110070065A (en) * 2019-04-30 2019-07-30 李冠津 The sign language systems and the means of communication of view-based access control model and speech-sound intelligent
WO2019226051A1 (en) * 2018-05-25 2019-11-28 Kepler Vision Technologies B.V. Monitoring and analyzing body language with machine learning, using artificial intelligence systems for improving interaction between humans, and humans and robots
KR102081854B1 (en) * 2019-08-01 2020-02-26 전자부품연구원 Method and apparatus for sign language or gesture recognition using 3D EDM
CN113609922A (en) * 2021-07-13 2021-11-05 中国矿业大学 Continuous sign language sentence recognition method based on mode matching
CN113780059A (en) * 2021-07-24 2021-12-10 上海大学 Continuous sign language identification method based on multiple feature points
CN113822187A (en) * 2021-09-10 2021-12-21 阿里巴巴达摩院(杭州)科技有限公司 Sign language translation, customer service, communication method, device and readable medium
CN114283493A (en) * 2021-12-09 2022-04-05 深圳市尚影视界科技有限公司 Artificial intelligence-based identification system
CN114817465A (en) * 2022-04-14 2022-07-29 海信电子科技(武汉)有限公司 Entity error correction method and intelligent device for multi-language semantic understanding
CN114842547A (en) * 2022-01-11 2022-08-02 南京工业大学 Sign language teaching method, device and system based on gesture action generation and recognition

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
US11263409B2 (en) * 2017-11-03 2022-03-01 Board Of Trustees Of Michigan State University System and apparatus for non-intrusive word and sentence level sign language translation
US10388272B1 (en) * 2018-12-04 2019-08-20 Sorenson Ip Holdings, Llc Training speech recognition systems using word sequences


Also Published As

Publication number Publication date
CN116151226A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110609891B (en) Visual dialog generation method based on context awareness graph neural network
CN110377710B (en) Visual question-answer fusion enhancement method based on multi-mode fusion
CN109670576B (en) Multi-scale visual attention image description method
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN111444968A (en) Image description generation method based on attention fusion
CN111858882A (en) Text visual question-answering system and method based on concept interaction and associated semantics
CN112905795A (en) Text intention classification method, device and readable medium
CN112699682A (en) Named entity identification method and device based on combinable weak authenticator
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113779310B (en) Video understanding text generation method based on hierarchical representation network
CN112905762A (en) Visual question-answering method based on equal attention-deficit-diagram network
Pezzelle et al. Is the red square big? MALeViC: Modeling adjectives leveraging visual contexts
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN116402066A (en) Attribute-level text emotion joint extraction method and system for multi-network feature fusion
CN115827954A (en) Dynamically weighted cross-modal fusion network retrieval method, system and electronic equipment
CN111046655A (en) Data processing method and device and computer readable storage medium
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN114048290A (en) Text classification method and device
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN116737897A (en) Intelligent building knowledge extraction model and method based on multiple modes
CN116151226B (en) Machine learning-based deaf-mute sign language error correction method, equipment and medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant