CN115830711A - Sign language vocabulary recognition method, system, device and medium based on deep learning - Google Patents

Sign language vocabulary recognition method, system, device and medium based on deep learning

Info

Publication number: CN115830711A
Application number: CN202211500177.9A
Authority: CN (China)
Prior art keywords: sign language, heatmap, feature, spatial, sign language video
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 张昊, 刘增辉, 林立新, 孙意翔, 肖婴然, 李昆霖
Current Assignee: Central South University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Central South University
Priority date: 2022-11-28 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-11-28
Publication date: 2023-03-21
Application filed by Central South University
Priority to CN202211500177.9A
Publication of CN115830711A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sign language vocabulary recognition method, system, device and medium based on deep learning. The method comprises: acquiring a sign language video; inputting the sign language video into a trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video; performing second feature extraction through a lightweight temporal feature rapid screening model to obtain heatmap spatial features; performing spatial feature screening of human key point information on the heatmap spatial features to obtain human key point spatial features; performing feature learning through a bidirectional LSTM temporal model with an attention mechanism to obtain a sign language video learning result; performing classification and coding through a fully connected layer and a softmax layer to obtain a sign language video classification coding result; and querying a sign language vocabulary recognition result according to the sign language video classification coding result. The invention can improve the accuracy of sign language recognition.

Description

Sign language vocabulary recognition method, system, device and medium based on deep learning
Technical Field
The invention relates to the technical field of sign language recognition, and in particular to a sign language vocabulary recognition method, system, device and medium based on deep learning.
Background
Existing sign language vocabulary recognition models mainly perform recognition from RGB modal information and the coordinate positions of skeleton key points. Effectively extracting sign language action information from RGB information requires deeper neural networks, which increases the computational load of the model and makes real-time performance difficult to achieve.
Most existing networks based on sequence models such as LSTM and Transformer do not pre-screen temporally redundant video information before input, but leave the sequence network to learn the effective features directly. In general, the differences between adjacent frames in a 30 fps video sequence are very small, so the sequence information is compressed quickly before being input to the sequence model in order to reduce the redundant information from the video; this technique, however, can weaken the learning capability of the model and thus reduce the accuracy of sign language recognition.
Disclosure of Invention
The present invention is directed to solving at least one of the problems in the prior art. To this end, the invention provides a sign language vocabulary recognition method, system, device and medium based on deep learning, which can improve the learning capability of the model and the accuracy of sign language recognition.
In a first aspect, an embodiment of the present invention provides a sign language vocabulary recognition method based on deep learning, the method comprising:
acquiring a sign language video;
inputting the sign language video into a trained human pose estimation network model for first feature extraction, and obtaining the heatmaps in the sign language video output by the human pose estimation network model;
performing second feature extraction on the heatmaps through a lightweight temporal feature rapid screening model to obtain heatmap spatial features;
performing spatial feature screening of human key point information on the heatmap spatial features to obtain human key point spatial features;
performing feature learning on the human key point spatial features through a bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result output by the model;
performing classification and coding on the sign language video learning result through a fully connected layer and a softmax layer to obtain a sign language video classification coding result;
and querying a sign language vocabulary recognition result according to the sign language video classification coding result.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
the method comprises the steps of obtaining a sign language video; inputting the sign language video into a trained human body posture estimation network model for first feature extraction to obtain a heatmap image in the sign language video output by the human body posture estimation network model; and performing second feature extraction on the heatmap graph through a time-sequence lightweight-based feature rapid screening model to obtain the heatmap spatial features. The method can improve the speed of sign language recognition and reduce the calculation amount by carrying out feature extraction on the heatmap through a time-sequence light-weight feature rapid screening model. Carrying out spatial feature screening on the heatmap spatial features of the human key point information to obtain the human key point spatial features; carrying out feature learning on the spatial features of the key points of the human body through a bidirectional LSTM time sequence model with an attention mechanism to obtain a sign language video learning result output by the bidirectional LSTM time sequence model with the attention mechanism; classifying and coding the sign language video learning result through a full connection layer and a softmax layer to obtain a sign language video classification coding result; and inquiring to obtain a sign language vocabulary recognition result according to the sign language video classification coding result. According to the method, the learning capacity of the model can be improved through feature screening and learning, and the feature extraction and feature learning are carried out on the basis of the heatmap by extracting the heatmap in the sign language video instead of directly extracting the coordinate information of the key point position of the human body as the feature information, so that the accuracy of embedding the motion into the feature information is increased, and the accuracy of sign language identification can be improved.
According to some embodiments of the present invention, before the sign language video is input into the trained human pose estimation network model for the first feature extraction, the deep-learning-based sign language vocabulary recognition method further comprises:
presetting an ideal frame number for the sign language video;
and if the number of frames in the sign language video is less than the ideal frame number, padding the video with blank frames up to the ideal frame number to obtain a processed sign language video, whose frame count is greater than or equal to the ideal frame number.
According to some embodiments of the invention, the human key points are selected as a plurality of key points on the nose, eyes, ears, arms and fingers of the human body.
According to some embodiments of the present invention, performing second feature extraction on the heatmaps through the lightweight temporal feature rapid screening model to obtain the heatmap spatial features comprises:
inputting the heatmaps through multiple channels corresponding to the video frame sequence;
applying two-dimensional adaptive average pooling, a fully connected layer and ReLU activation to the heatmaps to obtain global heatmap temporal features;
applying a first grouped convolution, batch normalization, ReLU activation and a second grouped convolution to the heatmaps along the temporal dimension to obtain local heatmap temporal features;
adding the global heatmap temporal features and the local heatmap temporal features to obtain summed heatmap temporal features;
activating the summed heatmap temporal features with a Sigmoid activation function to generate the frame weights corresponding to them;
and multiplying the frame weights element-wise with the input heatmaps to obtain the heatmap spatial features.
According to some embodiments of the present invention, performing spatial feature screening of human key point information on the heatmap spatial features to obtain the human key point spatial features comprises:
merging the temporal dimension T of the heatmap spatial feature of size B×T×C×H×W with the human key point channel dimension C to obtain a heatmap spatial feature of size B×(T×C)×H×W, where B denotes the number of data items processed at once during training or inference, H the heatmap height and W the heatmap width;
applying a first grouped convolution, grouped by the human key point channel dimension C, to the heatmap spatial feature of size B×(T×C)×H×W to obtain a heatmap spatial feature of size B×C×H×W;
applying batch normalization, ReLU activation and a second grouped convolution grouped by C to the heatmap spatial feature of size B×C×H×W to obtain local heatmap spatial features;
applying adaptive average pooling and convolution to the heatmap spatial feature of size B×(T×C)×H×W to obtain global heatmap spatial features;
multiplying the local heatmap spatial features and the global heatmap spatial features element-wise to obtain multiplied heatmap spatial features;
and passing the multiplied heatmap spatial features through a third grouped convolution, grouped by C, and a Mish activation function to obtain the human key point spatial features.
According to some embodiments of the present invention, before the feature learning of the human key point spatial features through the bidirectional LSTM temporal model with an attention mechanism, the method further comprises:
applying Dropout random deactivation and fully connected dimension reduction to the human key point spatial features.
According to some embodiments of the present invention, querying a sign language vocabulary recognition result according to the sign language video classification coding result comprises:
generating a sign language vocabulary coding table from the sign language texts corresponding to the sign language videos;
and looking up the sign language video classification coding result in the sign language vocabulary coding table to obtain the predicted sign language vocabulary recognition result.
In a second aspect, an embodiment of the present invention further provides a deep learning based sign language vocabulary recognition system, comprising:
a data acquisition unit for acquiring a sign language video;
a first feature extraction unit for inputting the sign language video into a trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video output by the model;
a second feature extraction unit for performing second feature extraction on the heatmaps through a lightweight temporal feature rapid screening model to obtain heatmap spatial features;
a feature screening unit for performing spatial feature screening of human key point information on the heatmap spatial features to obtain human key point spatial features;
a feature learning unit for performing feature learning on the human key point spatial features through a bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result output by the model;
a result acquisition unit for classifying and coding the sign language video learning result through a fully connected layer and a softmax layer to obtain a sign language video classification coding result;
and a vocabulary recognition unit for querying a sign language vocabulary recognition result according to the sign language video classification coding result.
In a third aspect, an embodiment of the present invention further provides a sign language vocabulary recognition device based on deep learning, comprising at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the deep learning based sign language vocabulary recognition method described above.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to cause a computer to execute a sign language vocabulary recognition method based on deep learning as described above.
It is to be understood that the advantageous effects of the second aspect to the fourth aspect compared to the related art are the same as the advantageous effects of the first aspect compared to the related art, and reference may be made to the related description of the first aspect, which is not repeated herein.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart of a method for sign language vocabulary recognition based on deep learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for sign language vocabulary recognition based on deep learning according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of human key point definition in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the finger key point definition according to an embodiment of the present invention;
FIG. 5 is a block diagram of a deep learning based sign language vocabulary recognition system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, terms such as first and second are used only to distinguish technical features; they should not be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating their precedence.
In the description of the present invention, it should be understood that the orientation descriptions, such as the orientation or positional relationship indicated by upper, lower, etc., are based on the orientation or positional relationship shown in the drawings, and are only for convenience of description and simplification of the description, but do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that unless otherwise explicitly defined, terms such as setup, installation, connection, etc. should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention by combining the detailed contents of the technical solutions.
First, several terms referred to in the present application are explained:
Sign language vocabulary recognition: the conversion of a video sequence of a person performing an isolated sign into the corresponding vocabulary text. It can be understood as a classification task in deep learning: if there are 500 vocabulary items, the input to the model is a sign language video and the output is the word category.
Most existing networks based on sequence models such as LSTM and Transformer do not pre-screen temporally redundant video information before input, but leave the sequence network to learn the effective features directly. In general, the differences between adjacent frames in a 30 fps video sequence are very small, so the sequence information is compressed quickly before being input to the sequence model in order to reduce the redundant information from the video; this technique, however, can weaken the learning capability of the model and thus reduce the accuracy of sign language recognition.
In order to solve these problems, the invention acquires a sign language video; inputs it into a trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video output by the model; and performs second feature extraction on the heatmaps through a lightweight temporal feature rapid screening model to obtain the heatmap spatial features. Extracting features from the heatmaps with this lightweight temporal screening model increases the speed of sign language recognition and reduces the amount of computation. The invention further performs spatial feature screening of human key point information on the heatmap spatial features to obtain the human key point spatial features; performs feature learning on them through a bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result; classifies and codes that result through a fully connected layer and a softmax layer to obtain a sign language video classification coding result; and queries a sign language vocabulary recognition result from the classification coding result. The feature screening and learning improve the learning capability of the model, and performing feature extraction and feature learning on the heatmaps extracted from the sign language video, instead of directly extracting the coordinates of the human key point positions as feature information, increases the accuracy with which the action is embedded into the feature information and thus improves the accuracy of sign language recognition.
Referring to FIGS. 1 to 2, an embodiment of the present invention provides a sign language vocabulary recognition method based on deep learning, the method comprising:
and S100, acquiring a sign language video.
Specifically, an ideal frame number is preset for the sign language video;
if the number of frames in the sign language video is less than the ideal frame number, the video is padded with blank frames up to the ideal frame number to obtain a processed sign language video whose frame count is greater than or equal to the ideal frame number. For example:
an input sign language video comprises a plurality of frames, and different sign language videos contain different numbers of frames. This embodiment therefore first samples each video so that it contains a fixed number N of frames, where N is the preset ideal frame number; a video with fewer than N frames is padded with blank frames until it reaches N frames, as the sketch below illustrates.
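As an illustration of this sampling and padding, the following sketch (PyTorch is assumed here and in the later sketches) pads a short video with blank frames up to N; the uniform sampling used for videos longer than N frames is an assumption, since the embodiment only specifies the padding.

```python
# A minimal sketch of the frame-count normalization described above.
import torch

def fix_frame_count(frames: torch.Tensor, n_ideal: int) -> torch.Tensor:
    """frames: (T, C, H, W) video tensor; returns exactly (n_ideal, C, H, W)."""
    t = frames.shape[0]
    if t < n_ideal:
        # Pad with blank (all-zero) frames until the video reaches N frames.
        blanks = torch.zeros(n_ideal - t, *frames.shape[1:], dtype=frames.dtype)
        return torch.cat([frames, blanks], dim=0)
    # Assumed strategy for longer videos: sample n_ideal frames uniformly.
    idx = torch.linspace(0, t - 1, n_ideal).long()
    return frames[idx]

clip = fix_frame_count(torch.randn(20, 3, 256, 256), n_ideal=32)  # (32, 3, 256, 256)
```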
Step S200: inputting the sign language video into the trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video output by the model.
Specifically, this embodiment uses a pre-trained human pose estimation network model to perform the first feature extraction on the sign language video and obtain the heatmaps it outputs. Sign language videos in RGB mode are input into the pose estimation network, which infers the corresponding heatmap for each human key point through convolution and related operations. Each heatmap could be converted by further calculation into the 2D coordinates (x, y) of a key point position, but this embodiment does not compute the 2D coordinates and only computes the heatmaps.
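To make the shape flow of this step concrete, the sketch below stands in for the trained pose estimation network with a tiny convolutional network; in practice the pre-trained HRNet discussed below would be loaded instead, so every layer size here is an illustrative assumption.

```python
# Sketch of step S200: RGB frames in, one heatmap per key point per frame out.
# A tiny stand-in conv net plays the role of the trained pose network so the
# example runs end to end; a real pipeline would load pre-trained HRNet weights.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 42  # the 42 key points selected in this embodiment

pose_net = nn.Sequential(                      # stand-in for the pose network
    nn.Conv2d(3, 32, kernel_size=3, stride=4, padding=1), nn.ReLU(),
    nn.Conv2d(32, NUM_KEYPOINTS, kernel_size=3, padding=1),
)

video = torch.randn(16, 3, 256, 256)           # N = 16 RGB frames
with torch.no_grad():
    heatmaps = pose_net(video)                 # (16, 42, 64, 64)
# The heatmaps are kept as-is; the 2D coordinates (x, y) are never decoded.
```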
In this embodiment there is one heatmap per human key point, and the human key points are selected as a plurality of key points on the nose, eyes, ears, arms and fingers, specifically as follows:
Regarding the selection of key points, referring to FIGS. 3 to 4, this embodiment selects for the sign language task the 42 points on the nose, eyes, ears, arms and fingers shown in FIGS. 3 and 4. They are defined by the reference numbers in the figures: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 17, 18, 94, 97, 98, 99, 100, 102, 103, 104, 106, 107, 108, 110, 111, 112, 114, 115, 118, 119, 120, 121, 123, 124, 125, 127, 128, 129, 131, 132, 133, 135. The upper finger joint points, for example points 101 and 105, barely deform and are therefore not sampled.
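For illustration, the reference numbers above can be kept as an index list used to slice the corresponding channels out of a pose network's full heatmap output; the slicing itself is an assumption, as the patent only fixes the set of points.

```python
# Key point reference numbers enumerated in the description and figures.
KEYPOINT_IDS = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 17, 18,
    94, 97, 98, 99, 100, 102, 103, 104, 106, 107, 108,
    110, 111, 112, 114, 115, 118, 119, 120, 121,
    123, 124, 125, 127, 128, 129, 131, 132, 133, 135,
]

# e.g. selected = full_heatmaps[:, KEYPOINT_IDS]  # keep only these channels
```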
It should be noted that the pre-training of the human pose estimation network model of this embodiment may follow the repository at https://github.com/leoxiaobin/deep-High-Resolution-Net, which contains the HRNet network model (High-Resolution Net). HRNet was proposed for the 2D human pose estimation (Human Pose Estimation, or Keypoint Detection) task and mainly targets pose estimation of a single person (that is, the image input to the network should contain only one human body). Human pose estimation is widely applied at present, for example in human action recognition, human-computer interaction (a person performs an action to trigger the system to execute a task), and animation (generating the corresponding motion of a cartoon character from the key point information of a human body). The training process of the pre-trained human pose estimation network model is not described in detail in this embodiment.
In this embodiment, the coordinate information of the human key point positions is not extracted directly as feature information; instead, the heatmaps that the key point detection network produces just before it would output the coordinate positions are used as the feature information. This increases the accuracy with which the action is embedded into the feature information and improves the capture of the sign language, thereby improving the accuracy of sign language vocabulary recognition.
Step S300: performing second feature extraction on the heatmaps through the lightweight temporal feature rapid screening model to obtain the heatmap spatial features.
Specifically, the second feature extraction of the heatmaps by the lightweight temporal feature rapid screening model comprises the following steps (a code sketch follows the list):
inputting the heatmaps through multiple channels corresponding to the video frame sequence;
applying two-dimensional adaptive average pooling, a fully connected layer and ReLU activation to the heatmaps to obtain the global heatmap temporal features;
applying a first grouped convolution, batch normalization, ReLU activation and a second grouped convolution to the heatmaps along the temporal dimension to obtain the local heatmap temporal features;
adding the global heatmap temporal features and the local heatmap temporal features to obtain the summed heatmap temporal features;
activating the summed heatmap temporal features with a Sigmoid activation function to generate the frame weights corresponding to them;
and multiplying the frame weights element-wise with the input heatmaps to obtain the heatmap spatial features.
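A minimal PyTorch sketch of this module is given below, for heatmaps of shape (B, T, C, H, W) (batch, frames, key point channels, height, width). The two-branch structure follows the steps above; the kernel sizes, the C-to-C fully connected width, and averaging the summed feature over C to obtain a single weight per frame are assumptions.

```python
# Sketch of the lightweight temporal feature rapid screening model (step S300).
import torch
import torch.nn as nn

class TemporalRapidScreening(nn.Module):
    def __init__(self, num_keypoints: int):
        super().__init__()
        c = num_keypoints
        self.pool = nn.AdaptiveAvgPool2d(1)            # 2-D pooling over (H, W)
        self.global_fc = nn.Sequential(nn.Linear(c, c), nn.ReLU())
        self.local_conv = nn.Sequential(               # along the frame axis
            nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c),  # 1st grouped conv
            nn.BatchNorm1d(c),
            nn.ReLU(),
            nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c),  # 2nd grouped conv
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        s = self.pool(x.reshape(b * t, c, h, w)).reshape(b, t, c)
        glo = self.global_fc(s)                                  # global feature
        loc = self.local_conv(s.transpose(1, 2)).transpose(1, 2)  # local feature
        weights = torch.sigmoid(glo + loc).mean(dim=2)           # one weight per frame
        return x * weights.view(b, t, 1, 1, 1)                   # reweight each frame

model = TemporalRapidScreening(num_keypoints=42)
out = model(torch.randn(2, 16, 42, 64, 64))                      # (2, 16, 42, 64, 64)
```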
In this embodiment, a lightweight temporal feature rapid screening model with learnable parameters adds a corresponding frame weight to the heatmap information of each frame along the sequence dimension of the temporal features before the data is input into the bidirectional LSTM temporal model with an attention mechanism; this improves the learning capability of the model more efficiently and improves the effect of sign language recognition.
Step S400: performing spatial feature screening of human key point information on the heatmap spatial features to obtain the human key point spatial features.
Specifically, the spatial feature screening of human key point information on the heatmap spatial features comprises the following steps (a code sketch follows the list):
merging the temporal dimension T of the heatmap spatial feature of size B×T×C×H×W with the human key point channel dimension C to obtain a heatmap spatial feature of size B×(T×C)×H×W, where B denotes the number of data items processed at once during training or inference, H the heatmap height and W the heatmap width;
applying a first grouped convolution, grouped by the human key point channel dimension C, to the heatmap spatial feature of size B×(T×C)×H×W to obtain a heatmap spatial feature of size B×C×H×W;
applying batch normalization, ReLU activation and a second grouped convolution grouped by C to the heatmap spatial feature of size B×C×H×W to obtain the local heatmap spatial features;
applying adaptive average pooling and convolution to the heatmap spatial feature of size B×(T×C)×H×W to obtain the global heatmap spatial features;
multiplying the local heatmap spatial features and the global heatmap spatial features element-wise to obtain the multiplied heatmap spatial features;
and passing the multiplied heatmap spatial features through a third grouped convolution, grouped by C, and a Mish activation function to obtain the human key point spatial features.
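The sketch below mirrors these steps. The grouped convolutions use the key point dimension C as the number of groups, and the channels are ordered so that each group receives all T frames of one key point; the kernel sizes and pooling the global branch to 1×1 (so the element-wise product broadcasts over H×W) are assumptions.

```python
# Sketch of the human key point spatial feature screening (step S400).
import torch
import torch.nn as nn

class KeypointSpatialScreening(nn.Module):
    def __init__(self, t: int, c: int):
        super().__init__()
        # First grouped conv: B x (T*C) x H x W -> B x C x H x W, grouped by C.
        self.conv1 = nn.Conv2d(t * c, c, kernel_size=3, padding=1, groups=c)
        # Local branch: batch normalization, ReLU, second grouped conv.
        self.local = nn.Sequential(
            nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c),
        )
        # Global branch: adaptive average pooling and a 1x1 convolution.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(t * c, c, kernel_size=1),
        )
        # Third grouped conv followed by the Mish activation.
        self.out = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c), nn.Mish(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Order channels so that each of the C groups holds all T frames of
        # one key point, then merge T into the channel dimension.
        merged = x.permute(0, 2, 1, 3, 4).reshape(b, c * t, h, w)
        local = self.local(self.conv1(merged))   # local heatmap spatial feature
        glob = self.global_branch(merged)        # global feature, (B, C, 1, 1)
        return self.out(local * glob)            # human key point features

model = KeypointSpatialScreening(t=16, c=42)
out = model(torch.randn(2, 16, 42, 64, 64))      # (2, 42, 64, 64)
```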
Step S500: performing feature learning on the human key point spatial features through the bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result output by the model.
Specifically, the human key point spatial features obtained in step S400 are randomly deactivated with a probability of 33% by a Dropout function, reduced in dimension through a fully connected layer, and then passed through the bidirectional LSTM temporal model with an attention mechanism for feature learning, yielding the sign language video learning result output by that model.
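A minimal sketch of this step follows, with the 33% Dropout as stated. The flattened per-frame feature dimension, the hidden sizes and the additive attention form are assumptions; the patent fixes only the order: Dropout, fully connected dimension reduction, bidirectional LSTM with attention.

```python
# Sketch of step S500: Dropout, dimension reduction, attention-pooled BiLSTM.
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    def __init__(self, in_dim: int, reduced: int = 256, hidden: int = 128):
        super().__init__()
        self.drop = nn.Dropout(p=0.33)                 # 33% random deactivation
        self.reduce = nn.Linear(in_dim, reduced)       # FC dimension reduction
        self.lstm = nn.LSTM(reduced, hidden, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each time step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, in_dim) per-frame key point features, flattened per frame.
        h, _ = self.lstm(self.reduce(self.drop(x)))    # (B, T, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)         # attention over frames
        return (a * h).sum(dim=1)                      # (B, 2*hidden) summary

model = AttnBiLSTM(in_dim=1024)
result = model(torch.randn(2, 16, 1024))               # (2, 256)
```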
Step S600: classifying and coding the sign language video learning result through a fully connected layer and a softmax layer to obtain the sign language video classification coding result.
Specifically, the sign language video learning result obtained in step S500 is classified and coded through the fully connected layer and the softmax layer to obtain the sign language video classification coding result, which contains the one-hot coding of the sign language video classification category.
Step S700: querying a sign language vocabulary recognition result according to the sign language video classification coding result.
Specifically, querying the sign language vocabulary recognition result from the sign language video classification coding result comprises the following steps (a code sketch follows):
generating a sign language vocabulary coding table, namely a vocabulary one-hot coding table, from the sign language texts corresponding to the sign language videos;
and looking up the one-hot code from the sign language video classification coding result in the sign language vocabulary coding table to obtain the predicted sign language vocabulary recognition result.
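As an illustration of steps S600 and S700 together, the sketch below classifies the learning result through a fully connected layer and a softmax layer, then looks the predicted one-hot position up in the vocabulary coding table. The 500-word vocabulary size follows the example given earlier, and the table entries are placeholders.

```python
# Sketch of steps S600-S700: classification coding and vocabulary lookup.
import torch
import torch.nn as nn

NUM_WORDS = 500                                  # example vocabulary size
classifier = nn.Sequential(nn.Linear(256, NUM_WORDS), nn.Softmax(dim=-1))

# Vocabulary one-hot coding table: class index -> word text (placeholders).
coding_table = {0: "hello", 1: "thanks"}

features = torch.randn(1, 256)                   # learning result from S500
probs = classifier(features)                     # (1, 500) class probabilities
pred = probs.argmax(dim=-1).item()               # position of the 1 in one-hot
word = coding_table.get(pred, "<unknown>")       # predicted vocabulary item
```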
In summary, this embodiment acquires a sign language video; inputs it into a trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video output by the model; and performs second feature extraction on the heatmaps through the lightweight temporal feature rapid screening model to obtain the heatmap spatial features. Extracting features from the heatmaps with this lightweight temporal screening model increases the speed of sign language recognition and reduces the amount of computation. The embodiment performs spatial feature screening of human key point information on the heatmap spatial features to obtain the human key point spatial features; performs feature learning on them through the bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result; classifies and codes that result through a fully connected layer and a softmax layer to obtain the sign language video classification coding result; and queries the sign language vocabulary recognition result from the classification coding result. The feature screening and learning improve the learning capability of the model, and performing feature extraction and feature learning on the heatmaps extracted from the sign language video, instead of directly extracting the coordinates of the human key point positions as feature information, increases the accuracy with which the action is embedded into the feature information and thus improves the accuracy of sign language recognition.
Referring to FIG. 5, an embodiment of the present invention further provides a deep learning based sign language vocabulary recognition system, which includes a data acquisition unit 100, a first feature extraction unit 200, a second feature extraction unit 300, a feature screening unit 400, a feature learning unit 500, a result acquisition unit 600 and a vocabulary recognition unit 700, wherein:
a data acquisition unit 100 for acquiring a sign language video;
a first feature extraction unit 200 for inputting the sign language video into a trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video output by the model;
a second feature extraction unit 300 for performing second feature extraction on the heatmaps through a lightweight temporal feature rapid screening model to obtain heatmap spatial features;
a feature screening unit 400 for performing spatial feature screening of human key point information on the heatmap spatial features to obtain human key point spatial features;
a feature learning unit 500 for performing feature learning on the human key point spatial features through a bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result output by the model;
a result acquisition unit 600 for classifying and coding the sign language video learning result through a fully connected layer and a softmax layer to obtain a sign language video classification coding result;
and a vocabulary recognition unit 700 for querying a sign language vocabulary recognition result according to the sign language video classification coding result.
It should be noted that, since the sign language vocabulary recognition system based on deep learning in the embodiment is based on the same inventive concept as the above-mentioned sign language vocabulary recognition method based on deep learning, the corresponding contents in the method embodiment are also applicable to the embodiment of the system, and will not be described in detail here.
The embodiment of the invention also provides sign language vocabulary recognition equipment based on deep learning, which comprises: at least one control processor and a memory for communicative connection with the at least one control processor.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the deep-learning-based sign language vocabulary recognition method of the above embodiments are stored in the memory and, when executed by the processor, perform that method, for example the method steps S100 to S700 in FIG. 1 described above.
The system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions which, when executed by one or more control processors, cause the one or more control processors to perform the deep-learning-based sign language vocabulary recognition method of the above method embodiments, for example the functions of method steps S100 to S700 in FIG. 1 described above.
One of ordinary skill in the art will appreciate that all or some of the steps, systems and methods disclosed above may be implemented as software, firmware, hardware and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, the invention is not limited to those precise embodiments, and various modifications and substitutions may be effected therein by one skilled in the art without departing from the scope of the invention.

Claims (10)

1. A sign language vocabulary recognition method based on deep learning, characterized by comprising:
acquiring a sign language video;
inputting the sign language video into a trained human pose estimation network model for first feature extraction, and obtaining the heatmaps in the sign language video output by the human pose estimation network model;
performing second feature extraction on the heatmaps through a lightweight temporal feature rapid screening model to obtain heatmap spatial features;
performing spatial feature screening of human key point information on the heatmap spatial features to obtain human key point spatial features;
performing feature learning on the human key point spatial features through a bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result output by the model;
performing classification and coding on the sign language video learning result through a fully connected layer and a softmax layer to obtain a sign language video classification coding result;
and querying a sign language vocabulary recognition result according to the sign language video classification coding result.
2. The deep-learning-based sign language vocabulary recognition method of claim 1, wherein before the sign language video is input into the trained human pose estimation network model for the first feature extraction, the method further comprises:
presetting an ideal frame number for the sign language video;
and if the number of frames in the sign language video is less than the ideal frame number, padding the video with blank frames up to the ideal frame number to obtain a processed sign language video, whose frame count is greater than or equal to the ideal frame number.
3. The deep-learning-based sign language vocabulary recognition method of claim 1, wherein the human key points are selected as a plurality of key points on the nose, eyes, ears, arms and fingers of the human body.
4. The deep-learning-based sign language vocabulary recognition method of claim 1, wherein performing second feature extraction on the heatmaps through the lightweight temporal feature rapid screening model to obtain the heatmap spatial features comprises:
inputting the heatmaps through multiple channels corresponding to the video frame sequence;
applying two-dimensional adaptive average pooling, a fully connected layer and ReLU activation to the heatmaps to obtain global heatmap temporal features;
applying a first grouped convolution, batch normalization, ReLU activation and a second grouped convolution to the heatmaps along the temporal dimension to obtain local heatmap temporal features;
adding the global heatmap temporal features and the local heatmap temporal features to obtain summed heatmap temporal features;
activating the summed heatmap temporal features with a Sigmoid activation function to generate the frame weights corresponding to them;
and multiplying the frame weights element-wise with the input heatmaps to obtain the heatmap spatial features.
5. The deep-learning-based sign language vocabulary recognition method of claim 4, wherein performing spatial feature screening of human key point information on the heatmap spatial features to obtain the human key point spatial features comprises:
merging the temporal dimension T of the heatmap spatial feature of size B×T×C×H×W with the human key point channel dimension C to obtain a heatmap spatial feature of size B×(T×C)×H×W, where B denotes the number of data items processed at once during training or inference, H the heatmap height and W the heatmap width;
applying a first grouped convolution, grouped by the human key point channel dimension C, to the heatmap spatial feature of size B×(T×C)×H×W to obtain a heatmap spatial feature of size B×C×H×W;
applying batch normalization, ReLU activation and a second grouped convolution grouped by C to the heatmap spatial feature of size B×C×H×W to obtain local heatmap spatial features;
applying adaptive average pooling and convolution to the heatmap spatial feature of size B×(T×C)×H×W to obtain global heatmap spatial features;
multiplying the local heatmap spatial features and the global heatmap spatial features element-wise to obtain multiplied heatmap spatial features;
and passing the multiplied heatmap spatial features through a third grouped convolution, grouped by C, and a Mish activation function to obtain the human key point spatial features.
6. The deep-learning-based sign language vocabulary recognition method of claim 1, wherein before the feature learning of the human key point spatial features through the bidirectional LSTM temporal model with an attention mechanism, the method further comprises:
applying Dropout random deactivation and fully connected dimension reduction to the human key point spatial features.
7. The deep-learning-based sign language vocabulary recognition method of claim 1, wherein querying a sign language vocabulary recognition result according to the sign language video classification coding result comprises:
generating a sign language vocabulary coding table from the sign language texts corresponding to the sign language videos;
and looking up the sign language video classification coding result in the sign language vocabulary coding table to obtain the predicted sign language vocabulary recognition result.
8. A deep learning based sign language vocabulary recognition system, characterized by comprising:
a data acquisition unit for acquiring a sign language video;
a first feature extraction unit for inputting the sign language video into a trained human pose estimation network model for first feature extraction to obtain the heatmaps in the sign language video output by the model;
a second feature extraction unit for performing second feature extraction on the heatmaps through a lightweight temporal feature rapid screening model to obtain heatmap spatial features;
a feature screening unit for performing spatial feature screening of human key point information on the heatmap spatial features to obtain human key point spatial features;
a feature learning unit for performing feature learning on the human key point spatial features through a bidirectional LSTM temporal model with an attention mechanism to obtain the sign language video learning result output by the model;
a result acquisition unit for classifying and coding the sign language video learning result through a fully connected layer and a softmax layer to obtain a sign language video classification coding result;
and a vocabulary recognition unit for querying a sign language vocabulary recognition result according to the sign language video classification coding result.
9. A sign language vocabulary recognition device based on deep learning, comprising at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the deep learning based sign language vocabulary recognition method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the deep learning based sign language vocabulary recognition method according to any one of claims 1 to 7.
CN202211500177.9A | Priority date: 2022-11-28 | Filing date: 2022-11-28 | Sign language vocabulary recognition method, system, device and medium based on deep learning | Status: Pending | CN115830711A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211500177.9A | 2022-11-28 | 2022-11-28 | Sign language vocabulary recognition method, system, device and medium based on deep learning (CN115830711A, en)

Publications (1)

Publication Number | Publication Date
CN115830711A (en) | 2023-03-21

Family

ID=85532057

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211500177.9A | Sign language vocabulary recognition method, system, device and medium based on deep learning | 2022-11-28 | 2022-11-28

Country Status (1)

Country | Link
CN | CN115830711A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN116705307A * | 2023-08-07 | 2023-09-05 | 天津云检医学检验所有限公司 | AI model-based heart function assessment method, system and storage medium for children

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination