CN112825125A - Sign language recognition method and device, computer storage medium and electronic equipment - Google Patents

Sign language recognition method and device, computer storage medium and electronic equipment

Info

Publication number
CN112825125A
Authority
CN
China
Prior art keywords
sign language
gesture image
recognition result
image
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911147063.9A
Other languages
Chinese (zh)
Inventor
孙孟哲
王佩琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd
Priority to CN201911147063.9A
Publication of CN112825125A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns

Abstract

The present disclosure relates to the technical field of artificial intelligence, and provides a sign language recognition method, a sign language recognition apparatus, a computer storage medium, and an electronic device. The sign language recognition method comprises: extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm, and acquiring a feature vector corresponding to the gesture image; inputting the feature vector into a trained sign language recognition model, and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model; and processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image. The method both reduces the cost of sign language recognition and improves recognition accuracy.

Description

Sign language recognition method and device, computer storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a sign language recognition method, a sign language recognition apparatus, a computer storage medium, and an electronic device.
Background
With the rapid development of internet and computer technologies, the field of sign language recognition is also advancing rapidly. Sign language recognition uses computer vision techniques to compare a captured gesture image against known sign language images in an image library and retrieve the meaning the gesture expresses. Combined with technologies such as speech recognition, natural language processing, and voice broadcasting, it can be applied in social scenarios to enable deaf-mute people to communicate normally with others.
At present, related methods generally capture the hand movements of deaf-mute users by means of wearable devices (such as wearable bracelets, gloves, and the like) to realize sign language recognition. However, on the one hand, such equipment is burdensome to operate and costly to use; on the other hand, some sign language words must be accompanied by mouth or shoulder movements, which these devices cannot capture completely, so adaptability is poor and recognition accuracy is low.
In view of the above, there is a need in the art to develop a new sign language recognition method and apparatus.
It is to be noted that the information disclosed in the background section above is only used to enhance understanding of the background of the present disclosure.
Disclosure of Invention
The present disclosure provides a sign language recognition method, a sign language recognition apparatus, a computer storage medium, and an electronic device, so as to overcome, at least to a certain extent, the low recognition accuracy of prior-art methods.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided a sign language recognition method, including: extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm, and acquiring a feature vector corresponding to the gesture image; inputting the feature vector into a trained sign language recognition model, and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model; and processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
In an exemplary embodiment of the present disclosure, after obtaining a text recognition result corresponding to the gesture image, the method further includes: and matching the text recognition result with a preset standard text, and taking the successfully matched standard text as a final recognition result of the gesture image.
In an exemplary embodiment of the present disclosure, after extracting a gesture image included in a sign language video to be recognized based on a target detection algorithm, the method further includes: classifying the gesture images according to an image classification algorithm to obtain an image classification result; and when the image classification result is a non-service type, discarding the gesture image.
In an exemplary embodiment of the present disclosure, the method further comprises: when the image classification result is the service type, performing binarization processing on the gesture image to obtain a binarized image; and carrying out morphological processing on the obtained binarized image.
In an exemplary embodiment of the present disclosure, the method further comprises: adjusting the size of a file corresponding to the sign language video to be recognized into a first target numerical value; adjusting the video frame rate corresponding to the sign language video to be identified to be a second target numerical value; and converting the video format corresponding to the sign language video to be identified into a target format.
In an exemplary embodiment of the present disclosure, the method further comprises: acquiring a gesture image sample and label information corresponding to the gesture image sample, the label information identifying the semantic text corresponding to the gesture image sample; and training a machine learning model according to the gesture image sample and the label information to obtain the sign language recognition model.
In an exemplary embodiment of the present disclosure, the processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image includes: and performing duplication removal, error correction and/or disambiguation on the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
According to a second aspect of the present disclosure, there is provided a sign language recognition apparatus including: the extraction module is used for extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm and acquiring a feature vector corresponding to the gesture image; the recognition module is used for inputting the feature vector into a trained sign language recognition model and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model; and the processing module is used for processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
According to a third aspect of the present disclosure, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the sign language recognition method of the first aspect described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the sign language recognition method of the first aspect via execution of the executable instructions.
As can be seen from the foregoing technical solutions, the sign language recognition method, the sign language recognition apparatus, the computer storage medium and the electronic device in the exemplary embodiments of the present disclosure have at least the following advantages and positive effects:
in the technical solutions provided in some embodiments of the present disclosure, on one hand, a gesture image contained in a sign language video to be recognized is extracted based on a target detection algorithm, and a feature vector corresponding to the gesture image is acquired; this removes redundant data from the video to be recognized, reduces the data processing load, increases processing speed, and digitizes the image information to facilitate subsequent recognition. Furthermore, the feature vectors are input into a trained sign language recognition model, and a sign language recognition result corresponding to the gesture image is obtained from the model's output, which improves gesture recognition accuracy. On the other hand, the sign language recognition result is processed according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image, so that the recognition result better fits the actual language environment, improving text readability and the efficiency with which relevant personnel obtain information.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 is a flow diagram illustrating a sign language identification method in an exemplary embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a sign language identification method in another exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a sign language recognition method in yet another exemplary embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a sign language recognition apparatus in an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a computer storage medium in an exemplary embodiment of the disclosure;
fig. 6 shows a schematic structural diagram of an electronic device in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The terms "a," "an," "the," and "said" are used in this specification to denote the presence of one or more elements/components/parts/etc.; the terms "comprising" and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of their objects.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
At present, related methods generally capture the hand movements of deaf-mute users by means of wearable devices (such as wearable bracelets, gloves, and the like) to realize sign language recognition. However, on the one hand, such equipment is burdensome to operate and costly to use; on the other hand, some sign language words must be accompanied by mouth or shoulder movements, which these devices cannot capture completely, so adaptability is poor.
In the embodiments of the present disclosure, a sign language recognition method is first provided, which overcomes, at least to some extent, the relatively high cost of the sign language recognition methods provided in the prior art.
Fig. 1 is a flowchart illustrating a sign language recognition method according to an exemplary embodiment of the present disclosure; the execution subject of the method may be a server that performs sign language recognition.
Referring to fig. 1, a sign language recognition method according to one embodiment of the present disclosure includes the steps of:
step S110, extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm, and acquiring a feature vector corresponding to the gesture image;
step S120, inputting the feature vector into the trained sign language recognition model, and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model;
and step S130, processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
In the technical solution provided in the embodiment shown in fig. 1, on one hand, the gesture image contained in the sign language video to be recognized is extracted based on the target detection algorithm and the corresponding feature vector is acquired; this removes redundant data from the video, reduces the data processing load, increases processing speed, and digitizes the image information for subsequent recognition. Furthermore, the feature vectors are input into the trained sign language recognition model, and the sign language recognition result corresponding to the gesture image is obtained from the model's output, improving gesture recognition accuracy. On the other hand, the sign language recognition result is processed according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image, so that the recognition result better fits the actual language environment, improving text readability and the efficiency with which relevant personnel obtain information.
The following describes the specific implementation of each step in fig. 1 in detail:
in an exemplary embodiment of the present disclosure, an exemplary application scenario is: related users (deaf-dumb people) go to places such as banks and hospitals to handle business, and the places such as the banks and the hospitals only need to be provided with a video acquisition device (such as a camera, a camera and the like), and the content which the users want to express can be identified by acquiring sign language videos of the users.
In an exemplary embodiment of the disclosure, a sign language video is obtained when, for example, a user goes to a bank to transact business: the bank's video acquisition device records the series of sign language actions the user makes to describe the business to be transacted. Illustratively, the sign language video of the user can be acquired through a video capture device such as a camera or a video camera, and this video is then taken as the sign language video to be recognized. Illustratively, a preset time interval can be set to periodically acquire the user's sign language video, for example, fetching the video collected by the video acquisition device every 5 s. Thus, when no user is in front of the bank counter, the camera's working resources can be saved; alternatively, the camera may stay in a dormant or powered-off state and, using infrared sensing, be switched into working mode only when a user is detected in front of the counter.
In an exemplary embodiment of the present disclosure, after the sign language video to be recognized is obtained, the file size corresponding to the sign language video may be adjusted to a first target value. Specifically, a parameter of the GPU (Graphics Processing Unit) used in actual processing may serve as the basis for the adjustment; for example, the file size may be adjusted according to the GPU's capacity or the size of its processing threads. For instance, when the processing capacity of the GPU is 1024 MB and the file size of the sign language video is 600 MB, the file size may be adjusted to a first target value of 300 MB, so as to increase the processing speed of the GPU.
In an exemplary embodiment of the present disclosure, after the file size of the sign language video is adjusted, the video frame rate of the sign language video may be adjusted to a second target value. The frame rate is the frequency at which consecutive bitmap images (frames) appear on a display, i.e., the number of frames the graphics processor can update per second.
Generally, the video frame rate of the acquired sign language video is 30 to 60 frames per second, and the adjusted frame rate (the range of the second target value) may be 10 to 15 frames per second. Specifically, the frame rate can be adjusted according to how quickly the user signs. When the user signs quickly (for example, faster than 5 sign language actions per second), the frame rate can be set to a smaller value (for example, a second target value of 10 fps, displaying 10 frames per second), so that less of the video is skipped, the completeness of the video is preserved, and actions are not omitted. When the user signs slowly (for example, slower than 1 sign language action per second), the frame rate can be set to a larger value (for example, a second target value of 15 fps, displaying 15 frames per second), so that the lingering portions of the same sign language action can be skipped over, ensuring the video's playback and processing speed.
In an exemplary embodiment of the present disclosure, after the video frame rate of the sign language video is adjusted, the video format corresponding to the sign language video may be converted into a target format. Different video capture devices produce different formats, such as AVI (Audio Video Interleaved), MPEG (Moving Picture Experts Group), MOV, MKV, and MP4; the collected videos can therefore be uniformly converted into a target format recognizable by the GPU, for example, MP4. It should be noted that the specific target format may be set according to the actual situation, and any such choice falls within the protection scope of the present disclosure.
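Taken together, the three adjustments above amount to a standard transcoding pass. The following Python sketch illustrates one way to perform them by driving the ffmpeg command-line tool; the target values (300 MB, 12 fps, MP4), the file names, and the fixed duration are illustrative assumptions, not values prescribed by this disclosure.

```python
import subprocess

def preprocess_video(src_path: str,
                     dst_path: str = "preprocessed.mp4",  # target format via extension
                     target_size_mb: float = 300,
                     target_fps: int = 12,
                     duration_s: float = 60.0) -> None:
    """Adjust file size, frame rate, and container format of a sign language video.

    duration_s would normally be probed (e.g., with ffprobe) rather than assumed.
    """
    # Average video bitrate that lands near the target file size:
    # size (bits) / duration (s) = bitrate (bits/s). Audio overhead is ignored.
    target_bitrate = int(target_size_mb * 8 * 1024 * 1024 / duration_s)
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", src_path,
         "-b:v", str(target_bitrate),  # first target value: file size via bitrate
         "-r", str(target_fps),        # second target value: frame rate
         dst_path],
        check=True,
    )
```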
With continued reference to fig. 1, in step S110, a gesture image included in the sign language video to be recognized is extracted based on a target detection algorithm, and a feature vector corresponding to the gesture image is obtained.
In an exemplary embodiment of the present disclosure, after the sign language video is acquired and preprocessed as described above, a plurality of gesture images contained in it may be extracted based on a target detection algorithm. Specifically, the gesture images can be extracted with a Single Shot MultiBox Detector (SSD), so that the hand information contained in the video is framed, improving the focus and efficiency of subsequent recognition.
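As a hedged illustration of this step, the sketch below runs an SSD detector over each video frame with OpenCV's DNN module and crops the detected hand regions. The model file names and the 0.5 confidence threshold are assumptions; the disclosure does not fix a particular SSD implementation.

```python
import cv2

# Hypothetical model files for a hand-detecting SSD; any SSD trained to
# detect hands would play the same role here.
net = cv2.dnn.readNetFromCaffe("ssd_hand.prototxt", "ssd_hand.caffemodel")

def extract_gesture_images(video_path: str, conf_threshold: float = 0.5):
    """Yield cropped hand regions from each frame of the sign language video."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        # SSD expects a fixed-size, mean-subtracted input blob (300x300 here).
        blob = cv2.dnn.blobFromImage(frame, 1.0, (300, 300), (104, 117, 123))
        net.setInput(blob)
        detections = net.forward()  # shape: (1, 1, N, 7)
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence > conf_threshold:
                box = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
                x1, y1, x2, y2 = box
                yield frame[max(0, y1):y2, max(0, x1):x2]  # framed hand region
    cap.release()
```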
In an exemplary embodiment of the present disclosure, referring to fig. 2, fig. 2 shows a flowchart of a sign language recognition method in another exemplary embodiment of the present disclosure, specifically shows a flowchart of acquiring a gesture image to be recognized, which includes steps S201 to S202, and the following explains a specific implementation manner with reference to fig. 2.
In step S201, the gesture images are classified according to an image classification algorithm to obtain an image classification result.
In an exemplary embodiment of the present disclosure, after the gesture images to be recognized are acquired, they may be classified based on an image classification algorithm, or based on an image classification model trained in advance with such an algorithm (for example, a retrained Inception v3 model), so as to obtain an image classification result. Specifically, the image classification result may be a service type or a non-service type.
In an exemplary embodiment of the present disclosure, a service is a transaction the user needs to handle, and the service type is the type corresponding to that transaction. For example, in a banking scenario, the service types may be: deposit, withdrawal, or loan. In a hospital scenario, they may be: registration, payment, and so on. A non-service type corresponds to an image of a sign language action unrelated to the business, such as: thank you, goodbye, and the like.
In an exemplary embodiment of the present disclosure, it should be noted that the image classification algorithm may be the KNN (K-Nearest Neighbors) algorithm, the SVM (Support Vector Machine) algorithm, and the like; it may be chosen according to the actual situation, and any such choice falls within the protection scope of the present disclosure.
In step S202, when the image classification result is a non-service type, the gesture image is discarded.
In an exemplary embodiment of the present disclosure, when the image classification result is determined to be a non-service type, the gesture image may be discarded. Discarding such invalid images leaves the important information in the gesture images to be extracted and improves subsequent recognition efficiency.
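A minimal sketch of this classify-and-filter stage (steps S201 and S202) follows, assuming a KNN classifier, one of the algorithms named above, trained on flattened, equally sized gesture images with a binary service/non-service labeling; both assumptions are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

SERVICE, NON_SERVICE = 1, 0  # assumed binary labeling

def train_filter(train_images, train_labels, k: int = 5) -> KNeighborsClassifier:
    """train_images: equally sized gesture images; train_labels: 0/1 per image."""
    X = np.asarray(train_images).reshape(len(train_images), -1)  # flatten pixels
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, train_labels)
    return clf

def keep_service_images(clf, gesture_images):
    """Step S201: classify each image; step S202: discard non-service ones."""
    X = np.asarray(gesture_images).reshape(len(gesture_images), -1)
    predictions = clf.predict(X)
    return [img for img, p in zip(gesture_images, predictions) if p == SERVICE]
```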
In an exemplary embodiment of the present disclosure, when the image classification result corresponding to the gesture images is a service type, the gesture images may be binarized to obtain binarized images. Specifically, image binarization sets the gray value of each pixel to 0 or 255, giving the whole image an unmistakable black-and-white appearance and retaining the region of interest to the greatest extent.
In an exemplary embodiment of the present disclosure, after the binarized images are obtained, morphological processing may be performed on them. Specifically, this can be done with OpenCV, a cross-platform computer vision library released under the BSD license that provides interfaces for Python, C++, Java, Matlab, and other languages and integrates many computer vision algorithms; it supports multi-core processing and can thus increase processing speed. The images obtained after the morphological processing may then be used as the gesture images to be recognized.
In exemplary embodiments of the present disclosure, the morphological processing may include erosion and dilation. Erosion eliminates connected boundaries and shrinks them inward, which separates objects that are stuck together and removes small particle noise. Dilation merges points adjacent to the target into it and expands the target outward; it can rejoin broken targets, making it easier to extract the whole target. Together, erosion and dilation make the hand action more prominent and improve image clarity, thereby improving accuracy in subsequent recognition.
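Both steps map directly onto standard OpenCV calls. A minimal sketch follows; the Otsu threshold and the 3x3 structuring element are illustrative choices, not mandated by the disclosure.

```python
import cv2
import numpy as np

def binarize_and_clean(gesture_image: np.ndarray) -> np.ndarray:
    """Binarize a service-type gesture image, then erode and dilate it."""
    gray = cv2.cvtColor(gesture_image, cv2.COLOR_BGR2GRAY)
    # Every pixel becomes 0 or 255; Otsu picks the threshold from the histogram.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)  # assumed 3x3 structuring element
    eroded = cv2.erode(binary, kernel, iterations=1)    # removes particle noise
    cleaned = cv2.dilate(eroded, kernel, iterations=1)  # rejoins broken targets
    return cleaned
```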
In an exemplary embodiment of the present disclosure, after the morphological processing of the binarized images, the feature vectors corresponding to the gesture images may be acquired. For example, the gesture images may be processed with a neural network or similar algorithm to obtain their feature vectors. The image features are thereby digitized, which facilitates the subsequent recognition processing.
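The disclosure leaves the feature extractor open ("a neural network algorithm or the like"). One common possibility is sketched below: a pre-trained CNN with its classification head removed maps each gesture image to a fixed-length feature vector. The choice of MobileNetV2 and the 224x224 input size are assumptions.

```python
import cv2
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

# Classification head removed; global average pooling yields a 1280-d vector.
extractor = MobileNetV2(include_top=False, pooling="avg",
                        input_shape=(224, 224, 3), weights="imagenet")

def to_feature_vector(gesture_image: np.ndarray) -> np.ndarray:
    """Digitize one gesture image into a fixed-length feature vector."""
    if gesture_image.ndim == 2:  # binarized/grayscale input: replicate channels
        gesture_image = cv2.cvtColor(gesture_image, cv2.COLOR_GRAY2BGR)
    img = cv2.resize(gesture_image, (224, 224)).astype(np.float32)
    img = preprocess_input(img[np.newaxis, ...])  # scale to the network's range
    return extractor.predict(img, verbose=0)[0]   # shape: (1280,)
```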
With reference to fig. 1, in step S120, the feature vector is input into the trained sign language recognition model, and a sign language recognition result corresponding to the gesture image is obtained according to the output of the sign language recognition model.
In an exemplary embodiment of the present disclosure, the sign language recognition model may be obtained by training in advance. Specifically, referring to fig. 3, fig. 3 shows a flowchart of a sign language recognition method in yet another exemplary embodiment of the present disclosure, and in particular a flowchart for training the sign language recognition model, comprising steps S301 to S302. A specific implementation is described below with reference to fig. 3.
In step S301, a gesture image sample and label information corresponding to the gesture image sample are obtained; the label information is used for identifying semantic texts corresponding to the gesture image samples.
In an exemplary embodiment of the present disclosure, the gesture image samples may be a large number of gesture images obtained in advance, together with the label information corresponding to each sample; the label information identifies the semantic text corresponding to the gesture image sample. For example, the label information for the gesture image sample "make a fist and extend the thumb" may be "withdraw money", and the label information for "make a fist and extend the index finger toward the other party" may be "deposit money", and so on.
In step S302, a machine learning model is trained according to the gesture image samples and the label information to obtain a sign language recognition model.
In an exemplary embodiment of the present disclosure, after the gesture image samples and label information are obtained, a machine learning model may be trained on them. Specifically, the samples and labels may be input into a machine learning model such as an RNN (Recurrent Neural Network) or an LSTM (Long Short-Term Memory) network, and its parameters adjusted over many iterations until the loss function tends to converge, yielding the sign language recognition model.
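A minimal Keras sketch of steps S301 to S302 follows, treating each gesture image sample as a sequence of per-frame feature vectors; the sequence length, feature dimension, vocabulary size, and training hyperparameters are placeholder assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN, FEAT_DIM, NUM_WORDS = 30, 1280, 50  # placeholder shapes

def build_sign_language_model() -> Sequential:
    model = Sequential([
        LSTM(128, input_shape=(SEQ_LEN, FEAT_DIM)),  # consumes the frame sequence
        Dense(NUM_WORDS, activation="softmax"),      # one unit per sign word
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# X: (num_samples, SEQ_LEN, FEAT_DIM) feature-vector sequences built from the
# gesture image samples; y: integer ids of labels such as "withdraw money".
# Training is repeated until the loss tends to converge:
# model = build_sign_language_model()
# model.fit(X, y, epochs=20, validation_split=0.1)
```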
In an exemplary embodiment of the present disclosure, after the sign language recognition model is trained, the feature vectors may be input into it, so that they are recognized by the model, and the sign language recognition result corresponding to the gesture image is obtained from the model's output.
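Continuing the sketch above, inference then reduces to a forward pass and an argmax over the word vocabulary (id_to_word is an assumed mapping from label ids back to semantic texts):

```python
import numpy as np

def recognize(model, feature_vectors: np.ndarray, id_to_word: dict) -> str:
    """feature_vectors: one (SEQ_LEN, FEAT_DIM) sequence from step S110."""
    probs = model.predict(feature_vectors[np.newaxis, ...], verbose=0)[0]
    return id_to_word[int(probs.argmax())]  # e.g. "withdraw money"
```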
In step S130, the sign language recognition result is processed according to the natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
In an exemplary embodiment of the disclosure, after obtaining a sign language recognition result corresponding to the gesture image, the sign language recognition result may be processed according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
In the exemplary embodiment of the present disclosure, natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence that enables effective communication between humans and computers using natural language.
In an exemplary embodiment of the disclosure, specifically, the sign language recognition result may be subjected to deduplication, error correction, and/or disambiguation processing according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
In an exemplary embodiment of the present disclosure, deduplication removes repeated words from the sign language recognition result. For example, when the sign language recognition result is "i i want to withdraw money", deduplication based on the natural language processing algorithm yields the text recognition result "i want to withdraw money".
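The deduplication step can be illustrated in a few lines of Python that drop immediately repeated words from the recognized word sequence (a simplification; a production NLP pipeline would also use language-model context):

```python
def deduplicate(words):
    """Drop immediately repeated words: ['i', 'i', 'want'] -> ['i', 'want']."""
    result = []
    for word in words:
        if not result or word != result[-1]:
            result.append(word)
    return result

# deduplicate("i i want to withdraw money".split())
# -> ['i', 'want', 'to', 'withdraw', 'money']
```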
In an exemplary embodiment of the present disclosure, error correction corrects erroneous information in the sign language recognition result. For example, when a word in the sign language recognition result displayed on an intermediate processing page (the display page shown before final output) has been misrecognized, error correction based on the natural language processing algorithm replaces it with the intended word to obtain the text recognition result.
In an exemplary embodiment of the present disclosure, disambiguation resolves ambiguous text in the sign language recognition result. For example, a recognition result such as "i want to save money" can be read either as depositing money or as economizing; disambiguation based on the natural language processing algorithm selects the reading that fits the context (here, depositing money at a bank) to obtain the text recognition result.
In an exemplary embodiment of the disclosure, after the sign language recognition result has been processed by natural language processing into a text recognition result corresponding to the gesture image, the text recognition result may be matched against preset standard texts, and the successfully matched standard text taken as the final recognition result of the gesture image. For example, when the text recognition result is "i may not save money", it can be matched against the preset standard texts; if the successfully matched standard text is "i want to save money", then "i want to save money" is determined to be the final recognition result. The recognition result thus better conforms to the corresponding service scenario.
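A minimal sketch of this matching step using the standard-library difflib; the list of standard texts and the 0.6 similarity cutoff are illustrative assumptions.

```python
import difflib

STANDARD_TEXTS = ["i want to deposit money",
                  "i want to withdraw money",
                  "i want to apply for a loan"]  # assumed service phrases

def match_standard_text(text_result: str, cutoff: float = 0.6):
    """Return the best-matching standard text, or None if nothing is close."""
    matches = difflib.get_close_matches(text_result, STANDARD_TEXTS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

# match_standard_text("i want withdraw money")  ->  "i want to withdraw money"
```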
The present disclosure also provides a sign language recognition apparatus, and fig. 4 shows a schematic structural diagram of the sign language recognition apparatus in an exemplary embodiment of the present disclosure; as shown in fig. 4, the sign language recognition apparatus 400 may include an extraction module 401, a recognition module 402, and a processing module 403. Wherein:
the extracting module 401 is configured to extract a gesture image included in the sign language video to be recognized based on a target detection algorithm, and obtain a feature vector corresponding to the gesture image.
In an exemplary embodiment of the disclosure, the extraction module is configured to classify the gesture image according to an image classification algorithm to obtain an image classification result; and, when the image classification result is a non-service type, discard the gesture image.
In an exemplary embodiment of the disclosure, the extraction module is configured to, when the image classification result is a service type, perform binarization processing on the gesture image to obtain a binarized image; and perform morphological processing on the obtained binarized image.
In an exemplary embodiment of the disclosure, the extraction module is configured to adjust the file size corresponding to the sign language video to be recognized to a first target value; adjust the video frame rate corresponding to the sign language video to be recognized to a second target value; and convert the video format corresponding to the sign language video to be recognized into a target format.
And the recognition module 402 is configured to input the feature vector into the trained sign language recognition model, and obtain a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model.
In an exemplary embodiment of the present disclosure, the recognition module is configured to obtain a gesture image sample and label information corresponding to the gesture image sample; the label information is used for identifying semantic texts corresponding to the gesture image samples; and training a machine learning model according to the gesture image samples and the label information to obtain a sign language recognition model.
And the processing module 403 is configured to process the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
In an exemplary embodiment of the disclosure, the processing module is configured to match the text recognition result with a preset standard text, and use the standard text successfully matched as a final recognition result of the gesture image.
In an exemplary embodiment of the disclosure, the processing module is configured to perform deduplication, error correction, and/or disambiguation processing on the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
The specific details of each module in the sign language recognition device have been described in detail in the corresponding sign language recognition method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer storage medium capable of implementing the above method. On which a program product capable of implementing the above-described method of the present specification is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 5, a program product 500 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
An electronic device 600 according to this embodiment of the disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, and a bus 630 that couples the various system components including the memory unit 620 and the processing unit 610.
Wherein the storage unit stores program code that is executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present disclosure as described in the above section "exemplary methods" of this specification. For example, the processing unit 610 may perform the following as shown in fig. 1: step S110, extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm, and acquiring a feature vector corresponding to the gesture image; step S120, inputting the feature vector into a trained sign language recognition model, and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model; and step S130, processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A sign language recognition method, comprising:
extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm, and acquiring a feature vector corresponding to the gesture image;
inputting the feature vector into a trained sign language recognition model, and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model;
and processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
2. The method according to claim 1, wherein after obtaining the text recognition result corresponding to the gesture image, the method further comprises:
and matching the text recognition result with a preset standard text, and taking the successfully matched standard text as a final recognition result of the gesture image.
3. The method according to claim 1 or 2, wherein after extracting gesture images contained in the sign language video to be recognized based on an object detection algorithm, the method further comprises:
classifying the gesture images according to an image classification algorithm to obtain an image classification result;
and when the image classification result is a non-service type, discarding the gesture image.
4. The method of claim 3, further comprising:
when the image classification result is the service type, performing binarization processing on the gesture image to obtain a binarized image;
and carrying out morphological processing on the obtained binarized image.
5. The method of claim 4, further comprising:
adjusting the size of a file corresponding to the sign language video to be recognized into a first target numerical value;
adjusting the video frame rate corresponding to the sign language video to be identified to be a second target numerical value;
and converting the video format corresponding to the sign language video to be identified into a target format.
6. The method of claim 1, further comprising:
acquiring a gesture image sample and label information corresponding to the gesture image sample; the label information is used for identifying semantic texts corresponding to the gesture image samples;
and training a machine learning model according to the gesture image sample and the label information to obtain the sign language recognition model.
7. The method according to claim 1, wherein the processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image comprises:
and performing duplication removal, error correction and/or disambiguation on the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
8. A sign language recognition apparatus, comprising:
the extraction module is used for extracting a gesture image contained in a sign language video to be recognized based on a target detection algorithm and acquiring a feature vector corresponding to the gesture image;
the recognition module is used for inputting the feature vector into a trained sign language recognition model and obtaining a sign language recognition result corresponding to the gesture image according to the output of the sign language recognition model;
and the processing module is used for processing the sign language recognition result according to a natural language processing algorithm to obtain a text recognition result corresponding to the gesture image.
9. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a sign language recognition method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the sign language identification method of any one of claims 1 to 7 via execution of the executable instructions.
CN201911147063.9A 2019-11-21 2019-11-21 Sign language recognition method and device, computer storage medium and electronic equipment Pending CN112825125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911147063.9A CN112825125A (en) 2019-11-21 2019-11-21 Sign language recognition method and device, computer storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911147063.9A CN112825125A (en) 2019-11-21 2019-11-21 Sign language recognition method and device, computer storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112825125A 2021-05-21

Family

ID=75907192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911147063.9A Pending CN112825125A (en) 2019-11-21 2019-11-21 Sign language recognition method and device, computer storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112825125A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101605399A (en) * 2008-06-13 2009-12-16 英华达(上海)电子有限公司 A kind of portable terminal and method that realizes Sign Language Recognition
CN108846378A (en) * 2018-07-03 2018-11-20 百度在线网络技术(北京)有限公司 Sign Language Recognition processing method and processing device
CN109359538A (en) * 2018-09-14 2019-02-19 广州杰赛科技股份有限公司 Training method, gesture identification method, device and the equipment of convolutional neural networks
US10289903B1 (en) * 2018-02-12 2019-05-14 Avodah Labs, Inc. Visual sign language translation training device and method
CN109857848A (en) * 2019-01-18 2019-06-07 深圳壹账通智能科技有限公司 Interaction content generation method, device, computer equipment and storage medium
CN110070065A (en) * 2019-04-30 2019-07-30 李冠津 The sign language systems and the means of communication of view-based access control model and speech-sound intelligent
CN110348420A (en) * 2019-07-18 2019-10-18 腾讯科技(深圳)有限公司 Sign Language Recognition Method, device, computer readable storage medium and computer equipment
CN110363077A (en) * 2019-06-05 2019-10-22 平安科技(深圳)有限公司 Sign Language Recognition Method, device, computer installation and storage medium


Similar Documents

Publication Publication Date Title
WO2021203863A1 (en) Artificial intelligence-based object detection method and apparatus, device, and storage medium
US11436863B2 (en) Method and apparatus for outputting data
US20190087686A1 (en) Method and apparatus for detecting human face
US11138903B2 (en) Method, apparatus, device and system for sign language translation
GB2555136A (en) A method for analysing media content
CN112101329B (en) Video-based text recognition method, model training method and model training device
US20230267735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
EP3872652A2 (en) Method and apparatus for processing video, electronic device, medium and product
CN112598643A (en) Depth counterfeit image detection and model training method, device, equipment and medium
US20230030431A1 (en) Method and apparatus for extracting feature, device, and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
US20230237763A1 (en) Image processing method and system
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
CN114429675A (en) Motion recognition method, model training method and device and electronic equipment
CN110909578A (en) Low-resolution image recognition method and device and storage medium
CN113239807A (en) Method and device for training bill recognition model and bill recognition
CN112822506A (en) Method and apparatus for analyzing video stream
Mohandas et al. On the use of deep learning enabled face mask detection for access/egress control using TensorFlow Lite based edge deployment on a Raspberry Pi
CN112825125A (en) Sign language recognition method and device, computer storage medium and electronic equipment
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN114155606A (en) Semantic recognition method based on human body action analysis and related device
Shane et al. Sign Language Detection Using Faster RCNN Resnet
CN117373034A (en) Method and system for identifying background information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant after: Jingdong Technology Holding Co.,Ltd.
Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant before: Jingdong Digital Technology Holding Co.,Ltd.
Address after: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant after: Jingdong Digital Technology Holding Co.,Ltd.
Address before: Room 221, 2/F, Block C, 18 Kechuang 11th Street, Daxing District, Beijing, 100176
Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20210521