CN115937968A - Sign language action recognition method and discretization coding model training method

Info

Publication number
CN115937968A
Authority
CN
China
Prior art keywords
sign language
recognized
image frame
language action
features
Prior art date
Legal status
Pending
Application number
CN202211304075.XA
Other languages
Chinese (zh)
Inventor
王琪
张邦
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202211304075.XA
Publication of CN115937968A

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a sign language action recognition method and a discretization coding model training method. The method comprises the following steps: collecting an image frame set in the process in which a biological object outputs a sign language action to be recognized; performing discretization coding on the image frames in the image frame set to obtain discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized; and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized. The method and the device solve the technical problem of low recognition accuracy of sign language actions in the related technology, and improve the interaction experience.

Description

Sign language action recognition method and discretization coding model training method
Technical Field
The application relates to the field of data processing, in particular to a sign language action recognition method and a discretization coding model training method.
Background
At present, sign language translation can help hearing-impaired people communicate better with others by converting sign language into natural language. The sign language recognition task is the first step of sign language translation, and its aim is to recognize the vocabulary corresponding to each sign language action in an input video. However, the recognition modes currently adopted have low recognition accuracy for the sign language actions in the video.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide a sign language action recognition method and a discretization coding model training method, so as to at least solve the technical problem of low recognition accuracy of sign language actions in the related technology.
According to an aspect of an embodiment of the present application, there is provided a method for recognizing a sign language action, including: collecting an image frame set in the process of outputting sign language actions to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized.
According to an aspect of the embodiments of the present application, there is provided a training method of a discretized coding model, including: acquiring a training image frame set, wherein the training image frame set comprises training sign language actions output by the same biological object; carrying out discretization coding on the training image frames in the training image frame set by using a discretization coding model to obtain discrete features of the training sign language action; carrying out image reconstruction on the discrete features of the training sign language action by using a decoder model to obtain a reconstructed image frame set; and adjusting model parameters of the discretized coding model and the decoder model based on the training image frame set and the reconstructed image frame set.
According to an aspect of the embodiments of the present application, there is provided a method for recognizing a sign language action, including: responding to an input instruction acting on the operation interface, and displaying an image frame set on the operation interface, wherein the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; responding to a recognition instruction acting on an operation interface, and displaying a recognition result of the sign language action to be recognized on the operation interface, wherein the recognition result is used for representing the category of the vocabulary to be recognized, the recognition result is obtained by classifying discrete features of the sign language action to be recognized, the discrete features of the sign language action to be recognized are obtained by carrying out discretization coding on image frames in an image frame set, and the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized.
According to an aspect of the embodiments of the present application, there is provided a method for recognizing a sign language action, including: displaying an image frame set on a presentation picture of Virtual Reality (VR) equipment or Augmented Reality (AR) equipment, wherein the image frame set is acquired in the process of outputting sign language actions to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; classifying discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and driving the VR equipment or the AR equipment to render and display the recognition result of the sign language action to be recognized.
According to an aspect of an embodiment of the present application, there is provided a method for recognizing a sign language action, including: acquiring an image frame set by calling a first interface, wherein the first interface comprises a first parameter, the parameter value of the first parameter is the image frame set, and the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized, which are represented by the sign language action to be recognized; classifying discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and outputting the recognition result of the sign language action to be recognized by calling a second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the recognition result of the sign language action to be recognized.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium including a stored program, wherein the program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method of any one of the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic device, including: an acquisition device used for acquiring an image frame set in the process that the biological object outputs the sign language action to be recognized; and a processor running a program, wherein the program is run to perform the method of any of the above embodiments on data output from the acquisition device.
In the embodiment of the application, firstly, an image frame set in the process of outputting sign language actions to be recognized by a biological object is collected; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized, so that the accuracy of the recognition result of the sign language action to be recognized is improved. It is easy to notice that the discrete coding can be carried out on the image frames in the image frame set to obtain the discrete characteristics of the sign language action to be recognized, and the characteristics of the granularity of the sign language vocabulary can be better modeled, so that the sign language action to be recognized is more accurately recognized, the technical problem of lower recognition accuracy of the sign language action in the related technology is solved, and the interactive experience is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of a hardware environment of a virtual reality device according to an embodiment of the present application;
FIG. 2 is a block diagram of a computing environment for a method of sign language action recognition according to an embodiment of the present application;
fig. 3 is a flowchart of a sign language action recognition method according to embodiment 1 of the present application;
FIG. 4 is a schematic diagram of sign language discretization representation learning according to an embodiment of the subject application;
FIG. 5 is a schematic diagram of a sign language identification process according to an embodiment of the application;
FIG. 6 is a flow chart of a model training method according to embodiment 2 of the present application;
fig. 7 is a flowchart of a sign language action recognition method according to embodiment 3 of the present application;
fig. 8 is a flowchart of a sign language action recognition method according to embodiment 4 of the present application;
fig. 9 is a flowchart of a sign language action recognition method according to embodiment 5 of the present application;
fig. 10 is a schematic diagram of a sign language action recognition apparatus according to embodiment 6 of the present application;
FIG. 11 is a schematic view of a model training apparatus according to embodiment 7 of the present application;
FIG. 12 is a schematic view of a model training apparatus according to embodiment 8 of the present application;
fig. 13 is a schematic view of a sign language action recognition apparatus according to embodiment 9 of the present application;
fig. 14 is a schematic diagram of a sign language action recognition apparatus according to embodiment 10 of the present application;
fig. 15 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms or terms appearing in the description of the embodiments of the present application are applicable to the following explanations:
sign language recognition: identifying the corresponding sign language vocabulary category according to the continuous sign language video;
image representation learning: learning a low-dimensional latent variable representation containing semantic information from a high-dimensional image input.
At present, a sign language video is generally sampled to generate image frames, image features are extracted from each image frame, and the image features are then decoded to obtain a recognition result of the sign language video.
The application provides a sign language action recognition method, which can better model the features of sign language vocabulary granularity in a discretization manner and obtain a better recognition result.
Example 1
There is also provided, in accordance with an embodiment of the present application, a method for sign language action recognition, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than that presented herein.
The above method embodiments referred to in the disclosure may be performed in a mobile terminal, a computer terminal or a similar computing device. Taking the mobile terminal as an example, the mobile terminal may be a terminal device such as a smart phone, a tablet computer, a palm computer, a mobile internet device, a PAD, a game machine, and the like. Fig. 1 is a schematic diagram of a hardware environment of a virtual reality device according to an embodiment of the present application. As shown in fig. 1, the virtual reality device 104 is connected to the terminal 106, and the terminal 106 is connected to the server 102 via a network. The virtual reality device 104 includes but is not limited to: a virtual reality helmet, virtual reality glasses, a virtual reality all-in-one machine, and the like; the terminal 106 is not limited to a PC, a mobile phone, a tablet computer, etc. The server 102 may be a server corresponding to a media file operator, and the network includes but is not limited to: a wide area network, a metropolitan area network, or a local area network.
Optionally, the virtual reality device 104 of this embodiment includes: memory, processor, and transmission means. The memory is used for storing an application program, and the application program can be used for acquiring an image frame set in a process that a biological object outputs a sign language action to be recognized; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized, which are represented by the sign language action to be recognized; the discrete features of the sign language action to be recognized are classified to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized, and therefore the technical problem that the recognition accuracy of the sign language action is low in the related technology is solved.
The terminal of this embodiment may be configured to display the recognition result on a presentation screen of a Virtual Reality (VR) device or an Augmented Reality (AR) device. Specifically, the method comprises: displaying an image frame set on the presentation screen of the virtual reality VR device or the augmented reality AR device; performing discretization coding on the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized; classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized; and driving the VR device or the AR device to display the recognition result of the sign language action to be recognized.
In some optional embodiments in which user interaction is dominant, the device may further provide a human-machine interface with a touch-sensitive surface, which may sense finger contacts and/or gestures for interacting with a Graphical User Interface (GUI). The human-machine interaction functions may include the following interactions: creating web pages, drawing, word processing, making electronic documents, games, video conferencing, instant messaging, emailing, call interfacing, playing digital video, playing digital music, and/or web browsing, etc. Executable instructions for performing the above human-computer interaction functions are configured/stored in a processor-executable computer program product or readable storage medium.
Fig. 1 shows a block diagram of a hardware structure, which may serve as an exemplary block diagram not only of the AR/VR device (or mobile device) described above but also of the server described above. In an alternative embodiment, fig. 2 shows, in a block diagram, an embodiment that uses the AR/VR device (or mobile device) shown in fig. 1 as a computing node in a computing environment 201.
Fig. 2 is a block diagram of a computing environment of a sign language action recognition method according to an embodiment of the present application. As shown in fig. 2, the computing environment 201 includes a plurality of computing nodes (e.g., servers) (shown as 210-1, 210-2, … in the figure) running on a distributed network. Each computing node contains local processing and memory resources, and an end user 202 can remotely run applications or store data within the computing environment 201. The application programs may be provided as a plurality of services 220-1, 220-2, 220-3, and 220-4 in the computing environment 201, representing services "A", "D", "E", and "H", respectively.
The end user 202 may provide and access the services through a web browser or other software application on a client. In some embodiments, the provisioning and/or requests of the end user 202 may be provided to an ingress gateway 230. The ingress gateway 230 may include a corresponding agent to handle provisioning and/or requests for the services 220 (one or more services provided in the computing environment 201).
The services 220 are provided or deployed according to various virtualization technologies supported by the computing environment 201. In some embodiments, the services 220 may be provided according to Virtual Machine (VM) based virtualization, container based virtualization, and/or the like. Virtual machine-based virtualization may simulate a real computer by initializing a virtual machine, executing programs and applications without directly contacting any actual hardware resources. While a virtual machine virtualizes the entire machine, container-based virtualization launches containers that share a single Operating System (OS) instance, so that multiple workloads may run on one operating system instance.
In one embodiment of container-based virtualization, several containers of a service 220 may be assembled into one POD (e.g., a Kubernetes POD). For example, as shown in FIG. 2, the service 220-2 may be equipped with one or more PODs 240-1, 240-2, …, 240-N (collectively referred to as PODs 240). Each POD 240 may include an agent 245 and one or more containers 242-1, 242-2, …, 242-M (collectively containers 242). One or more containers 242 in a POD 240 handle requests associated with one or more corresponding functions of the service, and the agent 245 generally controls network functions associated with the service, such as routing, load balancing, and the like. Other services 220 may likewise be equipped with PODs similar to POD 240.
In operation, executing a user request from the end user 202 may require invoking one or more services 220 in the computing environment 201, and executing one or more functions of one service 220 may require invoking one or more functions of another service 220. As shown in FIG. 2, service "A" 220-1 receives the user request of the end user 202 from the ingress gateway 230, service "A" 220-1 may invoke service "D" 220-2, and service "D" 220-2 may request service "E" 220-3 to perform one or more functions.
The computing environment described above may be a cloud computing environment, where the allocation of resources is managed by a cloud service provider, allowing functionality to be developed without regard to implementing, tuning, or scaling servers. The computing environment allows developers to execute code that responds to events without building or maintaining a complex infrastructure. Rather than extending a single hardware device to handle potential loads, a service may be split into a set of functions that can be scaled independently and automatically.
Under the above operating environment, the present application provides a sign language action recognition method as shown in fig. 3. It should be noted that the sign language action recognition method according to this embodiment may be executed by the mobile terminal according to the embodiment shown in fig. 1. Fig. 3 is a flowchart of a sign language action recognition method according to embodiment 1 of the present application. As shown in fig. 3, the method may include the steps of:
step S302, collecting an image frame set in the process that the biological object outputs the sign language action to be recognized.
The biological object may be a human, an animal, or the like, and is not limited thereto. The present application is described by taking a sign language teacher as the biological object as an example, but is not limited thereto.
The sign language action to be recognized can be a sign language action expressed by the biological object through limbs.
The sign language action to be recognized may be a sign language action corresponding to one vocabulary; the sign language action corresponding to one vocabulary may include a plurality of sign language actions, and the image frames corresponding to the sign language action of one vocabulary may be regarded as one image frame set.
The sign language action to be recognized may also be a sign language action corresponding to a plurality of vocabularies; the sign language action corresponding to the plurality of vocabularies may include a plurality of sign language actions, and the image frames corresponding to the sign language action of the plurality of vocabularies may be regarded as one image frame set.
The sign language action to be recognized may also be a sign language action corresponding to one complete sentence or a plurality of complete sentences; such a sign language action may include a plurality of sign language actions, and the image frames corresponding to the sign language action of one or more complete sentences may be regarded as one image frame set.
In an optional embodiment, a video image of the biological object in the process of outputting the sign language action to be recognized may be acquired, and an image frame set of the biological object in the process of outputting the sign language action to be recognized may be obtained by collecting the video image at preset intervals, where the image frame set may include a plurality of consecutive image frames.
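As a non-limiting illustration of this sampling step, the following minimal sketch assumes OpenCV is available; the video path, the sampling interval, and the function name are hypothetical and not part of the original disclosure.

    import cv2

    def collect_image_frames(video_path, interval=4):
        # Sample one frame every `interval` frames to form the image frame set.
        capture = cv2.VideoCapture(video_path)
        frames = []
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if index % interval == 0:
                frames.append(frame)  # consecutive sampled frames form the set
            index += 1
        capture.release()
        return frames

    frames = collect_image_frames("signer.mp4", interval=4)  # hypothetical input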
And step S304, carrying out discretization coding on the image frames in the image frame set to obtain the discrete characteristics of the sign language action to be recognized.
The discrete features are used for representing the features of the vocabulary to be recognized represented by the gesture language action to be recognized.
The discretization coding converts the sign language action in the image frames into the discrete features of the vocabulary to be recognized corresponding to the sign language action, so that the recognition granularity can be refined and the recognition accuracy can be improved.
The discrete features can be expressed by the words to be recognized corresponding to the sign language actions to be recognized in the forms of numbers, letters, symbols and the like. By obtaining the discrete characteristics of the gesture language action to be recognized, the memory resource occupied by recognizing the gesture language action to be recognized can be reduced.
In an alternative embodiment, feature extraction may be performed on the image frames in the image frame set to obtain a feature sequence of the image frames, the feature sequence may be converted into a discrete index sequence, and a table may be looked up in a Codebook (Codebook) to obtain discrete features corresponding to the index sequence, where the discrete features are used to represent features of the vocabulary to be recognized corresponding to features of the gesture language action to be recognized.
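The table lookup described above can be sketched as follows, assuming PyTorch; the codebook size, feature width, and nearest-neighbour (L2) matching rule are illustrative assumptions rather than details fixed by this application.

    import torch

    num_codes, dim = 1024, 256               # assumed Codebook size / feature width
    codebook = torch.randn(num_codes, dim)   # the Codebook of preset features

    def quantize(features):
        # features: (T, dim) per-frame features -> discrete index sequence + lookup
        distances = torch.cdist(features, codebook)   # (T, num_codes) L2 distances
        indices = distances.argmin(dim=1)             # discrete index sequence
        return indices, codebook[indices]             # table lookup in the Codebook

    features = torch.randn(16, dim)                   # features of 16 image frames
    indices, discrete_features = quantize(features)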
And S306, classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized.
And the recognition result is used for representing the category of the vocabulary to be recognized.
In an optional embodiment, since the discrete features occupy few memory and computing resources, classifying the discrete features of the sign language action to be recognized can increase the recognition speed of the sign language action to be recognized, thereby improving its recognition efficiency.
In another alternative embodiment, the above-mentioned scheme may be applied to a virtual reality sign language translation scene, and the recognition result of the sign language action to be recognized may be displayed in the sign language action video of the biological object, so that when viewing the sign language action of the biological object, other users may synchronously see the category of the vocabulary to be recognized represented by the sign language action and the meaning specifically represented by the sign language action.
Taking a sign language teacher as the biological object for explanation: an image frame set may be collected while the sign language teacher outputs the sign language action to be recognized; the discrete features of the sign language action to be recognized may be looked up in a Codebook according to the features of the image frames, and the discrete features may be classified to obtain a recognition result of the sign language action to be recognized by the sign language teacher, where the recognition result may be the category of the vocabulary the sign language teacher intends to express through sign language.
Taking a hearing-impaired person communicating as another example: an image frame set may be collected while the hearing-impaired person outputs the sign language action to be recognized; the discrete features of the sign language action to be recognized may be represented by, but are not limited to, numbers. Optionally, the discrete features corresponding to the image features may be looked up in a Codebook, and the discrete features may be classified to obtain a recognition result of the sign language action, where the recognition result may be the category of the vocabulary the hearing-impaired person needs to express.
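To make the classification step concrete, here is a hedged sketch, assuming PyTorch, of a Transformer-based classification network over discrete features; the layer sizes, head count, and class count are our assumptions, not details fixed by this application.

    import torch
    import torch.nn as nn

    class SignClassifier(nn.Module):
        def __init__(self, dim=256, num_classes=1000):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.classifier = nn.Linear(dim, num_classes)   # classification layer

        def forward(self, discrete_features):               # (B, T, dim)
            hidden = self.encoder(discrete_features)
            return self.classifier(hidden)                   # per-frame class logits

    logits = SignClassifier()(torch.randn(1, 16, 256))       # one 16-frame set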
In the embodiment of the application, firstly, an image frame set in the process of outputting a sign language action to be recognized by a biological object is collected; discretization coding is performed on the image frames in the image frame set to obtain discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized; and the discrete features of the sign language action to be recognized are classified to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized, so that the accuracy of the recognition result of the sign language action to be recognized is improved. It is easy to notice that discretization coding can be performed on the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized, and the features of sign language vocabulary granularity can be better modeled, so that the sign language action to be recognized is recognized more accurately, and the technical problem of lower recognition accuracy of sign language actions in the related technology is solved.
In the above embodiments of the present application, discretizing and encoding the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized includes: and carrying out discretization coding on the image frame by using the discretization coding model to obtain the discretization characteristic of the gesture language action to be recognized.
The discretization coding model may be a Vector Quantized Variational AutoEncoder (VQ-VAE), but is not limited thereto; other models capable of performing discretization coding may also be used.
In an alternative embodiment, the discretization coding model can be used for discretization coding the image frame according to the sign language action to be recognized in the image frame, so as to obtain the discrete features of the sign language action to be recognized.
In the above embodiments of the present application, the discretized coding model includes an encoder model and a discretization vocabulary, wherein performing discretization coding on the image frame by using the discretized coding model to obtain the discrete features of the sign language action to be recognized includes: performing feature extraction on the image frame by using the encoder model to obtain image features of the image frame; discretizing the image features of the image frame to obtain intermediate features corresponding to the image features; and acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the sign language action to be recognized, wherein the discretization vocabulary stores a preset quantity of preset features, and permutations and combinations of different quantities of preset features are used for representing the vocabularies expressed by different sign language actions.
The encoder model described above may be an encoder network (Encoder).
The discretized vocabulary can be a codebook.
The intermediate features described above may be discrete index sequences, each index in the index sequence may represent a sign language action.
The preset feature may be a word feature, but is not limited thereto, and word features of words represented by different sign language actions may be stored in the discretized vocabulary. The preset features can also be sentence features, and words represented by different sign language actions can be obtained by arranging and combining different numbers of sentence features.
In an optional embodiment, feature extraction may be sequentially performed on image frames in the image frame set by using an encoder model to obtain an image feature of each image frame in the image frame set, where the image feature includes a feature of a sign language action to be recognized, discretization may be performed on the image feature of each image frame in the image frame set to obtain a discretized intermediate feature, and a preset feature corresponding to the intermediate feature may be searched from a discretization vocabulary to obtain a discretization feature of the sign language action to be recognized. The image features of the image frame are discretized, so that the discretized features which occupy less memory resources and less operation resources are used in the searching process, the speed of searching the preset features corresponding to the sign language action to be recognized in the image frame can be increased, and the efficiency of recognizing the sign language action to be recognized is improved.
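Combining the encoder model and the discretization vocabulary, a minimal end-to-end sketch of this lookup pipeline might look as follows (PyTorch assumed; the convolutional layout, feature width, and vocabulary size are illustrative assumptions only):

    import torch
    import torch.nn as nn

    class FrameEncoder(nn.Module):
        # Encoder model: extracts an image feature vector from each image frame.
        def __init__(self, dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.proj = nn.Linear(128, dim)

        def forward(self, frames):                      # (T, 3, H, W)
            pooled = self.conv(frames).flatten(1)       # (T, 128)
            return self.proj(pooled)                    # (T, dim) image features

    encoder = FrameEncoder()
    vocab = torch.randn(1024, 256)                      # discretization vocabulary
    feats = encoder(torch.randn(16, 3, 128, 128))       # a 16-frame image frame set
    indices = torch.cdist(feats, vocab).argmin(dim=1)   # intermediate features
    discrete = vocab[indices]                           # preset features looked up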
In the above embodiment of the application, a model parameter of the discretization coding model is adjusted based on a first loss value, the first loss value is constructed based on a training image frame set and a reconstruction image frame set, the reconstruction image frame set is obtained by image reconstruction of a splicing feature by using a decoder model, the splicing feature is obtained by splicing image features of reference image frames in the training image frame set with discrete features of training sign language actions included in the training image frame set, the image features of the reference image frames are obtained by feature extraction of the reference image frames by using a feature extraction model, and the discrete features of the training sign language actions are obtained by discretization coding of training image frames in the training image frame set by using the discretization coding model.
The training image frame set may be an image frame set corresponding to an image frame containing a sign language motion to be recognized.
The reconstructed image frame set may be a reconstructed image frame set obtained by performing discretization encoding on the training image frame set through a discretization encoding model to obtain a discretization feature and reconstructing according to the discretization feature.
The first loss value can be constructed from the training image frame set and the reconstructed image frame set. If the reconstructed image frame set differs greatly from the training image frame set, the discretization coding accuracy of the discretization coding model is low; in this case, the model parameters of the discretization coding model need to be adjusted according to the first loss value to improve the accuracy of the discretization coding model. If the difference between the reconstructed image frame set and the training image frame set is small, the discretization coding accuracy of the discretization coding model is higher. The first loss value may be the loss value of an L2-norm loss function, or the loss value of any other type of loss function, which is not limited herein.
The reference image frame may be a reference image frame with higher definition in the training image frame set or a reference image frame with more complete sign language action. The reference image frame may be any one of a set of training image frames. The reference image frame may also be an image frame determined by the user from a set of training image frames.
The feature extraction model may be a model (Transformer) that utilizes an attention mechanism to increase the training speed of the model.
In an optional embodiment, a discretization coding model can be used for discretization coding of training image frames in a training image frame set to obtain discrete features of training sign language actions, a reference image frame can be extracted from the training image frame set, the image features of the reference image frame can be spliced with the discrete features of the training sign language actions respectively to obtain a splicing feature, and the splicing feature contains the image features of the reference image, so that when the splicing feature is subjected to image reconstruction, a reconstructed image frame set with higher accuracy can be reconstructed.
In another alternative embodiment, the training image frame set may be displayed to a user, the user may determine a reference image frame from the training image frame set, may perform feature extraction on the reference image frame to obtain image features of the reference image frame, and may respectively splice the image features of the reference image frame with discrete features of a training sign language action included in the training image frame set to obtain a spliced feature.
Fig. 4 is a schematic diagram of sign language discretization representation learning according to an embodiment of the present application. The input may be isolated sign language vocabulary data, where each video corresponds to one sign language vocabulary, expressed as training image frames in a training image frame set. A discretization coding model may be used to perform discretization coding on the training image frames in the training image frame set to obtain discrete features (vectors) of the sign language action to be recognized. Optionally, the image features of the training image frames may be discretized to obtain intermediate features corresponding to the image features, and the word features corresponding to the intermediate features may be obtained from a discretization vocabulary, thereby obtaining the discrete features of the sign language action to be recognized. Optionally, the word features may be spliced with the image features of a reference image frame to obtain spliced features, and the spliced features may be reconstructed into image frames by a decoder model composed of multiple deconvolution neural network layers (Deconvolution Layers), thereby obtaining a reconstructed image frame set. A first loss value may then be constructed from the training image frame set and the reconstructed image frame set, and the model parameters of the discretization coding model may be adjusted according to the first loss value.
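A hedged sketch of this splice-and-reconstruct step follows (PyTorch assumed): the reference frame's image feature is concatenated with each discrete feature and decoded back into a frame by stacked deconvolution layers. All shapes, layer sizes, and names here are our assumptions, not details fixed by this application.

    import torch
    import torch.nn as nn

    class FrameDecoder(nn.Module):
        def __init__(self, dim=512):                    # width of a spliced feature
            super().__init__()
            self.fc = nn.Linear(dim, 128 * 8 * 8)
            self.deconv = nn.Sequential(                # multi-layer deconvolution
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            )

        def forward(self, spliced):                     # (T, dim)
            maps = self.fc(spliced).view(-1, 128, 8, 8)
            return self.deconv(maps)                    # (T, 3, 32, 32) frames

    reference = torch.randn(1, 256)                     # reference-frame feature
    discrete = torch.randn(16, 256)                     # discrete action features
    spliced = torch.cat([reference.expand(16, -1), discrete], dim=1)
    reconstructed = FrameDecoder()(spliced)             # reconstructed frame set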
In the above embodiment of the present application, classifying the discrete features of the sign language action to be recognized to obtain the recognition result of the sign language action to be recognized includes: classifying the discrete features of the sign language action to be recognized by using a recognition model to obtain the recognition result.
The recognition model may be a classification network (Transformer), wherein the classification network may include a classification layer (classifier).
In an optional embodiment, the discrete features of the sign language action to be recognized can be classified by using the recognition model; classifying the discrete features can improve the classification efficiency and thus the efficiency of obtaining the recognition result.
In the above embodiments of the present application, the model parameter of the recognition model is adjusted based on the second loss value, the second loss value is determined based on the recognition result of the sign language action and the received feedback result, and the feedback result is obtained by modifying the recognition result after the recognition result is output.
In an optional embodiment, the recognition result of the sign language action may be output to the client of the user. If the accuracy of the recognition result is low, the user may modify the recognition result; after the recognition result is modified, a feedback result is obtained. A second loss value may be determined according to the recognition result and the feedback result, and the model parameters of the recognition model may be adjusted according to the second loss value, thereby improving the recognition effect of the recognition model. The second loss value may be the loss value of a Connectionist Temporal Classification loss function (CTC Loss), or the loss value of any other type of loss function, which is not limited herein.
Illustratively, if the sign language action is intended to represent "hello" but the output recognition result is "bye-bye", the user may modify the result from "bye-bye" to "hello"; a second loss value is then determined according to "hello" and "bye-bye", and the model parameters of the recognition model may be adjusted according to the second loss value.
Fig. 5 is a schematic diagram of a sign language recognition process according to an embodiment of the present application. A continuous sign language video containing a plurality of sign language vocabularies may be input; the sign language video may be encoded by the discretization coding model to obtain the discrete features of the sign language actions to be recognized, and the discrete features may be classified by the recognition model to obtain the categories of the sign language actions to be recognized, that is, the recognition result. Optionally, the discrete features may be input into a Transformer for vocabulary classification, so as to output a category sequence of vocabularies and complete the classification task. After the recognition result is obtained, the recognition result of the sign language action can be output; a feedback result obtained by modifying the recognition result is received; a second loss value is determined based on the recognition result and the feedback result; and the model parameters of the recognition model are adjusted based on the second loss value.
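As a non-limiting illustration of how such a second loss value could be computed, the sketch below assumes PyTorch's nn.CTCLoss; the sequence length, batch size, class count, and the corrected label ids are all hypothetical.

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)
    T, B, num_classes = 16, 1, 1000                       # assumed dimensions
    logits = torch.randn(T, B, num_classes, requires_grad=True)
    log_probs = logits.log_softmax(2)                     # per-frame log-probs
    feedback = torch.tensor([[7, 42]])                    # user-corrected word ids
    input_lengths = torch.tensor([T])
    target_lengths = torch.tensor([2])

    second_loss = ctc(log_probs, feedback, input_lengths, target_lengths)
    second_loss.backward()                                # drives parameter updates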
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present application.
Example 2
There is also provided, in accordance with an embodiment of the present application, a method for training a discretized coding model, where it is noted that the steps illustrated in the flowchart of the figure can be performed in a computer system, such as a set of computer-executable instructions, and where, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described can be performed in an order different than here.
Fig. 6 is a flowchart of a training method of a discretized coding model according to embodiment 2 of the present application, and as shown in fig. 6, the method may include the following steps:
step S602, a training image frame set is obtained.
Wherein the training image frame set comprises training sign language actions output by the same biological object.
The same biological object described above may refer to the same type of biological object, where the biological object may be a human, the same kind of animal, or the like.
The training image frame set may be an image frame set corresponding to an image frame containing a sign language motion to be recognized.
Step S604, carrying out discretization coding on the training image frames in the training image frame set by using the discretization coding model to obtain the discrete features of the training sign language action.
And step S606, utilizing a decoder model to carry out image reconstruction on the discrete features of the training sign language action to obtain a reconstructed image frame set.
Step S608, adjusting model parameters of the discretized coding model and the decoder model based on the training image frame set and the reconstructed image frame set.
Through the above steps, the discretization coding model can be used to perform discretization coding on the training image frames in the training image frame set to obtain the discrete features of the training sign language action, so that fewer resources are occupied to express the training sign language action. The decoder model can be used to perform image reconstruction on the discrete features of the training sign language action to obtain a reconstructed image frame set, and the model parameters of the discretization coding model and the decoder model can be adjusted according to the training image frame set and the reconstructed image frame set, thereby improving the accuracy of discretization coding of the training image frames by the discretization coding model and the accuracy of reconstruction of the discrete features by the decoder model.
In the above embodiment of the present application, performing image reconstruction on the discrete features of the training sign language action by using the decoder model to obtain a reconstructed image frame set includes: performing feature extraction on a reference image frame in the training image frame set by using a feature extraction model to obtain image features of the reference image frame; splicing the image features of the reference image frame with the discrete features of the training sign language action to obtain spliced features; and performing image reconstruction on the spliced features by using the decoder model to obtain the reconstructed image frame set.
In an optional embodiment, the feature extraction model may be used to perform feature extraction on a reference image frame in the training image frame set to obtain the image features of the reference image frame, and the image features of the reference image frame may be respectively spliced with the discrete features of the training sign language action to obtain the spliced features.
In the above embodiments of the present application, the discretized coding model includes an encoder model and a discretization vocabulary, wherein performing discretization coding on the training image frames in the training image frame set by using the discretized coding model to obtain the discrete features of the training sign language action includes: performing feature extraction on the training image frames by using the encoder model to obtain the image features of the training image frames; discretizing the image features of the training image frames to obtain intermediate features corresponding to the image features; and acquiring preset features corresponding to the intermediate features from the discretization vocabulary, wherein permutations and combinations of different quantities of preset features are used for representing the vocabularies expressed by different sign language actions.
The image features of the image frame are discretized, so that the discretized features which occupy less memory resources and less operation resources are used in the searching process, the speed of searching the preset features corresponding to the sign language action to be recognized in the image frame can be increased, and the efficiency of recognizing the sign language action to be recognized is improved.
In the above embodiment of the present application, adjusting the model parameters of the discretized coding model and the decoder model based on the training image frame set and the reconstructed image frame set includes: constructing a loss value of a preset loss function based on the training image frame set and the reconstructed image frame set; and adjusting the model parameters of the discretized coding model and the decoder model based on the loss value.
The preset loss function may be the L2 norm.
In an optional embodiment, a loss value of the preset loss function may be constructed according to the training image frame set and the reconstructed image frame set, and the model parameters of the discretized coding model and the decoder model may be adjusted according to the loss value, thereby improving the accuracy of the discretized coding model and the decoder model.
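Putting the pieces of this embodiment together, a minimal training sketch under the same assumptions as the earlier snippets might look as follows; the straight-through estimator and the commitment terms are standard VQ-VAE practice added here for trainability, not details fixed by this application.

    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
    decoder = nn.Linear(512, 3 * 32 * 32)
    codebook = nn.Parameter(torch.randn(1024, 256))       # discretization vocabulary
    optimizer = torch.optim.Adam(
        [*encoder.parameters(), *decoder.parameters(), codebook], lr=1e-4)

    frames = torch.randn(16, 3, 32, 32)                   # training image frame set
    for _ in range(10):                                   # a few illustrative steps
        feats = encoder(frames)                           # (16, 256) image features
        indices = torch.cdist(feats, codebook).argmin(dim=1)
        quantized = codebook[indices]
        # straight-through estimator so gradients reach the encoder through
        # the non-differentiable table lookup
        discrete = feats + (quantized - feats).detach()
        spliced = torch.cat([feats[:1].expand(16, -1), discrete], dim=1)
        recon = decoder(spliced).view_as(frames)          # reconstructed frame set
        loss = ((recon - frames) ** 2).mean()             # preset L2 loss
        loss = loss + ((quantized - feats.detach()) ** 2).mean()        # codebook
        loss = loss + 0.25 * ((feats - quantized.detach()) ** 2).mean() # commitment
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()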
It should be noted that the preferred embodiments described in the above examples of the present application are the same as the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 3
There is also provided, in accordance with an embodiment of the present application, a method for sign language action recognition, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 7 is a flowchart of a sign language action recognition method according to embodiment 3 of the present application, and as shown in fig. 7, the method may include the following steps:
step S702, in response to an input instruction acting on the operation interface, displays the image frame set on the operation interface.
The image frame set is acquired in the process of outputting sign language actions to be recognized by the biological objects.
The input instruction may be obtained by a user operating an operation interface.
And step S704, responding to the identification instruction acted on the operation interface, and displaying the identification result of the sign language action to be identified on the operation interface.
The recognition result is used for representing the category of the vocabulary to be recognized, the recognition result is obtained by classifying the discrete features of the sign language action to be recognized, the discrete features of the sign language action to be recognized are obtained by carrying out discretization coding on the image frames in the image frame set, and the discrete features are used for representing the features of the vocabulary to be recognized, which are represented by the sign language action to be recognized.
The recognition instruction may be obtained by a user operating the operation interface.
Through the above steps, the image frame set can be displayed on the operation interface in response to the input instruction acting on the operation interface, so that a user can view the image frame set and determine, according to the displayed image frame set, whether it needs to be recognized. If recognition is needed, discretization coding can be performed on the image frame set to obtain the discrete features of the sign language action to be recognized, the discrete features can be classified to obtain the recognition result, and the recognition result of the sign language action to be recognized is displayed on the operation interface. Processing the image frame set in this interactive manner to obtain the recognition result of the sign language action to be recognized can improve the convenience of user operation.
It should be noted that the preferred embodiments described in the above examples of the present application are the same as the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 4
There is also provided, in accordance with an embodiment of the present application, a method for sign language action recognition, it being noted that the steps illustrated in the flowchart of the drawings may be carried out in a computer system such as a set of computer executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be carried out in an order different than here.
Fig. 8 is a flowchart of a sign language action recognition method according to embodiment 4 of the present application, and as shown in fig. 8, the method may include the following steps:
step S802, an image frame set is displayed on a presentation screen of the virtual reality VR device or the augmented reality AR device.
The image frame set is acquired in the process of outputting sign language actions to be recognized by the biological objects.
Presenting the image frame set on a presentation screen of a virtual reality VR device or an augmented reality AR device can broaden the applicable scenarios for recognizing the image frame set.
Step S804, discretizing and coding the image frames in the image frame set to obtain the discrete characteristics of the sign language action to be recognized.
The discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized.
And step S806, classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized.
The recognition result is used for representing the category of the vocabulary to be recognized.
and step S808, driving the VR equipment or the AR equipment to render and display the recognition result of the sign language action to be recognized.
Through the above steps, the image frame set is displayed on the presentation screen of the virtual reality VR device or the augmented reality AR device; the image frames in the image frame set can then be discretized and coded to obtain the discrete features of the sign language action to be recognized, the discrete features are classified to obtain the recognition result of the sign language action to be recognized, and the VR device or the AR device is driven to render and display that recognition result. That is to say, the image frames in the image frame set can be discretized and coded to obtain the discrete features of the sign language action to be recognized, so that features at the granularity of sign language vocabulary are better modeled, the sign language action to be recognized is recognized more accurately, and the technical problem of low recognition accuracy of sign language actions in the related art is solved.
Alternatively, in this embodiment, the sign language action recognition method may be applied to a hardware environment formed by a server and a virtual reality device, so as to control the VR device or the AR device to perform the human-computer interaction operation corresponding to the recognition result of the sign language action. The server may be a server corresponding to a media file operator, and the network includes, but is not limited to: a wide area network, a metropolitan area network, or a local area network. The virtual reality device is not limited to: a virtual reality helmet, virtual reality glasses, a virtual reality all-in-one machine, and the like.
Optionally, the virtual reality device comprises: a memory, a processor, and a transmission device. The memory is used for storing an application program, and the application program can be used for executing the following steps: displaying an image frame set on a presentation screen of the virtual reality VR device or the augmented reality AR device; performing discretization coding on the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized; classifying the discrete features of the sign language action to be recognized to obtain the recognition result of the sign language action to be recognized; and driving the VR device or the AR device to render and display the recognition result of the sign language action to be recognized.
It should be noted that the sign language broadcasting method applied to the VR device or the AR device in this embodiment may include the method of the embodiment shown in fig. 8, so as to control the VR device or the AR device to perform the human-computer interaction operation corresponding to the sign language action to be recognized.
Alternatively, the processor of this embodiment may call the application stored in the memory through the transmission device to execute the above steps. The transmission device may receive the image frame set sent by the server through the network, and may also be used for data transmission between the processor and the memory.
Optionally, the virtual reality device is provided with a head-mounted display (HMD) with eye tracking. The screen of the HMD is used for displaying video pictures; an eye tracking module in the HMD is used for acquiring the real-time movement track of the user's eyes; a tracking system is used for tracking the position information and movement information of the user in the real three-dimensional space; and a calculation processing unit is used for acquiring the real-time position and movement information of the user from the tracking system and calculating the three-dimensional coordinates of the user's head in the virtual three-dimensional space, the user's visual field orientation in the virtual three-dimensional space, and the like.
In this embodiment of the present application, the virtual reality device may be connected to a terminal, and the terminal is connected to the server through a network. The virtual reality device is not limited to: a virtual reality helmet, virtual reality glasses, a virtual reality all-in-one machine, and the like; the terminal is not limited to a PC, a mobile phone, a tablet computer, and the like. The server may be a server corresponding to an image frame set operator, and the network includes, but is not limited to: a wide area network, a metropolitan area network, or a local area network.
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 5
There is also provided, in accordance with an embodiment of the present application, a sign language action recognition method. It should be noted that the steps illustrated in the flowchart of the drawings may be executed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be executed in an order different from the order described here.
Fig. 9 is a flowchart of a sign language action recognition method according to embodiment 5 of the present application, and as shown in fig. 9, the method may include the following steps:
in step S902, an image frame set is obtained by calling a first interface.
The first interface comprises a first parameter, the parameter value of the first parameter is an image frame set, and the image frame set is acquired in the process that the biological object outputs the sign language action to be recognized.
The first interface may be an interface for data interaction between the cloud server and the client; the client may pass the image frame set into an interface function as the first parameter of the interface function, so as to upload the image frame set to the cloud server.
Step S904, performing discretization encoding on the image frames in the image frame set to obtain the discrete features of the sign language actions to be recognized.
The discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized.
And step S906, classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized.
And the recognition result is used for representing the category of the vocabulary to be recognized.
And step S908, outputting the recognition result of the sign language action to be recognized by calling a second interface.
The second interface comprises a second parameter, and the parameter value of the second parameter is the recognition result of the sign language action to be recognized.
The second interface in the above steps may be an interface for data exchange between the cloud server and the client, and the cloud server may transmit the recognition result of the sign language action to be recognized into the interface function as a second parameter of the interface function, so as to achieve the purpose of issuing the recognition result of the sign language action to be recognized to the client.
Through the above steps, an image frame set is obtained by calling a first interface, wherein the first interface comprises a first parameter, the parameter value of the first parameter is the image frame set, and the image frame set is acquired in the process of the biological object outputting the sign language action to be recognized; the image frames in the image frame set are discretized and coded to obtain the discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized; the discrete features of the sign language action to be recognized are classified to obtain the recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and the recognition result of the sign language action to be recognized is output by calling a second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the recognition result of the sign language action to be recognized. That is to say, the image frames in the image frame set can be discretized and coded to obtain the discrete features of the sign language action to be recognized, so that features at the granularity of sign language vocabulary are better modeled, the sign language action to be recognized is recognized more accurately, and the technical problem of low recognition accuracy of sign language actions in the related art is solved.
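A minimal sketch of this two-interface exchange is given below. The embodiment only fixes the parameter each interface carries, so the function names and the stand-in pipeline are assumptions; in practice the two interfaces would sit on either side of a network boundary.

    def _discretize_frames(image_frame_set):
        # Stand-in for the discretization coding step on the cloud server.
        return image_frame_set

    def _classify(discrete_features):
        # Stand-in for the recognition model; returns a vocabulary category.
        return "hello"

    def second_interface(recognition_result):
        # Second parameter: the recognition result issued back to the client.
        return {"result": recognition_result}

    def first_interface(image_frame_set):
        # First parameter: the image frame set uploaded by the client.
        discrete_features = _discretize_frames(image_frame_set)
        return second_interface(_classify(discrete_features))

    print(first_interface(["frame_0", "frame_1"]))  # {'result': 'hello'}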
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 6
According to an embodiment of the present application, there is further provided a device for recognizing a sign language action corresponding to the above method for recognizing a sign language action, fig. 10 is a schematic diagram of a device for recognizing a sign language action according to embodiment 6 of the present application, and as shown in fig. 10, the device 1000 includes: an acquisition module 1002, an encoding module 1004, and a classification module 1006.
The acquisition module is used for acquiring an image frame set in the process of outputting sign language actions to be recognized by a biological object; the coding module is used for carrying out discretization coding on the image frames in the image frame set to obtain the discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; the classification module is used for classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized.
It should be noted here that the above-mentioned acquisition module 1002, encoding module 1004, and classification module 1006 correspond to steps S302 to S306 in embodiment 1; the implementation examples and application scenarios of the three modules are consistent with those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the above modules, as part of the apparatus, may run in the computer terminal provided in embodiment 1.
In the above embodiment of the application, the coding module is configured to perform discretization coding on the image frame by using a discretization coding model to obtain a discretization feature of the gesture language action to be recognized.
In the above embodiments of the present application, the discretized coding model includes: encoder model and discretization vocabulary, wherein, the coding module includes: the device comprises an extraction unit, an operation unit and an acquisition unit.
The extraction unit is used for performing feature extraction on the image frame by using the encoder model to obtain the image features of the image frame; the operation unit is used for performing a discretization operation on the image features of the image frame to obtain intermediate features corresponding to the image features; and the acquisition unit is used for acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the sign language action to be recognized, wherein a preset number of preset features are stored in the discretization vocabulary, and permutations and combinations of different numbers of preset features are used for representing the vocabularies expressed by different sign language actions.
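A rough sketch of the operation and acquisition units follows, assuming PyTorch tensors and a nearest-neighbour rule as the discretization operation; this is one common choice (as in vector-quantization methods), and the application does not commit to a specific rule.

    import torch

    def discretize(image_features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
        # image_features: (T, D) features from the encoder model, one row per frame.
        # codebook:       (K, D) preset features stored in the discretization vocabulary.
        distances = torch.cdist(image_features, codebook)  # (T, K) pairwise distances
        indices = distances.argmin(dim=1)                  # intermediate features
        return codebook[indices]                           # discrete features

    # Toy usage: 16 frames, 64-dim features, a vocabulary of 512 preset features.
    features = torch.randn(16, 64)
    vocabulary = torch.randn(512, 64)
    discrete_features = discretize(features, vocabulary)   # shape (16, 64)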
In the above embodiment of the application, the model parameters of the discretization coding model are adjusted based on a first loss value. The first loss value is constructed based on a training image frame set and a reconstructed image frame set; the reconstructed image frame set is obtained by performing image reconstruction on splicing features with a decoder model; the splicing features are obtained by splicing the image features of a reference image frame in the training image frame set with the discrete features of the training sign language action contained in the training image frame set; the image features of the reference image frame are obtained by performing feature extraction on the reference image frame with a feature extraction model; and the discrete features of the training sign language action are obtained by performing discretization coding on the training image frames in the training image frame set with the discretization coding model.
In the above embodiment of the application, the classification module is configured to classify the discrete features of the sign language action to be recognized by using the recognition model, so as to obtain a recognition result.
In the above embodiments of the present application, the model parameters of the recognition model are adjusted based on a second loss value. The second loss value is determined based on the recognition result of the sign language action and a received feedback result, and the feedback result is obtained by modifying the recognition result after the recognition result is output.
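One way to realize such feedback-driven adjustment is sketched below, assuming a PyTorch classifier and reading the second loss as cross-entropy against the user-corrected category; the application does not specify the loss beyond the description above.

    import torch
    import torch.nn.functional as F

    def update_from_feedback(recognition_model, optimizer,
                             discrete_features, corrected_category):
        # Recognition result for the discrete features; logits are assumed to
        # have shape (1, num_categories) for the whole action.
        logits = recognition_model(discrete_features)
        # Feedback result: the category after the user's correction.
        target = torch.tensor([corrected_category])
        second_loss = F.cross_entropy(logits, target)  # assumed form of the loss
        optimizer.zero_grad()
        second_loss.backward()
        optimizer.step()                               # adjusts the model parameters
        return second_loss.item()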
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 7
According to an embodiment of the present application, there is further provided a training apparatus for implementing a discretized coding model corresponding to the above training method for a discretized coding model, where fig. 11 is a schematic diagram of a training apparatus for a discretized coding model according to embodiment 7 of the present application, and as shown in fig. 11, the apparatus 1100 includes: an obtaining module 1102, an encoding module 1104, a reconstruction module 1106, and an adjusting module 1108.
The acquisition module is used for acquiring a training image frame set, wherein the training image frame set comprises training sign language actions output by the same biological object; the coding module is used for carrying out discretization coding on the training image frames in the training image frame set by using the discretization coding model to obtain the discrete characteristics of the training sign language action; the reconstruction module is used for reconstructing images of the discrete features of the training sign language actions by using the decoder model to obtain a reconstructed image frame set; the adjusting module is used for adjusting model parameters of the discretization coding model and the decoder model based on the training image frame set and the reconstructed image frame set.
It should be noted that the obtaining module 1102, the encoding module 1104, the reconstruction module 1106, and the adjusting module 1108 correspond to steps S702 to S708 in embodiment 2; the implementation examples and application scenarios of the four modules are consistent with those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the above modules, as part of the apparatus, may run in the computer terminal provided in embodiment 1.
In the above embodiments of the present application, the reconstruction module includes: the device comprises an extraction unit, a splicing unit and a reconstruction unit.
The extraction unit is used for extracting the features of a reference image frame in the training image frame set by using the feature extraction model to obtain the image features of the reference image frame; the splicing unit is used for splicing the image characteristics of the reference image frame with the discrete characteristics of the training sign language action to obtain splicing characteristics; and the reconstruction unit is used for reconstructing the image of the splicing characteristics by using the decoder model to obtain a reconstructed image frame set.
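A hedged sketch of these three units follows, assuming PyTorch modules; concatenating along the feature dimension is an illustrative choice, since the embodiment only states that the two kinds of features are spliced.

    import torch

    def reconstruct(reference_frame, feature_extractor, discrete_features, decoder):
        # Extraction unit: image features of the reference frame, shape (1, D).
        ref_features = feature_extractor(reference_frame)
        # Splicing unit: pair the reference features with each of the (T, D)
        # discrete features of the training sign language action.
        spliced = torch.cat(
            [ref_features.expand(discrete_features.size(0), -1), discrete_features],
            dim=-1)                                # (T, 2D) splicing features
        # Reconstruction unit: the decoder model rebuilds the image frames.
        return decoder(spliced)                    # reconstructed image frame set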
In the above embodiments of the present application, the discretized coding model includes: encoder model and discretization vocabulary, wherein, the coding module includes: the device comprises an encoding unit, an operation unit and an acquisition unit.
The encoding unit is used for performing feature extraction on the training image frame by using the encoder model to obtain the image features of the training image frame; the operation unit is used for performing a discretization operation on the image features of the training image frame to obtain intermediate features corresponding to the image features; and the acquisition unit is used for acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the training sign language action, wherein a preset number of preset features are stored in the discretization vocabulary, and permutations and combinations of different numbers of preset features are used for representing the vocabularies expressed by different sign language actions.
In the above embodiments of the present application, the adjusting module includes: the device comprises a construction unit and an adjusting unit.
The construction unit is used for constructing a loss value of a preset loss function based on the training image frame set and the reconstructed image frame set; the adjusting unit is used for adjusting model parameters of the discretization coding model and the decoder model based on the loss value.
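A minimal training step under this scheme might look as follows, assuming PyTorch and an L2 reconstruction loss as the preset loss function; the application does not name the loss beyond this construction.

    import torch.nn.functional as F

    def training_step(training_frames, reconstructed_frames, optimizer):
        # Construction unit: loss value built from the training image frame set
        # and the reconstructed image frame set (L2 loss assumed here).
        loss = F.mse_loss(reconstructed_frames, training_frames)
        # Adjusting unit: the optimizer is assumed to hold the parameters of
        # both the discretization coding model and the decoder model.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Note that if the discretization operation is a hard nearest-neighbour lookup, as in the earlier sketch, a straight-through gradient estimator or a similar technique would be needed for this loss to update the encoder parameters through that step.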
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 8
According to an embodiment of the present application, there is further provided a sign language action recognition device corresponding to the above sign language action recognition method. Fig. 12 is a schematic diagram of a sign language action recognition device according to embodiment 8 of the present application; as shown in fig. 12, the device 1200 includes: a first display module 1202 and a second display module 1204.
The first display module is used for responding to an input instruction acting on the operation interface and displaying an image frame set on the operation interface, wherein the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; the second display module is used for responding to a recognition instruction acting on the operation interface and displaying a recognition result of the to-be-recognized sign language action on the operation interface, wherein the recognition result is used for representing the category of the to-be-recognized vocabulary, the recognition result is obtained by classifying discrete features of the to-be-recognized sign language action, the discrete features of the to-be-recognized sign language action are obtained by performing discretization coding on image frames in the image frame set, and the discrete features are used for representing the features of the to-be-recognized vocabulary represented by the to-be-recognized sign language action.
It should be noted that the first display module 1202 and the second display module 1204 correspond to steps S802 to S804 in embodiment 3; the implementation examples and application scenarios of the two modules are consistent with those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the above modules, as part of the apparatus, may run in the computer terminal provided in embodiment 1.
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 9
According to an embodiment of the present application, there is further provided a device for recognizing a sign language action corresponding to the above method for recognizing a sign language action, and fig. 13 is a schematic diagram of a device for recognizing a sign language action according to embodiment 9 of the present application, and as shown in fig. 13, the device 1300 includes: a presentation module 1302, an encoding module 1304, a classification module 1306, a drive module 1308.
The display module is used for displaying an image frame set on a display picture of the virtual reality VR device or the augmented reality AR device, wherein the image frame set is acquired in the process of outputting sign language actions to be recognized by biological objects; the coding module is used for carrying out discretization coding on the image frames in the image frame set to obtain the discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; the classification module is used for classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; the driving module is used for driving the VR equipment or the AR equipment to render and display the recognition result of the sign language action to be recognized.
It should be noted that the above-mentioned presentation module 1302, encoding module 1304, classification module 1306, and driving module 1308 correspond to steps S902 to S908 in embodiment 5; the implementation examples and application scenarios of the four modules are consistent with those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the above modules, as part of the apparatus, may run in the computer terminal provided in embodiment 1.
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 10
According to an embodiment of the present application, there is further provided a device for recognizing a sign language action corresponding to the method for recognizing a sign language action, and fig. 14 is a schematic diagram of the device for recognizing a sign language action according to embodiment 10 of the present application, as shown in fig. 14, the device 1400 includes: an acquisition module 1402, an encoding module 1404, a classification module 1406, and an output module 1408.
The acquisition module is used for acquiring an image frame set by calling a first interface, wherein the first interface comprises a first parameter, the parameter value of the first parameter is the image frame set, and the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; the coding module is used for carrying out discretization coding on the image frames in the image frame set to obtain the discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; the classification module is used for classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; the output module is used for outputting the recognition result of the sign language action to be recognized by calling the second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the recognition result of the sign language action to be recognized.
It should be noted here that the above-mentioned obtaining module 1402, encoding module 1404, classification module 1406, and output module 1408 correspond to steps S1002 to S1008 in embodiment 6; the implementation examples and application scenarios of the four modules are consistent with those of the corresponding steps, but are not limited to the disclosure of embodiment 1. It should be noted that the above modules, as part of the apparatus, may run in the computer terminal provided in embodiment 1.
It should be noted that the preferred implementations described in the above examples of the present application are consistent with the schemes, application scenarios, and implementation procedures provided in example 1, but are not limited to the schemes provided in example 1.
Example 11
Embodiments of the present application may provide an electronic device, which may be any one of a group of electronic devices. Optionally, in this embodiment, the electronic device may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the electronic device may execute program codes of the following steps in the method for recognizing sign language actions: collecting an image frame set in the process of outputting sign language actions to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized.
Alternatively, fig. 15 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 15, the computer terminal a may include: one or more processors (only one shown), memory.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the sign language action recognition method and apparatus in the embodiments of the present application, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implementing the above-mentioned sign language action recognition method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located from the processor, which may be connected to terminal a through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: collecting an image frame set in the process that a biological object outputs a sign language action to be recognized; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized.
Optionally, the processor may further execute the program code of the following steps: and carrying out discretization coding on the image frame by using the discretization coding model to obtain the discrete characteristics of the sign language action to be recognized.
Optionally, the processor may further execute the program code of the following steps: performing feature extraction on the image frame by using the encoder model to obtain the image features of the image frame; performing a discretization operation on the image features of the image frame to obtain intermediate features corresponding to the image features; and acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the sign language action to be recognized, wherein a preset number of preset features are stored in the discretization vocabulary, and permutations and combinations of different numbers of preset features are used for representing the vocabularies expressed by different sign language actions.
Optionally, the processor may further execute the program code of the following steps: adjusting the model parameters of the discretization coding model based on a first loss value, wherein the first loss value is constructed based on a training image frame set and a reconstructed image frame set; the reconstructed image frame set is obtained by performing image reconstruction on splicing features with a decoder model; the splicing features are obtained by splicing the image features of a reference image frame in the training image frame set with the discrete features of the training sign language action contained in the training image frame set; the image features of the reference image frame are obtained by performing feature extraction on the reference image frame with a feature extraction model; and the discrete features of the training sign language action are obtained by performing discretization coding on the training image frames in the training image frame set with the discretization coding model.
Optionally, the processor may further execute the program code of the following steps: and classifying the discrete features of the sign language action to be recognized by using the recognition model to obtain a recognition result.
Optionally, the processor may further execute the program code of the following steps: and adjusting the model parameters of the recognition model based on a second loss value, wherein the second loss value is determined based on the recognition result of the sign language action and the received feedback result, and the feedback result is obtained by modifying the recognition result after the recognition result is output.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring a training image frame set, wherein the training image frame set comprises training sign language actions output by the same biological object; performing discretization coding on the training image frames in the training image frame set by using a discretization coding model to obtain the discrete features of the training sign language action; performing image reconstruction on the discrete features of the training sign language action by using a decoder model to obtain a reconstructed image frame set; and adjusting model parameters of the discretization coding model and the decoder model based on the training image frame set and the reconstructed image frame set.
Optionally, the processor may further execute the program code of the following steps: performing feature extraction on a reference image frame in the training image frame set by using a feature extraction model to obtain the image features of the reference image frame; splicing the image features of the reference image frame with the discrete features of the training sign language action to obtain splicing features; and performing image reconstruction on the splicing features by using a decoder model to obtain a reconstructed image frame set.
Optionally, the processor may further execute the program code of the following steps: performing feature extraction on the training image frame by using an encoder model to obtain the image features of the training image frame; performing a discretization operation on the image features of the training image frame to obtain intermediate features corresponding to the image features; and acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the training sign language action, wherein a preset number of preset features are stored in the discretization vocabulary, and permutations and combinations of different numbers of preset features are used for representing the vocabularies expressed by different sign language actions.
Optionally, the processor may further execute the program code of the following steps: constructing a loss value of a preset loss function based on the training image frame set and the reconstructed image frame set; and adjusting model parameters of the discretization coding model and the decoder model based on the loss value.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: responding to an input instruction acting on the operation interface, and displaying an image frame set on the operation interface, wherein the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; responding to a recognition instruction acting on an operation interface, and displaying a recognition result of the sign language action to be recognized on the operation interface, wherein the recognition result is used for representing the category of the vocabulary to be recognized, the recognition result is obtained by classifying discrete features of the sign language action to be recognized, the discrete features of the sign language action to be recognized are obtained by carrying out discretization coding on image frames in an image frame set, and the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: displaying an image frame set on a presentation picture of Virtual Reality (VR) equipment or Augmented Reality (AR) equipment, wherein the image frame set is acquired in the process of outputting sign language actions to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized, which are represented by the sign language action to be recognized; classifying discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and driving the VR equipment or the AR equipment to render and display the recognition result of the sign language action to be recognized.
The processor can call the information and application program stored in the memory through the transmission device to execute the following steps: acquiring an image frame set by calling a first interface, wherein the first interface comprises a first parameter, the parameter value of the first parameter is the image frame set, and the image frame set is acquired in the process of the biological object outputting the sign language action to be recognized; discretizing and coding the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized; classifying the discrete features of the sign language action to be recognized to obtain the recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and outputting the recognition result of the sign language action to be recognized by calling a second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the recognition result of the sign language action to be recognized.
In the embodiment of the application, firstly, an image frame set in the process of the biological object outputting the sign language action to be recognized is collected; discretization coding is performed on the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized; and the discrete features of the sign language action to be recognized are classified to obtain the recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized, so that the accuracy of the recognition result of the sign language action to be recognized is improved. It is easy to notice that the image frames in the image frame set can be discretized and coded to obtain the discrete features of the sign language action to be recognized, and features at the granularity of sign language vocabulary can be better modeled, so that the sign language action to be recognized is recognized more accurately, and the technical problem of low recognition accuracy of sign language actions in the related art is solved.
It can be understood by those skilled in the art that the structure shown in fig. 15 is only an illustration, and the computer terminal may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 15 does not limit the structure of the above electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 15, or have a different configuration from that shown in fig. 15.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Example 12
Embodiments of the present application also provide a computer-readable storage medium. Alternatively, in this embodiment, the computer-readable storage medium may be configured to store the program code executed by the sign language action recognition method provided in embodiment 1.
Optionally, in this embodiment, the computer-readable storage medium may be located in any one computer terminal in an AR/VR device terminal group in an AR/VR device network, or in any one mobile terminal in a mobile terminal group.
Optionally, in this embodiment, the computer readable storage medium is configured to store program code for performing the following steps: collecting an image frame set in the process of outputting sign language actions to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized.
Optionally, the storage medium is further configured to store program code for performing the following steps: and carrying out discretization coding on the image frame by using the discretization coding model to obtain the discrete characteristics of the sign language action to be recognized.
Optionally, the storage medium is further configured to store program code for performing the following steps: performing feature extraction on the image frame by using an encoder model to obtain the image features of the image frame; performing a discretization operation on the image features of the image frame to obtain intermediate features corresponding to the image features; and acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the sign language action to be recognized, wherein a preset number of preset features are stored in the discretization vocabulary, and permutations and combinations of different numbers of preset features are used for representing the vocabularies expressed by different sign language actions.
Optionally, the storage medium is further configured to store program code for performing the following steps: adjusting the model parameters of the discretization coding model based on a first loss value, wherein the first loss value is constructed based on a training image frame set and a reconstructed image frame set; the reconstructed image frame set is obtained by performing image reconstruction on splicing features with a decoder model; the splicing features are obtained by splicing the image features of a reference image frame in the training image frame set with the discrete features of the training sign language action contained in the training image frame set; the image features of the reference image frame are obtained by performing feature extraction on the reference image frame with a feature extraction model; and the discrete features of the training sign language action are obtained by performing discretization coding on the training image frames in the training image frame set with the discretization coding model.
Optionally, the storage medium is further configured to store program code for performing the following steps: and classifying the discrete features of the sign language action to be recognized by using the recognition model to obtain a recognition result.
Optionally, the storage medium is further configured to store program code for performing the following steps: and adjusting the model parameters of the recognition model based on a second loss value, wherein the second loss value is determined based on the recognition result of the sign language action and the received feedback result, and the feedback result is obtained by modifying the recognition result after the recognition result is output.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a training image frame set, wherein the training image frame set comprises training sign language actions output by the same biological object; performing discretization coding on the training image frames in the training image frame set by using a discretization coding model to obtain the discrete features of the training sign language action; performing image reconstruction on the discrete features of the training sign language action by using a decoder model to obtain a reconstructed image frame set; and adjusting model parameters of the discretization coding model and the decoder model based on the training image frame set and the reconstructed image frame set.
Optionally, the storage medium is further configured to store program code for performing the following steps: performing feature extraction on a reference image frame in the training image frame set by using a feature extraction model to obtain the image features of the reference image frame; splicing the image features of the reference image frame with the discrete features of the training sign language action to obtain splicing features; and performing image reconstruction on the splicing features by using a decoder model to obtain a reconstructed image frame set.
Optionally, the storage medium is further configured to store program code for performing the following steps: performing feature extraction on the training image frame by using an encoder model to obtain the image features of the training image frame; performing a discretization operation on the image features of the training image frame to obtain intermediate features corresponding to the image features; and acquiring preset features corresponding to the intermediate features from the discretization vocabulary to obtain the discrete features of the training sign language action, wherein a preset number of preset features are stored in the discretization vocabulary, and permutations and combinations of different numbers of preset features are used for representing the vocabularies expressed by different sign language actions.
Optionally, the storage medium is further configured to store program code for performing the following steps: constructing a loss value of a preset loss function based on the training image frame set and the reconstructed image frame set; and adjusting model parameters of the discretization coding model and the decoder model based on the loss value.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: responding to an input instruction acting on the operation interface, and displaying an image frame set on the operation interface, wherein the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; responding to a recognition instruction acting on an operation interface, and displaying a recognition result of the sign language action to be recognized on the operation interface, wherein the recognition result is used for representing the category of the vocabulary to be recognized, the recognition result is obtained by classifying discrete features of the sign language action to be recognized, the discrete features of the sign language action to be recognized are obtained by carrying out discretization coding on image frames in an image frame set, and the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: displaying an image frame set on a presentation picture of Virtual Reality (VR) equipment or Augmented Reality (AR) equipment, wherein the image frame set is acquired in the process of outputting sign language actions to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; classifying discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and driving the VR equipment or the AR equipment to render and display the recognition result of the sign language action to be recognized.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring an image frame set by calling a first interface, wherein the first interface comprises a first parameter, the parameter value of the first parameter is the image frame set, and the image frame set is acquired in the process of outputting a sign language action to be recognized by a biological object; discretizing and coding the image frames in the image frame set to obtain discrete characteristics of the sign language action to be recognized, wherein the discrete characteristics are used for representing the characteristics of the vocabulary to be recognized represented by the sign language action to be recognized; classifying discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized; and outputting the recognition result of the sign language action to be recognized by calling a second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the recognition result of the sign language action to be recognized.
In the embodiment of the application, firstly, an image frame set in the process of the biological object outputting the sign language action to be recognized is collected; discretization coding is performed on the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized; and the discrete features of the sign language action to be recognized are classified to obtain the recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized, so that the accuracy of the recognition result of the sign language action to be recognized is improved. It is easy to notice that the image frames in the image frame set can be discretized and coded to obtain the discrete features of the sign language action to be recognized, and features at the granularity of sign language vocabulary can be better modeled, so that the sign language action to be recognized is recognized more accurately, and the technical problem of low recognition accuracy of sign language actions in the related art is solved.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other various media capable of storing program code.
The foregoing is only a preferred embodiment of the present application, and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application; these improvements and modifications should also be considered as falling within the protection scope of the present application.

Claims (14)

1. A sign language action recognition method is characterized by comprising the following steps:
collecting an image frame set in the process of outputting sign language actions to be recognized by a biological object;
discretizing and coding the image frames in the image frame set to obtain discrete features of the sign language action to be recognized, wherein the discrete features are used for representing the features of the vocabulary to be recognized represented by the sign language action to be recognized;
and classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result is used for representing the category of the vocabulary to be recognized.
2. The method according to claim 1, wherein performing discretization coding on the image frames in the image frame set to obtain the discrete features of the sign language action to be recognized comprises:
performing discretization coding on the image frames by using a discretization coding model to obtain the discrete features of the sign language action to be recognized.
3. The method according to claim 2, wherein the discretization coding model comprises an encoder model and a discretization word list, and performing discretization coding on the image frames by using the discretization coding model to obtain the discrete features of the sign language action to be recognized comprises:
performing feature extraction on the image frames by using the encoder model to obtain image features of the image frames;
discretizing the image features of the image frames to obtain intermediate features corresponding to the image features; and
acquiring, from the discretization word list, preset features corresponding to the intermediate features to obtain the discrete features of the sign language action to be recognized, wherein the discretization word list stores a preset number of preset features, and permutations and combinations of different numbers of the preset features represent vocabularies expressed by different sign language actions.
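The encoder-plus-word-list structure of claim 3 matches the familiar vector-quantization pattern, so a hedged sketch could look as follows; the codebook size of 512 entries, the 256-dimensional features, and the straight-through trick in the last line are all implementation assumptions that the claim itself does not specify.

```python
import torch
import torch.nn as nn

class DiscretizationCoder(nn.Module):
    """Encoder model plus discretization word list, in the sense of claim 3."""

    def __init__(self, encoder: nn.Module, num_presets: int = 512, dim: int = 256):
        super().__init__()
        self.encoder = encoder                           # any frame-level backbone
        self.word_list = nn.Embedding(num_presets, dim)  # preset number of preset features

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(frames)                       # (T, dim) image features
        # Discretize: the intermediate feature of each frame is the index of
        # its nearest preset feature in the word list.
        dists = torch.cdist(feats, self.word_list.weight)  # (T, num_presets)
        idx = dists.argmin(dim=1)                          # (T,) intermediate features
        presets = self.word_list(idx)                      # (T, dim) discrete features
        # Straight-through estimator so gradients can reach the encoder when
        # this module is trained (a common choice, not part of the claim).
        return feats + (presets - feats).detach()
```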
4. The method according to claim 2, wherein model parameters of the discretization coding model are adjusted based on a first loss value, the first loss value is constructed based on a training image frame set and a reconstructed image frame set, the reconstructed image frame set is obtained by performing image reconstruction on a spliced feature by using a decoder model, the spliced feature is obtained by splicing image features of a reference image frame in the training image frame set with discrete features of a training sign language action contained in the training image frame set, the image features of the reference image frame are obtained by performing feature extraction on the reference image frame by using a feature extraction model, and the discrete features of the training sign language action are obtained by performing discretization coding on training image frames in the training image frame set by using the discretization coding model.
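Read procedurally, claim 4 builds the first loss value roughly as sketched below; the mean-squared-error comparison is an assumption (the claim only requires that the loss be constructed from the two frame sets), and `feat_extractor` and `decoder` are hypothetical stand-ins for the feature extraction model and the decoder model.

```python
import torch
import torch.nn.functional as F

def first_loss(train_frames, ref_frame, coder, feat_extractor, decoder):
    """train_frames: (T, C, H, W) training image frame set;
    ref_frame: a single reference image frame from that set."""
    discrete = coder(train_frames)          # discrete features of the training action
    ref_feat = feat_extractor(ref_frame)    # (D,) image features of the reference frame
    T = discrete.shape[0]
    # Splice the reference-frame features onto each frame's discrete feature.
    spliced = torch.cat([ref_feat.expand(T, -1), discrete], dim=1)
    recon = decoder(spliced)                # reconstructed image frame set
    return F.mse_loss(recon, train_frames)  # the first loss value (MSE is assumed)
```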
5. The method according to claim 1, wherein classifying the discrete features of the sign language action to be recognized to obtain the recognition result of the sign language action to be recognized comprises:
classifying the discrete features of the sign language action to be recognized by using a recognition model to obtain the recognition result.
6. The method according to claim 5, wherein model parameters of the recognition model are adjusted based on a second loss value, the second loss value is determined based on the recognition result of the sign language action and a received feedback result, and the feedback result is obtained by correcting the recognition result after the recognition result is output.
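Claim 6's feedback mechanism amounts to online fine-tuning: once a displayed recognition result is corrected, the corrected label supervises the recognition model. In the sketch below, cross-entropy and the SGD update are assumptions; the claim only requires that the second loss value be determined from the recognition result and the feedback result.

```python
import torch
import torch.nn.functional as F

def feedback_update(classifier, discrete_feats, corrected_label, lr: float = 1e-4):
    """corrected_label: (1,) tensor holding the category fed back by the user;
    assumes the recognition model returns (1, num_vocab) logits."""
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr)
    logits = classifier(discrete_feats)
    second_loss = F.cross_entropy(logits, corrected_label)  # the second loss value
    optimizer.zero_grad()
    second_loss.backward()
    optimizer.step()  # adjusts the recognition model's parameters
    return second_loss.item()
```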
7. A method for training a discretization coding model, characterized by comprising:
acquiring a training image frame set, wherein the training image frame set contains training sign language actions output by a same biological object;
performing discretization coding on training image frames in the training image frame set by using the discretization coding model to obtain discrete features of the training sign language actions;
performing image reconstruction on the discrete features of the training sign language actions by using a decoder model to obtain a reconstructed image frame set; and
adjusting model parameters of the discretization coding model and the decoder model based on the training image frame set and the reconstructed image frame set.
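Claim 7's training method, with the splicing detail that claim 8 below adds, reduces to a standard reconstruction-driven optimization step; a hedged sketch, reusing `first_loss` from the claim-4 sketch above (the Adam optimizer is an assumption), might look like this:

```python
import torch

def train_step(train_frames, ref_frame, coder, feat_extractor, decoder, optimizer):
    loss = first_loss(train_frames, ref_frame, coder, feat_extractor, decoder)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # jointly adjusts the discretization coding model and the decoder
    return loss.item()

# Usage sketch:
# optimizer = torch.optim.Adam(
#     list(coder.parameters()) + list(decoder.parameters()), lr=3e-4)
```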
8. The method according to claim 7, wherein performing image reconstruction on the discrete features of the training sign language actions by using the decoder model to obtain the reconstructed image frame set comprises:
performing feature extraction on a reference image frame in the training image frame set by using a feature extraction model to obtain image features of the reference image frame;
splicing the image features of the reference image frame with the discrete features of the training sign language actions to obtain a spliced feature; and
performing image reconstruction on the spliced feature by using the decoder model to obtain the reconstructed image frame set.
9. The method according to claim 7, wherein the discretization coding model comprises an encoder model and a discretization word list, and performing discretization coding on the training image frames in the training image frame set by using the discretization coding model to obtain the discrete features of the training sign language actions comprises:
performing feature extraction on the training image frames by using the encoder model to obtain image features of the training image frames;
discretizing the image features of the training image frames to obtain intermediate features corresponding to the image features; and
acquiring, from the discretization word list, preset features corresponding to the intermediate features to obtain the discrete features of the training sign language actions, wherein the discretization word list stores a preset number of preset features, and permutations and combinations of different numbers of the preset features represent vocabularies expressed by different sign language actions.
10. A sign language action recognition method, characterized by comprising:
displaying, in response to an input instruction acting on an operation interface, an image frame set on the operation interface, wherein the image frame set is collected while a biological object outputs a sign language action to be recognized; and
displaying, in response to a recognition instruction acting on the operation interface, a recognition result of the sign language action to be recognized on the operation interface, wherein the recognition result represents a category of a vocabulary to be recognized, the recognition result is obtained by classifying discrete features of the sign language action to be recognized, the discrete features are obtained by performing discretization coding on the image frames in the image frame set, and the discrete features represent features of the vocabulary to be recognized expressed by the sign language action to be recognized.
11. A sign language action recognition method, characterized by comprising:
displaying an image frame set on a presentation screen of a virtual reality (VR) device or an augmented reality (AR) device, wherein the image frame set is collected while a biological object outputs a sign language action to be recognized;
performing discretization coding on the image frames in the image frame set to obtain discrete features of the sign language action to be recognized, wherein the discrete features represent features of a vocabulary to be recognized expressed by the sign language action to be recognized;
classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result represents a category of the vocabulary to be recognized; and
driving the VR device or the AR device to render and display the recognition result of the sign language action to be recognized.
12. A sign language action recognition method, characterized by comprising:
acquiring an image frame set by calling a first interface, wherein the first interface comprises a first parameter whose parameter value is the image frame set, and the image frame set is collected while a biological object outputs a sign language action to be recognized;
performing discretization coding on the image frames in the image frame set to obtain discrete features of the sign language action to be recognized, wherein the discrete features represent features of a vocabulary to be recognized expressed by the sign language action to be recognized;
classifying the discrete features of the sign language action to be recognized to obtain a recognition result of the sign language action to be recognized, wherein the recognition result represents a category of the vocabulary to be recognized; and
outputting the recognition result of the sign language action to be recognized by calling a second interface, wherein the second interface comprises a second parameter whose parameter value is the recognition result of the sign language action to be recognized.
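The two-interface pattern of claim 12 is essentially a thin service wrapper around the method of claim 1; the following sketch reuses `recognize_sign_action` from the claim-1 sketch, and the dict-based parameter passing, function names, and key names are all hypothetical.

```python
def first_interface(request: dict):
    # The first parameter: its value is the collected image frame set.
    return request["image_frame_set"]

def second_interface(response: dict) -> None:
    # The second parameter: its value is the recognition result.
    print(response["recognition_result"])

def handle_request(request: dict, discretizer, classifier) -> None:
    frames = first_interface(request)
    result = recognize_sign_action(frames, discretizer, classifier)
    second_interface({"recognition_result": result})
```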
13. A computer-readable storage medium, comprising a stored program, wherein, when the program runs, a device in which the computer-readable storage medium is located is controlled to perform the method according to any one of claims 1 to 12.
14. An electronic device, comprising:
an acquisition device configured to collect an image frame set while a biological object outputs a sign language action to be recognized; and
a processor configured to run a program, wherein, when run, the program performs the method according to any one of claims 1 to 12 on data output from the acquisition device.
CN202211304075.XA, priority date 2022-10-24, filing date 2022-10-24: Sign language action recognition method and discretization coding model training method (status: Pending; published as CN115937968A (en))

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211304075.XA CN115937968A (en) 2022-10-24 2022-10-24 Sign language action recognition method and discretization coding model training method

Publications (1)

Publication Number Publication Date
CN115937968A (en) 2023-04-07

Family

ID=86647993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211304075.XA Pending CN115937968A (en) 2022-10-24 2022-10-24 Sign language action recognition method and discretization coding model training method

Country Status (1)

Country Link
CN (1) CN115937968A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097323A (en) * 2024-04-22 2024-05-28 阿里巴巴达摩院(杭州)科技有限公司 Training method of autoregressive generating model, image processing method and electronic equipment
CN118097323B (en) * 2024-04-22 2024-10-18 阿里巴巴达摩院(杭州)科技有限公司 Training method of autoregressive generating model, image processing method and electronic equipment

Similar Documents

Publication Publication Date Title
US12067690B2 (en) Image processing method and apparatus, device, and storage medium
CN110968736B (en) Video generation method and device, electronic equipment and storage medium
EP3657300A2 (en) Real-time gesture detection and recognition
CN110991427A (en) Emotion recognition method and device for video and computer equipment
US10678855B2 (en) Generating descriptive text contemporaneous to visual media
CN113766299B (en) Video data playing method, device, equipment and medium
US20220405524A1 (en) Optical character recognition training with semantic constraints
US20220067546A1 (en) Visual question answering using model trained on unlabeled videos
JP2024536014A (en) Optimizing Lip Sync for Natural Language Translation Video
CN115937968A (en) Sign language action recognition method and discretization coding model training method
WO2024153010A1 (en) Method for generating virtual image, and storage medium
CN116704405B (en) Behavior recognition method, electronic device and storage medium
CN115563334A (en) Method and processor for processing image-text data
CN114092608B (en) Expression processing method and device, computer readable storage medium and electronic equipment
CN117808934A (en) Data processing method and related equipment
CN114579806B (en) Video detection method, storage medium and processor
US11340763B2 (en) Non-linear navigation of videos
CN113704544A (en) Video classification method and device, electronic equipment and storage medium
CN115914511A (en) Method for generating limb movement, computer-readable storage medium and electronic device
CN117290534B (en) Method and device for generating story album and electronic equipment
CN118227910B (en) Media resource aggregation method, device, equipment and storage medium
CN114666307B (en) Conference interaction method, conference interaction device, equipment and storage medium
CN117373455B (en) Audio and video generation method, device, equipment and storage medium
CN115937368A (en) Virtual character generation method and video identification method
CN115713938A (en) Confidence estimation method for speech recognition, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination