CN114821807A - Sign language identification method and device and server - Google Patents


Info

Publication number
CN114821807A
CN114821807A
Authority
CN
China
Prior art keywords
target
sign language
processing
preset
illumination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210560752.8A
Other languages
Chinese (zh)
Inventor
李婧蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210560752.8A
Publication of CN114821807A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The specification provides a sign language recognition method, device, and server, applied in the technical field of artificial intelligence. Before deployment, a preset sign language recognition model suitable for processing images captured in complex low-illumination environments can be obtained through pre-training; the preset sign language recognition model comprises at least a low-illumination processing network layer, which corrects the illumination component and the reflection component of the joint-point heat maps in a fully-connected graph to obtain a corrected fully-connected graph. At recognition time, after target image data containing the sign language actions of a target user is acquired, a corresponding target heat map comprising multiple heat-map frames is first constructed from the target image data; the target heat map is processed to obtain multiple fully-connected graphs; and the fully-connected graphs are processed using the preset sign language recognition model to obtain a target semantic recognition result, from which the target semantic content expressed by the target user's sign language actions can be accurately determined.

Description

Sign language identification method and device and server
Technical Field
This specification relates to the technical field of artificial intelligence, and in particular to a sign language recognition method, device, and server.
Background
In interactive scenarios based on mobile phone applications, users sometimes need to express statements or ask questions through sign language actions.
However, image data captured by a mobile phone in a complex low-illumination environment is often of poor quality, and semantic recognition of the sign language actions in such image data by existing methods is error-prone and of low accuracy.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The specification provides a sign language identification method, a sign language identification device and a server, which can perform relatively accurate semantic identification on sign language actions in image data under a complex low-illumination environment, obtain a semantic identification result with relatively high accuracy and reduce identification errors.
The present specification provides a sign language identification method, including:
acquiring target image data containing sign language actions of a target user;
constructing a corresponding target heat map from the target image data; wherein the target heat map comprises multiple heat-map frames;
processing the target heat map to obtain multiple fully-connected graphs;
processing the multiple fully-connected graphs using a preset sign language recognition model to obtain a target semantic recognition result; wherein the preset sign language recognition model comprises at least a low-illumination processing network layer, which is used to correct the illumination component and the reflection component of the joint-point heat maps in a fully-connected graph to obtain a corrected fully-connected graph;
and determining, according to the target semantic recognition result, the target semantic content expressed by the sign language actions of the target user.
In one embodiment, processing the target heat map to obtain multiple fully-connected graphs comprises:
filtering the heat-map frames contained in the target heat map with a Gaussian filter function to obtain peak data of the heat maps;
and processing the heat-map frames contained in the target heat map together with their peak data using a preset heat-map processing model to obtain multiple fully-connected graphs.
In one embodiment, the preset heat-map processing model comprises a joint heat map detector and a spatial submodel connected in series; the joint heat map detector determines and outputs a corresponding joint heat map from an input heat-map frame and its peak data, and the spatial submodel constructs and outputs a corresponding fully-connected graph from the input joint heat map.
In one embodiment, the joint point heat map detector comprises a model structure trained based on a deep neural network model; the spatial submodel comprises a model structure obtained based on training of an improved Markov random field spatial model.
In one embodiment, the low-illumination processing network layer comprises at least: a decomposition submodel, an adjustment submodel, and a reconstruction submodel. The decomposition submodel decomposes the illumination component and the reflection component from the fully-connected graph; the adjustment submodel corrects the two components and outputs the corrected illumination component and the corrected reflection component; and the reconstruction submodel rebuilds the fully-connected graph from the corrected components to obtain the corrected fully-connected graph.
In one embodiment, the preset sign language recognition model comprises a model structure trained based on YOLOv5.
In one embodiment, the preset sign language recognition model is further connected with a classifier; the classifier is also respectively connected with a top view sub-model and a bottom view sub-model.
In one embodiment, the method further comprises:
acquiring sample data; the sample data comprises a sample image which is collected under a low-illumination environment and contains a gesture and a sample image which is collected under a normal-illumination environment and contains a gesture;
processing the sample data according to a preset processing rule to obtain a training sample set;
constructing an initial sign language recognition model; wherein the initial sign language recognition model at least comprises an initial low-illumination processing network layer;
and training the initial sign language recognition model by using the training sample set to obtain a preset sign language recognition model.
In one embodiment, processing the sample data according to a preset processing rule to obtain a training sample set includes:
carrying out preset data enhancement processing on the sample image to obtain a sample image subjected to data enhancement processing;
marking the positions of the joint points, the connection relation among the joint points and the semantic content of the sample image in the sample image subjected to data enhancement processing to obtain a marked sample image;
and combining the marked sample images to obtain a training sample set.
In one embodiment, the preset data enhancement process includes at least one of: image flipping processing, image scaling processing and image rotation processing.
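As an illustration of the preset data enhancement above, the sketch below applies flipping, scaling, and rotation to a single sample frame with plain NumPy; the function name and the specific transforms (strided down-scaling, 90-degree rotation) are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def augment(frame: np.ndarray) -> list[np.ndarray]:
    """Return flipped, scaled, and rotated variants of one H x W sample frame."""
    flipped = frame[:, ::-1]   # image flipping (horizontal mirror)
    scaled = frame[::2, ::2]   # image scaling (naive 2x down-scale by striding)
    rotated = np.rot90(frame)  # image rotation (90 degrees counter-clockwise)
    return [flipped, scaled, rotated]
```

Each variant would then be annotated with joint-point positions, joint connections, and semantic content before being combined into the training sample set.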
In one embodiment, after obtaining the target semantic content, the method further comprises:
determining a target text for replying a target user according to the target semantic content;
and conveying the semantic content represented by the target text to a target user.
In one embodiment, communicating semantic content characterized by the target text to a target user comprises:
inquiring a preset video database, and determining a sign language action video matched with the target text as a target sign language video; displaying the target sign language video to the target user;
and/or,
inquiring a preset audio database, and determining a voice audio matched with the target text as a target voice audio; and playing the target voice audio to the target user.
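The reply step above can be sketched as a lookup against the preset databases. Here both databases are modeled as plain dictionaries keyed by target text; all names and the dictionary representation are hypothetical.

```python
def reply(target_text: str, video_db: dict, audio_db: dict) -> dict:
    """Query the preset video and audio databases for material matching the
    target text, returning whichever of the two (or both) is available."""
    out = {}
    if target_text in video_db:
        out["target_sign_video"] = video_db[target_text]   # displayed to the user
    if target_text in audio_db:
        out["target_voice_audio"] = audio_db[target_text]  # played to the user
    return out
```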
This specification also provides a sign language recognition apparatus including:
the acquisition module is used for acquiring target image data containing sign language actions of a target user;
the construction module is used for constructing and obtaining a corresponding target thermal map according to the target image data; wherein the target thermodynamic map comprises a plurality of frames of thermodynamic diagrams;
the first processing module is used for processing the target heat map to obtain multiple fully-connected graphs;
the second processing module is used for processing the fully-connected graphs using a preset sign language recognition model to obtain a target semantic recognition result; the preset sign language recognition model comprises at least a low-illumination processing network layer, which corrects the illumination component and the reflection component of the joint-point heat maps in a fully-connected graph to obtain a corrected fully-connected graph;
and the determining module is used for determining the target semantic content represented by the sign language action of the target user according to the target semantic recognition result.
The present specification also provides a server comprising a processor and a memory for storing processor-executable instructions, the processor performing the steps associated with the sign language identification method.
The present specification also provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, perform the steps associated with the sign language identification method.
Based on the sign language recognition method, device, and server provided in this specification, a preset sign language recognition model suitable for processing images captured in complex low-illumination environments can be obtained in advance through pre-training; the model comprises at least a low-illumination processing network layer, which corrects the illumination component and the reflection component of the joint-point heat maps in a fully-connected graph to obtain a corrected fully-connected graph of higher quality. At recognition time, after target image data containing the sign language actions of a target user is acquired, a corresponding target heat map containing multiple heat-map frames is first constructed from the target image data; the target heat map is processed to obtain multiple fully-connected graphs; and the fully-connected graphs are processed with the preset sign language recognition model to obtain a target semantic recognition result, from which the target semantic content expressed by the target user's sign language actions is determined.
In this way, on the one hand, processing the target image data through heat-map regression makes full use of the spatial information in the images and learns rich joint-point features, so that image data with a complex background can be handled well. On the other hand, the trained low-illumination processing network layer in the preset sign language recognition model performs illumination correction on each fully-connected graph based on its illumination and reflection components, yielding illumination-enhanced fully-connected graphs of higher quality; extracting, processing, and recognizing image features from these corrected graphs allows image data captured in low-illumination environments to be handled well. As a result, sign language actions in poor-quality image data captured in complex low-illumination environments can be recognized semantically with higher accuracy and fewer errors. Furthermore, after sample data is obtained, preset data enhancement is applied to expand the limited samples into a richer, augmented training sample set; training the model on this set strengthens its generalization ability and robustness, producing a preset sign language recognition model with relatively higher precision and better performance.
Drawings
In order to more clearly illustrate the embodiments of the present specification, the drawings needed to be used in the embodiments will be briefly described below, and the drawings in the following description are only some of the embodiments described in the specification, and it is obvious to those skilled in the art that other drawings can be obtained based on the drawings without any inventive work.
FIG. 1 is a flow diagram of a sign language identification method provided by an embodiment of the present specification;
FIG. 2 is a diagram illustrating an embodiment of the structural components of a system to which the sign language recognition method provided by the embodiments of the present specification is applied;
FIG. 3 is a diagram illustrating an embodiment of a sign language recognition method provided by an embodiment of the present specification;
FIG. 4 is a diagram illustrating an embodiment of a sign language recognition method provided by an embodiment of the present specification;
FIG. 5 is a diagram illustrating an example scenario in which an embodiment of a sign language recognition method provided by an embodiment of the present specification is applied;
FIG. 6 is a diagram illustrating an embodiment of a sign language recognition method provided by an embodiment of the present specification;
FIG. 7 is a schematic diagram of a server according to an embodiment of the present disclosure;
fig. 8 is a schematic structural component diagram of a sign language recognition apparatus provided in an embodiment of the present specification;
fig. 9 is a flowchart illustrating a sign language identification method according to an embodiment of the present specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
Referring to fig. 1, a sign language recognition method is provided in the present specification. The method is particularly applied to the server side. In specific implementation, the method may include the following:
s101: acquiring target image data containing sign language actions of a target user;
s102: constructing a corresponding target heat map from the target image data; wherein the target heat map comprises multiple heat-map frames;
s103: processing the target heat map to obtain multiple fully-connected graphs;
s104: processing the multiple fully-connected graphs using a preset sign language recognition model to obtain a target semantic recognition result; wherein the preset sign language recognition model comprises at least a low-illumination processing network layer, which corrects the illumination component and the reflection component of the joint-point heat maps in a fully-connected graph to obtain a corrected fully-connected graph;
s105: determining, according to the target semantic recognition result, the target semantic content expressed by the sign language actions of the target user.
Through the above embodiment, a corresponding target heat map can be constructed from the target image data based on heat-map regression; the target heat map is processed to obtain fully-connected graphs containing rich joint-point features; and the fully-connected graphs are then processed with a pre-trained preset sign language recognition model comprising at least a low-illumination processing network layer, so that sign language actions in image data captured in complex low-illumination environments can be recognized semantically with relatively high accuracy, yielding a more accurate semantic recognition result and reducing recognition errors.
In some embodiments, the sign language recognition method can be specifically applied to the server side. As can be seen in fig. 2.
In this embodiment, the server may specifically include a background server applied to a network service platform (e.g., a network service platform of an XX electronic bank) side and capable of implementing functions such as data transmission and data processing. Specifically, the server may be, for example, an electronic device having data operation, storage function and network interaction function. Alternatively, the server may be a software program running in the electronic device and providing support for data processing, storage and network interaction. In the present embodiment, the number of servers is not particularly limited. The server may specifically be one server, or may also be several servers, or a server cluster formed by several servers.
In some embodiments, the target image data may be video data containing sign language actions, captured by the target user through a terminal device: for example, a short video containing the target user's sign language actions, or an animated image containing them.
In this embodiment, referring to fig. 2, the terminal device may specifically include a front end that is applied to a user side and is capable of implementing functions such as image data acquisition and image data transmission. Specifically, the terminal device may be, for example, an electronic device such as a desktop computer, a tablet computer, a notebook computer, and a mobile phone. Alternatively, the terminal device may be a software application capable of running in the electronic device. For example, it may be a client APP of an XX electronic bank running on a mobile phone, or the like.
In some embodiments, referring to fig. 2, when a target user needs to express a specific statement or ask a question to a network service platform, the target user may perform a corresponding sign language action to a camera of a terminal device to express a specific statement or ask a specific question of the target user. Correspondingly, the terminal equipment can collect target image data containing sign language actions of target users and send the target image data to a server of the network service platform in a wired or wireless mode. Correspondingly, the server receives and acquires the target image data.
Specifically, for example, user a has a language barrier and does not type. When a user A wants to inquire the consumption record of the user A in the month through a network service platform of the XX electronic bank, the client of the XX electronic bank in the mobile phone can be opened, and the user A clicks to enter a customer service consultation interface. In the customer service consultation interface, user a can select a video consultation option. Correspondingly, the mobile phone receives and responds to the selection operation of the user, and the camera of the mobile phone is automatically opened. At this time, the user a can make a corresponding sign language action to the mobile phone camera to express the request of inquiring the consumption record of the month. The mobile phone camera can shoot a video containing the sign language action of the user A as target image data; and sending the target image data to a server.
Because user A shoots the target image data at home, where the background is cluttered and the indoor light is relatively dim, and because a mobile phone camera performs worse than a professional camera, the target image data provided to the server is of relatively poor quality.
After receiving the target image data, the server may convert it, based on heat-map regression, into a corresponding target heat map containing multiple heat-map frames, and then process the target heat map with a preset heat-map processing model to obtain multiple fully-connected graphs. Since the original target image data was captured in a complex low-illumination environment, its quality is relatively poor, and so is the quality of the fully-connected graphs obtained at this point.
When performing semantic recognition of the sign language actions in the fully-connected graphs, the server may first use the low-illumination processing network layer in the preset sign language recognition model to decompose an illumination component and a reflection component from each fully-connected graph and correct the two components in a targeted way, and then reconstruct an illumination-enhanced corrected fully-connected graph from the corrected components. In this way, fully-connected graphs of relatively poor quality are converted into corrected fully-connected graphs of relatively good quality. The server may then continue with the preset sign language recognition model to perform recognition on the corrected fully-connected graphs, accurately determining that the target semantic content expressed by user A's sign language action is: "inquire this month's consumption records".
Further, the server may perform corresponding query according to the target semantic content to obtain a consumption record about the current month of the user a, and determine the consumption record as the target text content matched with the target semantic content.
Then, the server can utilize the trained text-to-speech conversion model to process the target text content to obtain a target speech audio corresponding to the target text content; and then the target voice audio is sent to the mobile phone of the user A.
Accordingly, reference may be made to FIG. 2. And the mobile phone of the user A receives the target voice audio and plays the target voice audio to the user A through the voice player of the mobile phone.
In some embodiments, the target image data may specifically include a plurality of frames of images arranged in sequence. Wherein each frame of image may contain the gesture posture of the target user. Gesture gestures in the multi-frame images are arranged according to the time sequence to form sign language actions of the target user.
The target image data may specifically include image data captured in a complex low-illumination environment. Such image data is usually blurry and noisy, with relatively poor image quality, and semantic recognition of the sign language actions in it by existing methods suffers large errors and low precision. A complex low-illumination environment can be understood as an environment with weak illumination, a cluttered background, and frequent occlusion.
In some embodiments, the corresponding target heat map may be obtained by predicting a probability map for each joint point in the multiple image frames contained in the target image data; the target heat map comprises multiple heat-map frames.
In some embodiments, constructing the corresponding target heat map from the target image data may be implemented by processing the current frame image in the target image data as follows to obtain the current heat-map frame:
s1: determining the discrete points in the current frame image, and creating a buffer around each discrete point with a corresponding radius;
s2: filling the buffer of each discrete point in the current frame image with a graded gray band (the full gray range being 0-255), from light to dark and from the inside outward, to obtain the filled current frame image;
s3: using the gray value as an index into a preset color band, re-coloring the filled current frame image to obtain the current heat-map frame.
In specific implementations, gray values may be superimposed during the filling operation. Generally, the larger the gray value, the brighter and whiter it appears in the gray band. In practice, any channel of the ARGB model can be chosen to carry the superimposed gray value. After filling, in regions of the filled image where buffers intersect, the gray values are superimposed: the more buffers overlap, the larger the gray value and the "hotter" the region.
The preset color ribbon may be a color ribbon containing 256 colors.
According to the method for processing the current frame image, other frame images in the target image data can be processed to obtain a target thermal map containing a plurality of frames of thermodynamic diagrams.
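The buffer-filling and re-coloring steps above can be sketched as follows; the linear fall-off, the radius, and the blue-to-red color band are illustrative assumptions rather than the patent's actual choices.

```python
import numpy as np

def frame_heatmap(points, h, w, radius=10.0):
    """Fill a gray 'buffer' (0-255) around each discrete point, superimposing
    gray values where buffers overlap, then re-color via a 256-entry color band."""
    gray = np.zeros((h, w), dtype=np.float64)
    yy, xx = np.mgrid[0:h, 0:w]
    for py, px in points:
        dist = np.hypot(yy - py, xx - px)
        gray += np.clip(1.0 - dist / radius, 0.0, 1.0) * 255  # fades outward
    gray = np.clip(gray, 0, 255).astype(np.uint8)              # superimposed, capped
    # illustrative color band: gray value indexes a 256 x 3 (R, G, B) lookup table
    ramp = np.arange(256)
    band = np.stack([ramp, np.zeros(256, dtype=int), 255 - ramp], axis=1).astype(np.uint8)
    return band[gray]  # (h, w, 3) colored heat-map frame
```

Running this per frame over the discrete points of each image yields the multi-frame target heat map.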
In some embodiments, processing the target heat map to obtain multiple fully-connected graphs may include the following:
s1: filtering the heat-map frames contained in the target heat map with a Gaussian filter function to obtain peak data of the heat maps;
s2: processing the heat-map frames contained in the target heat map together with their peak data using a preset heat-map processing model to obtain multiple fully-connected graphs.
Through this embodiment, filtering first and then performing heat-map regression with the preset heat-map processing model yields fully-connected graphs of good quality.
In some embodiments, the peak data of a heat map may specifically include: the peak value of the heat map, the confidence of the peak, the position of the joint point at the peak, and the like.
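A minimal sketch of the filtering step, assuming SciPy's Gaussian filter stands in for the Gaussian filter function mentioned above; the local-maximum detection, the threshold, and the confidence definition are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def peak_data(heatmap: np.ndarray, sigma: float = 1.0, thresh: float = 0.1):
    """Smooth a joint heat map with a Gaussian filter, then report each local
    peak's value, a normalized confidence, and the joint position at the peak."""
    smoothed = gaussian_filter(heatmap.astype(np.float64), sigma=sigma)
    # a pixel is a peak if it equals the local 3x3 maximum and exceeds the threshold
    local_max = (smoothed == maximum_filter(smoothed, size=3)) & (smoothed > thresh)
    peaks = []
    for y, x in zip(*np.nonzero(local_max)):
        peaks.append({"value": smoothed[y, x],
                      "confidence": smoothed[y, x] / smoothed.max(),
                      "position": (int(y), int(x))})
    return peaks
```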
In some embodiments, referring to fig. 3, the preset heat-map processing model may specifically include a joint heat map detector and a spatial submodel connected in series. The joint heat map detector determines and outputs a corresponding joint heat map from an input heat-map frame and its peak data; the spatial submodel constructs and outputs a corresponding fully-connected graph from the input joint heat map.
Through this embodiment, the joint heat map detector in the preset heat-map processing model can process the input heat-map frames and their peak data to determine and output joint heat maps; the spatial submodel in the same model then processes the joint heat maps output by the detector to determine and output fully-connected graphs.
In some embodiments, the joint point heat map detector may specifically include a model structure trained based on a deep neural network model, or the like; the spatial submodel may specifically include a model structure obtained by training based on an improved markov random field spatial model, and the like.
Through this embodiment, the joint heat map detector trained on a deep neural network model can locate human joint points and accurately produce joint heat maps; meanwhile, the spatial submodel trained on the improved Markov random field spatial model can pair each joint point with the others based on their interconnections, creating a fully-connected graph of good quality.
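As a toy stand-in for the spatial submodel's pairing step, the sketch below connects every detected joint to every other joint, weighting each edge by inverse distance; a real MRF-based submodel would learn these pairwise potentials, so the weighting and names here are assumptions.

```python
import numpy as np

def build_fully_connected_graph(joints):
    """Pair every detected joint with every other joint, returning a symmetric
    adjacency matrix whose edge weights decay with inter-joint distance."""
    n = len(joints)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.hypot(joints[i][0] - joints[j][0], joints[i][1] - joints[j][1])
            adj[i, j] = adj[j, i] = 1.0 / (1.0 + d)  # closer joints, stronger edge
    return adj
```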
In some embodiments, before implementation, the preset thermodynamic diagram processing model may be trained as follows:
s1: constructing an initial first structure based on a deep neural network model; constructing an initial second structure based on the improved Markov random field space model;
s2: acquiring thermodynamic diagram samples which have overlapping receptive fields and multiple resolutions and contain human body joint point positioning information, and training the initial first structure with the samples to obtain a first structure capable of identifying and outputting a joint point heat map;
s3: processing the thermodynamic map sample by using the first structure to obtain a joint point heat map sample; training an initial second structure by using the joint point heat map sample to obtain a second structure capable of creating a fully-connected graph by using the joint point heat map;
s4: establishing a combined model according to the first structure and the second structure;
s5: and based on the combined model, obtaining a preset thermodynamic diagram processing model containing the associated joint heat map detector and spatial submodel by carrying out back propagation on the whole network.
Through the embodiment, the preset thermodynamic diagram processing model with good effect can be effectively trained.
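The staged procedure S1–S5 above can be sketched as the following skeleton; the stub classes only record the order of operations (a real first structure would be a deep neural network and a real second structure an improved Markov random field spatial model), and all class and method names here are hypothetical:

```python
# Hypothetical sketch of the staged training flow S1-S5. The global `log`
# simply records which stage ran, so the ordering can be inspected.
log = []

class FirstStructure:          # joint heat map detector (built in S1, trained in S2)
    def train(self, heatmap_samples):
        log.append("train_first")

    def predict(self, heatmap_samples):
        log.append("predict_first")
        return ["joint_heatmap"] * len(heatmap_samples)

class SecondStructure:         # spatial submodel (built in S1, trained in S3)
    def train(self, joint_heatmap_samples):
        log.append("train_second")

def build_preset_model(heatmap_samples):
    first, second = FirstStructure(), SecondStructure()   # S1: construct both
    first.train(heatmap_samples)                          # S2: train first structure
    joint_maps = first.predict(heatmap_samples)           # S3: produce joint heat maps
    second.train(joint_maps)                              # S3: train second structure
    combined = (first, second)                            # S4: combined model
    log.append("joint_backprop")                          # S5: end-to-end back propagation
    return combined
```

The key design point the steps imply is that the second structure is trained on the *outputs* of the already-trained first structure before the whole network is fine-tuned jointly.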
In some embodiments, referring to fig. 4, the low-light processing network layer may include at least: a decomposition submodel, an adjustment submodel, and a reconstruction submodel; wherein the decomposition submodel is configured to decompose the illumination component and the reflection component from the fully-connected graph; the adjustment submodel is used for correcting the illumination component and the reflection component and outputting the corrected illumination component and the corrected reflection component; and the reconstruction submodel is used for reconstructing the fully-connected graph according to the corrected illumination component and the corrected reflection component so as to obtain the corrected fully-connected graph.
Through the embodiment, when the preset sign language recognition model is used specifically, the illumination component (which can be recorded as I) and the reflection component (which can be recorded as R) can be decomposed from the full-connected graph by using the decomposition submodel in the low-illumination processing network layer in the preset sign language recognition model; respectively carrying out targeted correction on the illumination component and the reflection component by utilizing an adjusting submodel in the low-illumination processing network layer to obtain a corrected illumination component and a corrected reflection component; and reconstructing a full-connected graph based on the corrected illumination component and the corrected reflection component by using a reconstruction sub-model in the low-illumination processing layer, so that a corrected full-connected graph with higher image quality equivalent to that under a normal illumination environment can be obtained. Furthermore, the preset sign language recognition model can be reused to carry out semantic recognition on the sign language action based on the corrected full-connection graph, so that the error in the recognition process can be effectively reduced, and the recognition accuracy is improved.
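A toy single-channel illustration of this Retinex-style decompose/adjust/reconstruct flow is given below. The specific choices here — illumination estimated as a 3×3 neighborhood maximum, gamma correction as the adjustment, and a pass-through of the reflection component — are assumptions for illustration only; the patent's submodels are learned networks:

```python
EPS = 1e-6  # avoid division by zero when forming R = S / I

def decompose(img):
    """Toy Retinex-style split of image S into illumination I and reflection R.
    I is taken as the 3x3 neighborhood maximum (an assumption); R = S / I."""
    h, w = len(img), len(img[0])
    illum = [[max(img[r2][c2]
                  for r2 in range(max(0, r - 1), min(h, r + 2))
                  for c2 in range(max(0, c - 1), min(w, c + 2)))
              for c in range(w)] for r in range(h)]
    refl = [[img[r][c] / (illum[r][c] + EPS) for c in range(w)]
            for r in range(h)]
    return illum, refl

def adjust(illum, refl, gamma=2.2):
    """Brighten the illumination component via gamma correction; pass the
    reflection component through (a real model would also denoise R)."""
    return [[v ** (1.0 / gamma) for v in row] for row in illum], refl

def reconstruct(illum, refl):
    """Element-wise product R * I yields the corrected image."""
    return [[refl[r][c] * illum[r][c] for c in range(len(illum[0]))]
            for r in range(len(illum))]
```

On a uniformly dark input the reconstructed image is brighter, mirroring how the corrected fully-connected graph approaches the quality of a normal-illumination image.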
In some embodiments, the preset sign language recognition model may specifically include a model structure trained based on YOLO v5, and the like.
Of course, in specific implementation, the preset sign language recognition model may also be constructed by using YOLO v3 or YOLO v4 according to specific situations and processing requirements.
It should be noted that, by using YOLO v5, compared to YOLO v3 or YOLO v4, when training a preset sign language recognition model, the feature of transfer learning based on pre-training weights can be better utilized, so that the model training is relatively lighter and faster.
Through the embodiment, the YOLO v5 can be used as an initial basic model structure, and a preset sign language recognition model meeting the requirements can be obtained through efficient and convenient training.
The above YOLO may be specifically understood as a convolutional neural network for real-time object detection. It may specifically include the following 3 main components: the backbone component, which is mainly used for extracting important features of the image; the neck component, which mainly uses a feature pyramid and generalizes across object scales to obtain better performance on unseen data; and the model head component, which is mainly used to perform the actual detection, applying anchor boxes to the features to generate output vectors; each output vector may specifically include class probabilities, an objectness score, a bounding box, and the like.
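Decoding one such head output vector can be sketched as follows; the vector layout `[cx, cy, w, h, objectness, class probabilities…]` and the score as objectness times class probability are common YOLO conventions assumed here, not details taken from the patent:

```python
def decode_vector(vec, class_names, obj_threshold=0.5):
    """Decode one hypothetical head output vector laid out as
    [cx, cy, w, h, objectness, p_class0, p_class1, ...]."""
    cx, cy, w, h, objectness = vec[:5]
    if objectness < obj_threshold:
        return None                        # no object at this anchor
    class_probs = vec[5:]
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return {"bbox": (cx, cy, w, h),
            "label": class_names[best],
            "score": objectness * class_probs[best]}
```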
In some embodiments, before implementation, referring to fig. 5, the method may further include the following:
s1: acquiring sample data; the sample data comprises a sample image which is collected under a low-illumination environment and contains a gesture and a sample image which is collected under a normal-illumination environment and contains a gesture;
s2: processing the sample data according to a preset processing rule to obtain a training sample set;
s3: constructing an initial sign language recognition model; wherein the initial sign language recognition model at least comprises an initial low-light processing network layer;
s4: and training the initial sign language recognition model by using the training sample set to obtain a preset sign language recognition model.
Through the embodiment, the acquired sample data can be effectively utilized, and the preset sign language recognition model meeting the precision requirement can be efficiently trained.
In some embodiments, the sample data may be specifically understood as a sample image containing a gesture related to a sign language action.
Further, the sample data may specifically include a sample image acquired in a complex low-light environment. Specifically, the sample image may be a 720p or 1080p image acquired by a camera of a mobile device such as a mobile phone in a specified environment.
In some embodiments, considering that sign language expression mostly uses the upper half of the body to produce motion, the bone data can be further used to segment out 15 joint points of the human body upward from the base of the spine, as the joint points concerned in recognition, including: head, neck, cervical spine, right shoulder, right elbow joint, right wrist joint, right thumb, right hand, right fingertip, left shoulder, left elbow joint, left wrist joint, left thumb, left hand, left fingertip. Correspondingly, images at least containing the 15 joint points can be collected in a targeted manner to serve as sample data.
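The 15 upper-body joint points enumerated above can be kept as a simple checklist for annotation tooling; the exact identifier strings below are assumptions (the description names the joints, not their labels):

```python
# The 15 upper-body joint points of interest, from the base of the spine up.
UPPER_BODY_JOINTS = [
    "head", "neck", "cervical_spine",
    "right_shoulder", "right_elbow", "right_wrist",
    "right_thumb", "right_hand", "right_fingertip",
    "left_shoulder", "left_elbow", "left_wrist",
    "left_thumb", "left_hand", "left_fingertip",
]
```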
In some embodiments, referring to fig. 6, the processing the sample data according to the preset processing rule to obtain the training sample set may include the following steps:
s1: carrying out preset data enhancement processing on the sample image to obtain a sample image subjected to data enhancement processing;
s2: marking the positions of the joint points, the connection relation among the joint points and the semantic content of the sample image in the sample image subjected to data enhancement processing to obtain a marked sample image;
s3: and combining the marked sample images to obtain a training sample set.
During specific labeling, the sample image and the sample image subjected to data enhancement processing can be labeled at the same time to obtain a labeled sample image.
Through the embodiment, the preset data enhancement processing is carried out on the sample data, so that the relatively limited sample data can be effectively expanded, and a richer training sample set is obtained; furthermore, the model can be trained by utilizing the training sample set so as to enhance the generalization capability and robustness of the model and obtain a preset sign language recognition model with relatively higher precision and relatively better effect.
In some embodiments, the preset data enhancement processing may specifically include at least one of: image flipping processing, image scaling processing, image rotation processing, and the like.
The image flipping process may specifically be mirror-like flipping process performed on the image data. The above-described image scaling processing (e.g., image random scaling processing) may specifically refer to respectively reducing and enlarging an image at a ratio of 1:2, and maintaining the sharpness of the image by controlling the scaling. The image rotation processing may specifically be rotation of an image by a certain angle.
Through the embodiment, various types of data enhancement processing can be performed on the sample image, so that relatively limited sample data can be fully and effectively utilized, and a training sample set with rich samples can be obtained through expansion.
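The three enhancement operations can be sketched on a plain pixel grid as follows; nearest-neighbour scaling and a fixed 90-degree rotation are simplifying assumptions (the patent allows rotation "by a certain angle"):

```python
def flip_horizontal(img):
    """Mirror-like flip: reverse each row of the image grid."""
    return [row[::-1] for row in img]

def scale(img, factor):
    """Nearest-neighbour scaling; factor 2 enlarges and factor 0.5 reduces,
    matching the 1:2 ratio mentioned above."""
    step = int(round(1 / factor)) if factor < 1 else 1
    rep = int(factor) if factor >= 1 else 1
    rows = [[v for v in row[::step] for _ in range(rep)] for row in img[::step]]
    return [r for r in rows for _ in range(rep)]

def rotate90(img):
    """Rotate the image grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]
```

Each labelled sample image would typically pass through some combination of these to expand the training sample set.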
In some embodiments, the decomposition submodel may specifically include 5 convolutional layers with ReLU.
In some embodiments, in the process of training the preset sign language recognition model, for the decomposition submodel, pairs of low-illumination sample images and normal-illumination sample images may be screened out from the training sample set to perform targeted training on the decomposition submodel. The low-illumination sample image and the normal-illumination sample image in each pair may be images acquired in a low-illumination environment and a normal-illumination environment, respectively.
When the decomposition submodel is specifically trained, a first target loss function aiming at the decomposition submodel can be constructed. Wherein the first objective loss function may include: a reconstruction loss term, a reflection component consistency loss term, and an illumination component smoothing loss term.
Correspondingly, when the decomposition submodel is specifically trained, based on the first target loss function, the paired low-illumination sample images and normal-illumination sample images can be used to learn the constraint relationships among four components: the illumination component I_low and reflection component R_low of the low-illumination sample image, and the illumination component I_normal and reflection component R_normal of the normal-illumination sample image; the model parameters are then continuously optimized and improved based on these constraint relationships to obtain a decomposition submodel meeting the requirements.
In some embodiments, the first objective loss function may be expressed in the form of:
L = L_recon + λ_ir · L_ir + λ_is · L_is

wherein L is the first target loss function value, L_recon is the reconstruction loss term, λ_ir is the first coefficient, L_ir is the reflection component consistency loss term, λ_is is the second coefficient, and L_is is the illumination component smoothing loss term.
In some embodiments, the above reconstruction loss term may be expressed in the form of:
L_recon = Σ_{i ∈ {low, normal}} Σ_{j ∈ {low, normal}} λ_ij · ‖R_i ∘ I_j − S_j‖_1

wherein λ_ij represents the third coefficient, R_i represents the reflection component of low or normal illumination, I_j represents the illumination component of low or normal illumination, S_j represents the output image of low or normal illumination after model processing, and ∘ represents the element-wise product operation.
In some embodiments, the reflected component consistency loss term may be expressed in the form:
L_ir = ‖R_low − R_normal‖_1
in this embodiment, it can be known from Retinex image decomposition theory that the reflection component R is independent of illumination, so the reflection components R of the paired low-illumination image and normal-illumination image should be as consistent as possible.
In some embodiments, the illumination component smoothing loss term may be specifically expressed in the form of:
L_is = Σ_{i ∈ {low, normal}} ‖∇I_i ∘ exp(−λ_g · ∇R_i)‖_1

wherein λ_g represents the fourth coefficient, ∇ represents the image gradient operator (covering the horizontal and vertical directions), and ∘ represents the element-wise product operation.
In this embodiment, the illumination component smoothing loss term expresses the following: an ideal illumination component I should be as smooth as possible in texture detail while its overall structure is well preserved. Formally, the above expression shows that the loss term assigns weights to the gradient map of the illumination component I by taking the gradient of the reflection component R, so that regions which are smoother on the reflection component R should likewise be as smooth as possible on the corresponding illumination component I.
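As a small numeric check of how the loss terms compose, the sketch below computes the reflection component consistency term and combines the three terms; the coefficient values and the use of a mean-absolute-difference L1 are assumptions for illustration:

```python
def l1(a, b):
    """Mean absolute difference between two equally shaped grids
    (a simple stand-in for the L1 norm in the loss terms)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)

def reflectance_consistency(r_low, r_normal):
    """L_ir = ||R_low - R_normal||_1: paired reflection components should match."""
    return l1(r_low, r_normal)

def total_loss(l_recon, l_ir, l_is, lam_ir=0.01, lam_is=0.1):
    """L = L_recon + lam_ir * L_ir + lam_is * L_is; coefficient values assumed."""
    return l_recon + lam_ir * l_ir + lam_is * l_is
```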
In some embodiments, the adjustment submodel performs a targeted correction process on the reflected component of low illumination and the illumination component of low illumination output by the decomposition submodel, respectively. In particular, for the reflected component of low illumination, the amplified noise therein may be specifically suppressed according to the corresponding illumination strategy. For the illumination component with low illumination, an encoder-decoder framework can be utilized, multi-scale connection is introduced, and context information about illumination distribution in a large range is obtained and utilized for self-adaptive adjustment.
In some embodiments, in the process of training the preset sign language recognition model, a second target loss function can be further constructed for the adjustment submodel; based on a second target loss function, carrying out targeted training on the adjusting sub-model; wherein the second target loss function may specifically comprise a reconstruction loss term and an illumination component smoothing loss term.
In some embodiments, in implementation, the reconstruction sub-model may be used to reconstruct and obtain the full-connected graph of normal illumination by multiplying the corrected illumination component of low illumination and the corrected reflection component of low illumination, so that the low-illumination full-connected graph with poor quality may be converted into the normal-illumination full-connected graph with better quality.
In some embodiments, referring to fig. 4, the preset sign language recognition model may further be connected with a classifier; the classifier is also connected with a top view sub-model and a bottom view sub-model.
The classifier can specifically classify the input multiple fully-connected graphs into 24 subclasses related to the characteristics of the joint points; and judge the degree of similarity among the gestures in the fully-connected graphs according to the classification results for the 24 subclasses, so as to determine whether the degree of similarity of the gestures in the fully-connected graphs is greater than a preset similarity threshold. The preset similarity threshold can be obtained by learning from a large amount of sample data. The top view submodel and the bottom view submodel are pre-trained model structures capable of performing semantic recognition based on relevant features of the joint points in the multiple fully-connected graphs. The top view submodel is suitable for processing multiple fully-connected graphs with relatively small differences among the gestures. The bottom view submodel is suitable for processing multiple fully-connected graphs with relatively large differences among the gestures.
In specific implementation, when the degree of similarity of the gestures in the multiple fully-connected graphs is determined to be greater than the preset similarity threshold, the classifier may input the multiple fully-connected graphs to the top view submodel for semantic recognition processing, so as to output a corresponding target semantic recognition result.
Conversely, when the similarity degree of the gestures in the full-connected graphs is determined to be less than or equal to the preset similarity degree threshold value, the classifier can input the full-connected graphs into the bottom-view submodel for semantic recognition processing so as to output corresponding target semantic recognition results.
Through the embodiment, a plurality of fully-connected graphs under different types of conditions can be distinguished by using a classifier in a preset sign language recognition model, and then targeted recognition processing can be performed on the plurality of fully-connected graphs by using a matched top view sub-model or bottom view sub-model, so that target semantic content can be determined more accurately.
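The routing decision above amounts to a simple threshold branch, which can be sketched as follows; the function signature and the idea of passing the submodels in as callables are assumptions for illustration:

```python
def route_fully_connected_graphs(graphs, similarity, threshold,
                                 top_view_model, bottom_view_model):
    """Route the fully-connected graphs to the top-view submodel when the
    gestures are highly similar, otherwise to the bottom-view submodel.
    `threshold` stands in for the preset similarity threshold, which the
    patent says is learned from a large amount of sample data."""
    submodel = top_view_model if similarity > threshold else bottom_view_model
    return submodel(graphs)
```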
In some embodiments, after obtaining the target semantic content, when the method is implemented, the following may be further included:
s1: determining a target text for replying a target user according to the target semantic content;
s2: and conveying the semantic content represented by the target text to a target user.
Through the embodiment, the server can find the matched target text according to the identified target semantic content; and transmitting corresponding semantic content to the target user according to the matched target text, and timely replying the target user.
In specific implementation, the server can be configured with a preset text library; the preset text library stores a large amount of preset texts for consulting conventional problems. The server can search a preset text library to find out a preset text matched with the target semantic content as a target text. After the preset text library is searched, the server can also forward the target semantic content to the customer service staff in time under the condition that the matched preset text is not found, so that the customer service staff can be switched in time to perform manual reply. In addition, the server may also perform corresponding operation processing (e.g., query processing) according to the target text to obtain a corresponding processing result (e.g., query result); and generating a corresponding target text according to the processing result.
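The lookup-then-fallback logic above can be sketched as follows; modelling the preset text library as a dictionary and the `("preset", ...)` / `("manual", ...)` return convention are assumptions for illustration:

```python
def reply_for(semantic_content, preset_text_library):
    """Look up a matched preset text for the recognized target semantic
    content; when no match is found, forward the content for manual reply
    by customer service staff."""
    target_text = preset_text_library.get(semantic_content)
    if target_text is not None:
        return ("preset", target_text)
    return ("manual", semantic_content)   # hand off to a human agent
```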
In some embodiments, the semantic content characterized by the target text is communicated to the target user, and when implemented, the following content may be included: inquiring a preset video database, and determining a sign language action video matched with the target text as a target sign language video; displaying the target sign language video to the target user; and/or querying a preset audio database, and determining the voice audio matched with the target text as the target voice audio; and playing the target voice audio to the target user.
The preset video database can be pre-stored with a plurality of preset sign language action videos, and each preset sign language action video carries a corresponding semantic tag. Similarly, a plurality of preset audio data may be stored in the preset audio database in advance, and each preset audio data carries a corresponding semantic tag.
Specifically, for example, the server may send the target sign language video to the terminal device, and the terminal device may display the target sign language video through the screen to convey semantic content represented by the target text to the user. The server can also send the target voice audio to the terminal device, and the terminal device can play the target voice audio through the voice player so as to convey semantic content represented by the target text to the user.
Through the embodiment, the semantic content represented by the target text can be transmitted to the target user by adopting various transmission modes, so that the diversified requirements of the user are met, and the interactive experience of the user is improved.
As can be seen from the above, based on the sign language recognition method provided in the embodiments of the present specification, before specific implementation, a preset sign language recognition model that can be suitable for processing images in a complex low-illumination environment can be obtained through pre-training; wherein, the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph. In specific implementation, after target image data containing sign language actions of a target user are acquired; firstly, a corresponding target thermodynamic diagram containing a plurality of frames of thermodynamic diagrams is constructed and obtained according to target image data; processing the target thermal map to obtain a plurality of full-connection maps; and processing the full-connected graphs by using a preset sign language recognition model to obtain a target semantic recognition result so as to accurately determine the target semantic content represented by the sign language action of the target user. 
Therefore, on one hand, by introducing heat map regression to process the target image data, the spatial information in the image can be fully utilized and abundant joint point features can be learned, so that the method can better cope with images captured in complex environments; on the other hand, the trained low-illumination processing network layer in the preset sign language recognition model first performs illumination correction on the fully-connected graph to be processed by utilizing the illumination component and the reflection component in the image, and only after obtaining a corrected fully-connected graph with a better effect does it proceed to extract and process the image features, so that images in low-illumination environments can be processed better. Therefore, semantic recognition of sign language actions in image data under complex low-illumination environments can be carried out accurately, a semantic recognition result with high accuracy is obtained, and recognition errors are reduced. Further, after sample data is obtained, preset data enhancement processing is performed on the sample data to expand the limited sample data into a richer training sample set; the model can then be trained with the training sample set to enhance its generalization capability and robustness, obtaining a preset sign language recognition model with relatively higher precision and relatively better effect.
Referring to fig. 9, another sign language recognition method is provided in the present embodiment. The method can be applied to the terminal equipment side. When the method is implemented, the following contents can be included:
s901: responding to a trigger operation initiated by a target user, and shooting target image data containing sign language actions of the target user;
s902: sending the target image data to a server; the server is used for constructing and obtaining a corresponding target thermal map according to the target image data; processing the target thermal map to obtain a plurality of fully-connected maps; processing the multiple fully-connected graphs by using a preset sign language recognition model, and determining target semantic content represented by sign language actions of a target user; the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph; the server is also used for determining a matched target text according to the target semantic content;
s903: and conveying the semantic content represented by the target text to a target user.
Through the embodiment, the target user can conveniently express the question or the consultation problem of the target user by shooting the target image data containing the sign language action by using the terminal equipment; the server can accurately determine the target semantic content represented by the sign language action of the target user according to the target image data and determine the matched target text content; the terminal equipment can reach the semantic content represented by the target text content to the target user in time, so that the target user can obtain better interactive experience.
Embodiments of the present specification further provide a server, including a processor and a memory for storing processor-executable instructions, where the processor, when implemented, may perform the following steps according to the instructions: acquiring target image data containing sign language actions of a target user; constructing and obtaining a corresponding target thermal map according to the target image data; wherein the target thermodynamic map comprises a plurality of frames of thermodynamic diagrams; processing the target thermal map to obtain a plurality of fully-connected maps; processing the multiple full-connected graphs by using a preset sign language recognition model to obtain a target semantic recognition result; the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph; and determining the target semantic content represented by the sign language action of the target user according to the target semantic recognition result.
In order to complete the above instructions more accurately, referring to fig. 7, another specific server is provided in the embodiments of the present specification, where the server includes a network communication port 701, a processor 702, and a memory 703, and the above structures are connected by an internal cable, so that the structures may perform specific data interaction.
The network communication port 701 may be specifically configured to obtain target image data including a sign language action of a target user.
The processor 702 may be specifically configured to construct a corresponding target thermal map according to the target image data; wherein the target thermodynamic map comprises a plurality of frames of thermodynamic diagrams; processing the target thermal map to obtain a plurality of full-connection maps; processing the multiple full-connected graphs by using a preset sign language recognition model to obtain a target semantic recognition result; the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph; and determining the target semantic content represented by the sign language action of the target user according to the target semantic recognition result.
The memory 703 may be specifically configured to store a corresponding instruction program.
In this embodiment, the network communication port 701 may be a virtual port bound to different communication protocols, so as to send or receive different data. For example, the network communication port may be a port responsible for web data communication, a port responsible for FTP data communication, or a port responsible for mail data communication. In addition, the network communication port can also be a communication interface or a communication chip of an entity. For example, it may be a wireless mobile network communication chip, such as GSM, CDMA, etc.; it can also be a Wifi chip; it may also be a bluetooth chip.
In this embodiment, the processor 702 may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The description is not intended to be limiting.
In this embodiment, the memory 703 may include multiple layers, and in a digital system, the memory may be any memory as long as it can store binary data; in an integrated circuit, a circuit without a physical form and with a storage function is also called a memory, such as a RAM, a FIFO and the like; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card and the like.
The present specification further provides a computer-readable storage medium based on the above sign language identification method, where the computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed, the computer program instructions implement: acquiring target image data containing sign language actions of a target user; constructing and obtaining a corresponding target thermal map according to the target image data; wherein the target thermodynamic map comprises a plurality of frames of thermodynamic diagrams; processing the target thermal map to obtain a plurality of full-connection maps; processing the multiple full-connected graphs by using a preset sign language recognition model to obtain a target semantic recognition result; the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph; and determining the target semantic content represented by the sign language action of the target user according to the target semantic recognition result.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer-readable storage medium can be explained in comparison with other embodiments, and are not described herein again.
Referring to fig. 8, in a software level, an embodiment of the present specification further provides a sign language recognition apparatus, which may specifically include the following structural modules:
an obtaining module 801, which may be specifically configured to obtain target image data including a sign language action of a target user;
the constructing module 802, which may be specifically configured to construct a corresponding target heat map according to the target image data, wherein the target heat map comprises a plurality of frames of heat maps;
the first processing module 803, which may be specifically configured to process the target heat map to obtain a plurality of fully-connected graphs;
the second processing module 804, which may be specifically configured to process the plurality of fully-connected graphs by using a preset sign language recognition model to obtain a target semantic recognition result, wherein the preset sign language recognition model comprises at least a low-illumination processing network layer configured to correct the illumination component and the reflection component of the joint point heat maps in the fully-connected graphs to obtain corrected fully-connected graphs;
the determining module 805, which may be specifically configured to determine, according to the target semantic recognition result, the target semantic content represented by the sign language action of the target user.
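The data flow through the modules above can be sketched as a simple pipeline. Every class name, function and placeholder value below is a hypothetical illustration of the described module structure, not an implementation from this specification.

```python
# Hypothetical sketch of the apparatus: five callables standing in for
# modules 801-805, wired into one recognition pipeline.
class SignLanguageRecognizer:
    def __init__(self, acquire, build_heatmaps, to_graphs, model, decode):
        self.acquire = acquire              # obtaining module (801)
        self.build_heatmaps = build_heatmaps  # constructing module (802)
        self.to_graphs = to_graphs          # first processing module (803)
        self.model = model                  # second processing module (804)
        self.decode = decode                # determining module (805)

    def run(self, source):
        images = self.acquire(source)
        heatmaps = self.build_heatmaps(images)
        graphs = self.to_graphs(heatmaps)
        result = self.model(graphs)
        return self.decode(result)

# Toy stand-ins that only show the shape of the data flow end to end.
recognizer = SignLanguageRecognizer(
    acquire=lambda s: [s],
    build_heatmaps=lambda imgs: [f"heatmap({i})" for i in imgs],
    to_graphs=lambda hms: [f"graph({h})" for h in hms],
    model=lambda gs: {"label": gs[0]},
    decode=lambda r: r["label"],
)
semantics = recognizer.run("frame0")
```

In a real apparatus each stand-in would be a trained network or image-processing routine; the sketch only fixes the interfaces between the five modules.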
In some embodiments, when the first processing module 803 is implemented, the target heat map may be processed in the following manner to obtain the plurality of fully-connected graphs: filtering the plurality of heat maps contained in the target heat map by using a Gaussian filter function to obtain peak data of the heat maps; and processing the plurality of heat maps contained in the target heat map and the peak data of the heat maps by using a preset heat map processing model to obtain the plurality of fully-connected graphs.
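The Gaussian filtering step above can be sketched as follows: each frame's heat map is smoothed and its maximum response taken as the peak datum. The array shapes, the `sigma` value and the returned dictionary layout are illustrative assumptions, not details from this specification.

```python
# Sketch of heat-map peak extraction via Gaussian filtering (assumed
# form; the actual preset model in the specification is not given).
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_peak_data(heatmaps, sigma=1.0):
    """Smooth each 2-D heat map with a Gaussian filter and return the
    coordinates and value of its maximum response."""
    peaks = []
    for hm in heatmaps:
        smoothed = gaussian_filter(hm, sigma=sigma)
        idx = np.unravel_index(np.argmax(smoothed), smoothed.shape)
        peaks.append({"pos": idx, "value": float(smoothed[idx])})
    return peaks

# Toy example: one 5x5 frame with a single hot spot at row 2, column 3.
frame = np.zeros((5, 5))
frame[2, 3] = 1.0
peak = extract_peak_data([frame])[0]
```

The peak data would then be fed, together with the heat maps themselves, into the preset heat map processing model described next.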
In some embodiments, the preset heat map processing model may specifically include a joint point heat map detector and a spatial submodel connected in series; wherein the joint point heat map detector is configured to determine and output a corresponding joint point heat map according to an input heat map and the peak data of the heat map; and the spatial submodel is configured to construct and output a corresponding fully-connected graph according to the input joint point heat map.
In some embodiments, the joint point heat map detector may specifically include a model structure obtained by training based on a deep neural network model; the spatial submodel includes a model structure obtained by training based on an improved Markov random field spatial model.
In some embodiments, the low-illumination processing network layer may include at least: a decomposition submodel, an adjustment submodel and a reconstruction submodel; wherein the decomposition submodel is configured to decompose the illumination component and the reflection component from the fully-connected graph; the adjustment submodel is configured to correct the illumination component and the reflection component and output the corrected illumination component and the corrected reflection component; and the reconstruction submodel is configured to reconstruct the fully-connected graph according to the corrected illumination component and the corrected reflection component, so as to obtain the corrected fully-connected graph.
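The decompose-adjust-reconstruct pipeline above can be sketched numerically in the spirit of Retinex-based low-light enhancement. The max-channel decomposition and the fixed gamma curve below are illustrative stand-ins for the learned submodels, not the trained networks of this specification.

```python
# Hedged numerical sketch of the three submodels: decompose an image
# into illumination and reflectance, brighten the illumination, and
# recombine. All specific formulas here are assumptions.
import numpy as np

def decompose(image, eps=1e-6):
    """Estimate illumination as the channel-wise maximum and
    reflectance as the ratio image / illumination (Retinex-style)."""
    illumination = image.max(axis=-1, keepdims=True)
    reflectance = image / (illumination + eps)
    return illumination, reflectance

def adjust(illumination, reflectance, gamma=0.45):
    """Brighten the illumination with a gamma curve; the reflectance
    is passed through unchanged in this toy version."""
    return np.power(illumination, gamma), reflectance

def reconstruct(illumination, reflectance):
    """Recombine the corrected components into an enhanced image."""
    return np.clip(illumination * reflectance, 0.0, 1.0)

dark = np.full((2, 2, 3), 0.04)  # a uniformly dark RGB patch in [0, 1]
enhanced = reconstruct(*adjust(*decompose(dark)))
```

In the specification these three stages are learned jointly with the rest of the recognition model; the sketch only illustrates why correcting the illumination component brightens a low-illumination input while preserving its reflectance structure.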
In some embodiments, the preset sign language recognition model comprises a model structure trained based on YOLOv5.
In some embodiments, the preset sign language recognition model is further connected with a classifier; the classifier is also respectively connected with a top view sub-model and a bottom view sub-model.
In some embodiments, the apparatus may further include a modeling module, which may be specifically configured to acquire sample data, wherein the sample data comprises sample images containing gestures collected in a low-illumination environment and sample images containing gestures collected in a normal-illumination environment; process the sample data according to a preset processing rule to obtain a training sample set; construct an initial sign language recognition model, wherein the initial sign language recognition model comprises at least an initial low-illumination processing network layer; and train the initial sign language recognition model by using the training sample set to obtain the preset sign language recognition model.
In some embodiments, when the modeling module is implemented, the sample data may be processed according to the preset processing rule in the following manner to obtain the training sample set: performing preset data enhancement processing on the sample images to obtain data-enhanced sample images; marking, in the data-enhanced sample images, the positions of the joint points, the connection relations among the joint points, and the semantic content of each sample image to obtain marked sample images; and combining the marked sample images to obtain the training sample set.
In some embodiments, the preset data enhancement processing may specifically include at least one of: image flipping processing, image scaling processing, image rotation processing, and the like.
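The enhancement operations listed above can be sketched with simple array transforms. The naive stride-2 downsampling used for scaling and the fixed 90-degree rotation below are illustrative placeholders, not the actual preset processing of this specification.

```python
# Hedged sketch of the data enhancement step: produce flipped, scaled
# and rotated copies of one sample image (grayscale, as a 2-D array).
import numpy as np

def augment(image):
    """Return the original image plus flipped, scaled and rotated copies."""
    flipped = np.fliplr(image)     # image flipping processing
    scaled = image[::2, ::2]       # naive downsampling as a stand-in for scaling
    rotated = np.rot90(image)      # 90-degree rotation as a stand-in for rotation
    return [image, flipped, scaled, rotated]

sample = np.arange(16, dtype=float).reshape(4, 4)
augmented = augment(sample)
```

After such expansion, each augmented image would still need its joint point positions, connection relations and semantic content marked before joining the training sample set.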
In some embodiments, the apparatus may further include a communication module, which may be configured to determine, after the target semantic content is obtained, a target text for replying to the target user according to the target semantic content; and convey the semantic content represented by the target text to the target user.
In some embodiments, when the communication module is implemented, the semantic content represented by the target text may be conveyed to the target user in the following manner: querying a preset video database, determining a sign language action video matching the target text as a target sign language video, and displaying the target sign language video to the target user; and/or querying a preset audio database, determining a voice audio matching the target text as a target voice audio, and playing the target voice audio to the target user.
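The query-and-reply flow above can be sketched as a lookup against preset media databases. The dictionary-backed "databases" and the file paths below are purely hypothetical stand-ins for the preset video and audio databases in the description.

```python
# Hedged sketch: look up a sign-language video and/or a voice audio
# clip matching the reply text, mirroring the and/or branches above.
def reply_to_user(target_text, video_db, audio_db):
    """Return whatever preset media matches the reply text."""
    reply = {}
    if target_text in video_db:
        reply["sign_video"] = video_db[target_text]   # to be displayed
    if target_text in audio_db:
        reply["voice_audio"] = audio_db[target_text]  # to be played
    return reply

# Illustrative one-entry databases.
video_db = {"hello": "videos/hello_sign.mp4"}
audio_db = {"hello": "audio/hello.wav"}
reply = reply_to_user("hello", video_db, audio_db)
```

A production system would replace the exact-match lookup with fuzzy or semantic matching, but the and/or structure of the reply would be the same.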
It should be noted that, the units, devices, modules, etc. illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. It is to be understood that, in implementing the present specification, functions of each module may be implemented in one or more pieces of software and/or hardware, or a module that implements the same function may be implemented by a combination of a plurality of sub-modules or sub-units, or the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Therefore, the sign language recognition apparatus provided by the embodiments of the present specification can accurately perform semantic recognition on sign language actions in image data captured in a complex low-illumination environment, obtain a semantic recognition result with high accuracy, and reduce recognition errors. The apparatus can also expand and enhance limited sample data to obtain a richer training sample set, and train the model with the training sample set, so as to improve the generalization capability and robustness of the model and obtain a preset sign language recognition model with relatively higher precision and relatively better effect.
Although the present specification provides method steps as described in the examples or flowcharts, additional or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an apparatus or client product in practice executes, it may execute sequentially or in parallel (e.g., in a parallel processor or multithreaded processing environment, or even in a distributed data processing environment) according to the embodiments or methods shown in the figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded. The terms first, second, etc. are used to denote names, but not any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be considered as a hardware component, and the means included therein for performing the various functions may also be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer-readable storage media including memory storage devices.
From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus necessary general hardware platform. With this understanding, the technical solutions in the present specification may be essentially embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments in the present specification.
The embodiments in the present specification are described in a progressive manner, and the same or similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
While the specification has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the specification that do not depart from the spirit of the specification, and it is intended that the appended claims include such variations and modifications that do not depart from the spirit of the specification.

Claims (15)

1. A sign language identification method, comprising:
acquiring target image data containing sign language actions of a target user;
constructing and obtaining a corresponding target thermal map according to the target image data; wherein the target thermodynamic map comprises a plurality of frames of thermodynamic diagrams;
processing the target thermal map to obtain a plurality of full-connection maps;
processing the multiple full-connected graphs by using a preset sign language recognition model to obtain a target semantic recognition result; the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph;
and determining the target semantic content represented by the sign language action of the target user according to the target semantic recognition result.
2. The method of claim 1, wherein processing the target thermal map to obtain a plurality of fully connected maps comprises:
filtering a plurality of thermodynamic diagrams contained in the target thermodynamic diagram by using a Gaussian filter function to obtain peak data of the thermodynamic diagrams;
and processing a plurality of thermodynamic diagrams contained in the target thermodynamic diagram and peak data of the thermodynamic diagrams by using a preset thermodynamic diagram processing model to obtain a plurality of full-connected diagrams.
3. The method of claim 2, wherein the pre-set thermodynamic diagram processing model comprises a joint hot spot diagram detector and a spatial submodel in series; wherein the joint heat map detector is used for determining and outputting a corresponding joint heat map according to the input thermodynamic diagram and peak data of the thermodynamic diagram; and the space submodel is used for constructing and outputting a corresponding full-connection graph according to the input joint point heat map.
4. The method of claim 3, wherein the joint point heat map detector comprises a model structure trained based on a deep neural network model; the spatial submodel comprises a model structure obtained based on training of an improved Markov random field spatial model.
5. The method of claim 1, wherein the low-illumination processing network layer comprises at least: a decomposition submodel, an adjustment submodel and a reconstruction submodel; wherein the decomposition submodel is configured to decompose the illumination component and the reflection component from the fully-connected graph; the adjustment submodel is configured to correct the illumination component and the reflection component and output the corrected illumination component and the corrected reflection component; and the reconstruction submodel is configured to reconstruct the fully-connected graph according to the corrected illumination component and the corrected reflection component, so as to obtain the corrected fully-connected graph.
6. The method of claim 5, wherein the preset sign language recognition model comprises a model structure trained based on YOLOv5.
7. The method according to claim 6, wherein the preset sign language recognition model is further connected with a classifier; the classifier is also respectively connected with a top view sub-model and a bottom view sub-model.
8. The method of claim 7, further comprising:
acquiring sample data; the sample data comprises a sample image which is collected under a low-illumination environment and contains a gesture and a sample image which is collected under a normal-illumination environment and contains a gesture;
processing the sample data according to a preset processing rule to obtain a training sample set;
constructing an initial sign language recognition model; wherein the initial sign language recognition model at least comprises an initial low-light processing network layer;
and training the initial sign language recognition model by using the training sample set to obtain a preset sign language recognition model.
9. The method of claim 8, wherein processing the sample data according to a preset processing rule to obtain a training sample set comprises:
carrying out preset data enhancement processing on the sample image to obtain a sample image subjected to data enhancement processing;
marking the positions of the joint points, the connection relation among the joint points and the semantic content of the sample image in the sample image subjected to data enhancement processing to obtain a marked sample image;
and combining the marked sample images to obtain a training sample set.
10. The method of claim 9, wherein the pre-defined data enhancement process comprises at least one of: image flipping processing, image scaling processing and image rotation processing.
11. The method of claim 1, wherein after obtaining the target semantic content, the method further comprises:
determining a target text for replying a target user according to the target semantic content;
and conveying semantic content characterized by the target text to a target user.
12. The method of claim 11, wherein communicating semantic content characterized by the target text to a target user comprises:
inquiring a preset video database, and determining a sign language action video matched with the target text as a target sign language video; displaying the target sign language video to the target user;
and/or,
inquiring a preset audio database, and determining a voice audio matched with the target text as a target voice audio; and playing the target voice audio to the target user.
13. A sign language recognition apparatus, comprising:
the acquisition module is used for acquiring target image data containing sign language actions of a target user;
the construction module is used for constructing and obtaining a corresponding target thermal map according to the target image data; wherein the target thermodynamic map comprises a plurality of frames of thermodynamic diagrams;
the first processing module is used for processing the target thermal map to obtain a plurality of full-connection maps;
the second processing module is used for processing the full-connection graphs by using a preset sign language recognition model to obtain a target semantic recognition result; the preset sign language recognition model at least comprises a low-illumination processing network layer; the low-illumination processing network layer is used for correcting the illumination component and the reflection component of the joint point heat map in the full-connected graph to obtain a corrected full-connected graph;
and the determining module is used for determining the target semantic content represented by the sign language action of the target user according to the target semantic recognition result.
14. A server comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 12.
15. A computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 12.
CN202210560752.8A 2022-05-23 2022-05-23 Sign language identification method and device and server Pending CN114821807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210560752.8A CN114821807A (en) 2022-05-23 2022-05-23 Sign language identification method and device and server

Publications (1)

Publication Number Publication Date
CN114821807A (en) 2022-07-29

Family

ID=82518042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210560752.8A Pending CN114821807A (en) 2022-05-23 2022-05-23 Sign language identification method and device and server

Country Status (1)

Country Link
CN (1) CN114821807A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination