CN111652164B - Isolated word sign language recognition method and system based on global-local feature enhancement - Google Patents

Isolated word sign language recognition method and system based on global-local feature enhancement

Info

Publication number
CN111652164B
CN111652164B CN202010513333.XA
Authority
CN
China
Prior art keywords
global
features
local
sign language
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010513333.XA
Other languages
Chinese (zh)
Other versions
CN111652164A (en)
Inventor
李厚强
周文罡
胡鹤臻
蒲俊福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010513333.XA priority Critical patent/CN111652164B/en
Publication of CN111652164A publication Critical patent/CN111652164A/en
Application granted granted Critical
Publication of CN111652164B publication Critical patent/CN111652164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256: Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70: Multimodal biometrics, e.g. combining information from different biometric modalities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an isolated word sign language recognition method and system based on global-local feature enhancement. The method comprises the following steps: acquiring a sign language video to be recognized, and performing feature extraction on the sign language video through a shared convolution layer to obtain a feature map; capturing context information of the feature map to obtain global features; capturing fine-grained information of the feature map to obtain local features; and performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized. The invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, which further improves the accuracy of sign language recognition.

Description

Isolated word sign language recognition method and system based on global-local feature enhancement
Technical Field
The invention relates to the technical field of sign language recognition, in particular to a sign language recognition method and system for isolated words based on global-local feature enhancement.
Background
According to the second national sample survey of disabled people, the number of people with hearing disabilities in China is as high as 27.8 million. Among hearing-impaired people, the most common medium of communication is sign language. As a visual language, sign language has its own unique linguistic characteristics: semantic information is expressed mainly through context-dependent manual features (hand shape, hand movement, position, etc.), assisted by fine-grained non-manual features (facial expression, lip shape, etc.).
To facilitate communication between hearing and deaf people, sign language recognition has emerged and been widely studied. It converts an input sign language video into corresponding text or speech output through computer algorithms, and involves fields such as multi-modal human-computer interaction, computer vision, and natural language processing.
Isolated word sign language recognition means that, given a video of a single sign language word as input, the system recognizes the word corresponding to that video. It can be viewed as a fine-grained classification problem. Accurate discrimination of isolated sign language words depends not only on manual features; fine-grained non-manual features also play an important role. There exist confusing isolated words with different meanings that share the same manual features but differ in non-manual features; in Chinese sign language, for example, the words "if" and "fake" are distinguished only by mouth movement. This ambiguity poses a significant challenge to accurate isolated word recognition. The recognition process of the whole system first extracts a representation of the input sign language video, then transforms it into a probability vector, and takes the category with the maximum probability as the final recognition result. With the development of deep learning and hardware computing power in recent years, deep-learning-based isolated word sign language recognition systems have become dominant: a representation is extracted by a convolutional neural network (CNN), converted into a probability vector through a fully connected layer and a Softmax layer, and the category corresponding to the maximum probability is taken as the recognition result.
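For illustration only, the following is a minimal sketch of this generic pipeline, assuming a PyTorch implementation; the backbone choice, feature dimension and number of classes are illustrative placeholders rather than a configuration disclosed by the invention.

import torch
import torch.nn as nn

class IsolatedSignClassifier(nn.Module):
    """Generic pipeline: CNN representation -> fully connected layer -> Softmax -> argmax."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512, num_classes: int = 500):
        super().__init__()
        self.backbone = backbone              # convolutional feature extractor (CNN)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, time, height, width)
        feat = self.backbone(video)           # assumed to return a (batch, feat_dim, T', H', W') map
        feat = feat.mean(dim=[2, 3, 4])       # pool to a (batch, feat_dim) representation
        return self.fc(feat).softmax(dim=-1)  # probability vector over sign-word categories

# Usage (hypothetical backbone): probs = model(video); word = probs.argmax(dim=-1)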
Therefore, the feature extraction step is crucial in isolated word sign language recognition. Traditional recognition methods fall into two types. The first extracts features directly from the global view; however, sign language contains fine-grained local cues, and this type of method lacks attention to such cues, which leads to misclassification. The second extracts local hand features as an aid, but it still cannot adaptively attend to the fine-grained non-manual features that distinguish confusing words.
These two shortcomings are the main problems in the prior art. Therefore, how to take global and local features into account simultaneously, adaptively enhance each of them, and make the learning of the two kinds of features promote each other so as to further improve the accuracy of sign language recognition is a problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the invention provides an isolated word sign language recognition method based on global-local feature enhancement, which takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, so that the accuracy of sign language recognition is further improved.
The invention provides an isolated word sign language recognition method based on global-local feature enhancement, which comprises the following steps:
acquiring a sign language video to be recognized;
carrying out feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
capturing context information of the feature map to obtain global features;
capturing fine-grained information of the feature map to obtain local features;
and performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized.
Preferably, the capturing context information of the feature map to obtain a global feature includes:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
and aggregating the features from the features C by using the enhanced feature map E, and forming global features with the feature map X.
Preferably, the capturing fine-grained information of the feature map to obtain local features includes:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing them to obtain distribution functions with respect to the X axis and the Y axis;
and sampling the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
Preferably, the saliency map is generated from high-order global features via trilinear attention.
Preferably, the obtaining the recognition result of the sign language video to be recognized through collaborative learning based on the global features and the local features includes:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
An isolated word sign language recognition system based on global-local feature enhancement, comprising:
the acquisition module is used for acquiring a sign language video to be recognized;
the feature extraction module is used for performing feature extraction on the sign language video to be recognized through the shared convolution layer to obtain a feature map;
the global enhancement module is used for capturing the context information of the feature map to obtain global features;
the local enhancement module is used for capturing fine-grained information of the feature map to obtain local features;
and the collaborative learning module is used for carrying out collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized.
Preferably, the global enhancement module is specifically configured to:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
and aggregating the features from the features C by using the enhanced feature map E, and forming global features with the feature map X.
Preferably, the local enhancement module is specifically configured to:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing them to obtain distribution functions with respect to the X axis and the Y axis;
and sampling the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
Preferably, the saliency map is generated from high-order global features via trilinear attention.
Preferably, the collaborative learning module is specifically configured to:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
In summary, the invention discloses an isolated word sign language recognition method based on global-local feature enhancement. When an isolated sign language word needs to be recognized, a sign language video to be recognized is first acquired, and feature extraction is then performed on it through a shared convolution layer to obtain a feature map; context information of the feature map is captured to obtain global features; fine-grained information of the feature map is captured to obtain local features; and collaborative learning is performed based on the global features and the local features to obtain the recognition result of the sign language video to be recognized. The invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, which further improves the accuracy of sign language recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of a sign language recognition method for isolated words based on global-local feature enhancement according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an isolated word sign language recognition system based on global-local feature enhancement according to the present invention;
FIG. 3 is a schematic diagram of the global-local feature enhancement-based isolated word sign language recognition system according to the present invention;
FIG. 4 is a schematic diagram of the operation of the global enhancement module disclosed in the present invention;
FIG. 5 is a schematic diagram of the operation of the local enhancement module disclosed in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1, which is a flowchart of a method of an embodiment of a global-local feature enhancement-based isolated word sign language recognition method disclosed by the present invention, the method may include:
s101, acquiring a sign language video to be identified;
when isolated word sign language needs to be recognized, firstly, a sign language video to be recognized is obtained.
S102, extracting the features of the sign language video to be recognized through a shared convolution layer to obtain a feature map;
Then, as shown in FIG. 3, the sign language video to be recognized is fed into a shared convolution layer to extract features and obtain a feature map. The network is then divided into two branches: a global branch for capturing context information and a local branch for capturing fine-grained cues.
S103, capturing context information of the feature map to obtain global features;
the global enhancement module of the global branch is intended to capture motion cues containing context information, and filter out irrelevant interference information, such as the clothing and background of the speaker.
Specifically, as shown in FIGS. 3 and 4, for a feature map X ∈ R^(C×T×H×W), three feature maps with the same shape as X are first generated by passing X through independent convolution layers; they are denoted as A, B, and C, respectively. Each of them is then reshaped to C × N, where N = T × H × W. Matrix multiplication is then used to compute the similarity between voxels, and normalization is performed with the Softmax function. This operation defines the enhanced feature map E ∈ R^(N×N) as follows:
E_{ij} = \frac{\exp(A_i \cdot B_j)}{\sum_{k=1}^{N} \exp(A_i \cdot B_k)}
where A_i and B_j denote the feature vectors at the i-th and j-th locations, respectively, and E_ij denotes the influence of the j-th position on the i-th position. Next, the features from C are weighted and aggregated using the enhanced feature map E:
G_i = \sum_{j=1}^{N} E_{ij} C_j + X_i
In this way, each voxel is enhanced with long-range dependencies: discriminative contextual cues reinforce each other while irrelevant information is suppressed.
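A minimal sketch of this global enhancement step is given below, assuming a PyTorch implementation; the 1×1×1 convolutions producing A, B and C and the axis over which Softmax is applied are assumptions consistent with how E is used for aggregation, not the exact configuration of the invention.

import torch
import torch.nn as nn

class GlobalEnhancement(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # independent convolution layers producing A, B and C with the same shape as X
        self.conv_a = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv_b = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv_c = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C, T, H, W); N = T*H*W voxels
        b, c, t, h, w = x.shape
        n = t * h * w
        a = self.conv_a(x).view(b, c, n)
        bm = self.conv_b(x).view(b, c, n)
        cm = self.conv_c(x).view(b, c, n)
        # pairwise voxel similarity A_i . B_j, normalized with Softmax -> E of shape (b, N, N)
        e = torch.softmax(torch.bmm(a.transpose(1, 2), bm), dim=-1)
        # weighted aggregation of C by E, plus a residual connection with X (the global feature)
        out = torch.bmm(cm, e.transpose(1, 2)).view(b, c, t, h, w)
        return out + x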
S104, capturing fine-grained information of the feature map to obtain local features;
Besides complex motion cues carrying context information, sign language videos also contain some fine-grained cues. These cues typically include lip shape, gaze, facial expression, or combinations of them. They occupy a small spatial extent and vary over time in the video, so they easily vanish under repeated convolution and pooling operations. To retain such fine-grained cues, a local enhancement module based on adaptive sampling is introduced, which adaptively samples the input feature map X ∈ R^(C×T×H×W).
Specifically, as shown in FIGS. 3 and 5, taking the feature map X_t at a certain time t as an example, the local enhancement module uses a saliency map S ∈ R^(H×W) as a guide: regions with larger saliency values are sampled more densely. The saliency map is first integrated along the x and y axes and normalized:
F_x(k_x) = \frac{\sum_{i=1}^{k_x} \sum_{j=1}^{H} S_{j,i}}{\sum_{i=1}^{W} \sum_{j=1}^{H} S_{j,i}}

F_y(k_y) = \frac{\sum_{j=1}^{k_y} \sum_{i=1}^{W} S_{j,i}}{\sum_{j=1}^{H} \sum_{i=1}^{W} S_{j,i}}
where k_x ∈ [1, W] and k_y ∈ [1, H]. Distribution functions with respect to the x and y axes are thus obtained, and sampling is performed according to their inverse functions, as follows:
O_{j,i} = X_t\left(F_y^{-1}(j / H),\; F_x^{-1}(i / W)\right)
where O is a feature map sampled at time t.
The saliency map S is generated via trilinear attention from the high-order features of the global branch:
S_t = \mathrm{mean}\left(\mathrm{softmax}(Y_t Y_t^{\top}) Y_t\right)
where Y_t ∈ R^(C2×N2) denotes the reshaped high-order features from the global branch at time t, and mean(·) denotes averaging along the channel dimension, which yields a more robust saliency map. S is further reshaped and upsampled to the same spatial size as X_t. By resampling the original feature map, fine-grained cues are located and emphasized, and are therefore more easily captured by subsequent convolution operations.
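The following is a minimal sketch of this saliency-guided adaptive sampling, assuming a PyTorch implementation; the trilinear-attention saliency, the cumulative-sum "integral", and the grid_sample-based resampling are one plausible reading of the description, and h2, w2 (the spatial size of the saliency map before upsampling) are illustrative parameters.

import torch
import torch.nn.functional as F

def local_enhancement(x_t: torch.Tensor, y_t: torch.Tensor, h2: int, w2: int) -> torch.Tensor:
    # x_t: (C, H, W) feature map at time t; y_t: (C2, N2) reshaped high-order global features
    c, h, w = x_t.shape
    # trilinear attention softmax(Y Y^T) Y, then channel-wise mean -> saliency map S
    s = torch.softmax(y_t @ y_t.t(), dim=-1) @ y_t
    s = s.mean(dim=0).view(1, 1, h2, w2)
    s = F.interpolate(s, size=(h, w), mode='bilinear', align_corners=False).view(h, w)
    s = s.clamp(min=1e-6)                          # keep the distribution functions increasing
    # "integrals" of S along the x and y axes, normalized to distribution functions F_x, F_y
    fx = s.sum(dim=0).cumsum(0); fx = fx / fx[-1]  # (W,)
    fy = s.sum(dim=1).cumsum(0); fy = fy / fy[-1]  # (H,)
    # inverse-transform sampling: regions with larger saliency are sampled more densely
    ix = torch.searchsorted(fx, torch.linspace(0.0, 1.0, w)).clamp(max=w - 1).float()
    iy = torch.searchsorted(fy, torch.linspace(0.0, 1.0, h)).clamp(max=h - 1).float()
    gx = ix / (w - 1) * 2 - 1                      # normalize to [-1, 1] for grid_sample
    gy = iy / (h - 1) * 2 - 1
    grid_y, grid_x = torch.meshgrid(gy, gx, indexing='ij')
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)   # (1, H, W, 2) in (x, y) order
    out = F.grid_sample(x_t.unsqueeze(0), grid, align_corners=True)
    return out.squeeze(0)                          # (C, H, W) resampled local features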
And S105, performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized.
As shown in FIG. 3, the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy losses, denoted Λ_global and Λ_local, respectively. To further promote the learning of the two branches, a mutual-promotion loss Λ_mu is used to constrain the relationship between them. The learning of the whole framework is supervised by the sum of the above loss functions:
Λ = Λ_global + Λ_local + Λ_mu
In the testing stage, the results of the global branch and the local branch are fused, and the category with the highest prediction probability is taken as the prediction result.
Specifically, the global enhancement module and the local enhancement module enhance key features in the sign language video from two complementary angles: one focuses on capturing long-range context-dependent information, while the other emphasizes discriminative fine-grained cues. However, the local enhancement module inevitably causes some loss of global information, and it is difficult for it to feed information back to the global enhancement module. To this end, the invention uses a collaborative learning module so that the two modules are optimized in a collaborative manner. Let the probability distributions predicted by the two branches be p_1 and p_2; the degree of match between the two branches is calculated with the Kullback-Leibler (KL) divergence as follows:
D_{KL}(p_2 \| p_1) = \sum_{m=1}^{M} p_2^{m} \log \frac{p_2^{m}}{p_1^{m}}

D_{KL}(p_1 \| p_2) = \sum_{m=1}^{M} p_1^{m} \log \frac{p_1^{m}}{p_2^{m}}
where M represents the total number of categories. The sum of the above two KL divergences is taken as the collaborative learning loss function:
Λ_mu = D_KL(p_2 || p_1) + D_KL(p_1 || p_2).
although these two branches emphasize global and local features, respectively, their goal is correct recognition of the same sign language video. By taking the distribution predicted by the other party as a reference, a connection between the two branches is established. The local branch can implicitly influence the sampling process. Meanwhile, the fine-grained clues emphasized after correction can be better classified and identified together with the global branch.
In summary, the invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, so that the accuracy of sign language recognition is further improved.
As shown in fig. 2, which is a schematic structural diagram of an embodiment of an isolated word sign language recognition system based on global-local feature enhancement disclosed in the present invention, the system may include:
an obtaining module 201, configured to acquire a sign language video to be recognized;
When an isolated sign language word needs to be recognized, a sign language video to be recognized is obtained first.
The feature extraction module 202 is configured to perform feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
Then, the sign language video to be recognized is fed into the shared convolution layer to extract features and obtain a feature map. The network is then divided into two branches: a global branch for capturing context information and a local branch for capturing fine-grained cues.
The global enhancing module 203 is used for capturing context information of the feature map to obtain global features;
the global enhancement module of the global branch is intended to capture motion cues containing context information, and filter out irrelevant interference information, such as the clothing and background of the speaker.
Specifically, as shown in FIGS. 3 and 4, for a feature map X ∈ R^(C×T×H×W), three feature maps with the same shape as X are first generated by passing X through independent convolution layers; they are denoted as A, B, and C, respectively. Each of them is then reshaped to C × N, where N = T × H × W. Matrix multiplication is then used to compute the similarity between voxels, and normalization is performed with the Softmax function. This operation defines the enhanced feature map E ∈ R^(N×N) as follows:
E_{ij} = \frac{\exp(A_i \cdot B_j)}{\sum_{k=1}^{N} \exp(A_i \cdot B_k)}
where A_i and B_j denote the feature vectors at the i-th and j-th locations, respectively, and E_ij denotes the influence of the j-th position on the i-th position. Next, the features from C are weighted and aggregated using the enhanced feature map E:
G_i = \sum_{j=1}^{N} E_{ij} C_j + X_i
In this way, each voxel is enhanced with long-range dependencies: discriminative contextual cues reinforce each other while irrelevant information is suppressed.
The local enhancement module 204 is configured to capture fine-grained information of the feature map to obtain a local feature;
Besides complex motion cues carrying context information, sign language videos also contain some fine-grained cues. These cues typically include lip shape, gaze, facial expression, or combinations of them. They occupy a small spatial extent and vary over time in the video, so they easily vanish under repeated convolution and pooling operations. To retain such fine-grained cues, a local enhancement module based on adaptive sampling is introduced, which adaptively samples the input feature map X ∈ R^(C×T×H×W).
Specifically, as shown in FIGS. 3 and 5, taking the feature map X_t at a certain time t as an example, the local enhancement module uses a saliency map S ∈ R^(H×W) as a guide: regions with larger saliency values are sampled more densely. The saliency map is first integrated along the x and y axes and normalized:
F_x(k_x) = \frac{\sum_{i=1}^{k_x} \sum_{j=1}^{H} S_{j,i}}{\sum_{i=1}^{W} \sum_{j=1}^{H} S_{j,i}}

F_y(k_y) = \frac{\sum_{j=1}^{k_y} \sum_{i=1}^{W} S_{j,i}}{\sum_{j=1}^{H} \sum_{i=1}^{W} S_{j,i}}
where k_x ∈ [1, W] and k_y ∈ [1, H]. Distribution functions with respect to the x and y axes are thus obtained, and sampling is performed according to their inverse functions, as follows:
O_{j,i} = X_t\left(F_y^{-1}(j / H),\; F_x^{-1}(i / W)\right)
where O is a feature map sampled at time t.
The saliency map S is generated via trilinear attention from the high-order features of the global branch:
S_t = \mathrm{mean}\left(\mathrm{softmax}(Y_t Y_t^{\top}) Y_t\right)
where Y_t ∈ R^(C2×N2) denotes the reshaped high-order features from the global branch at time t, and mean(·) denotes averaging along the channel dimension, which yields a more robust saliency map. S is further reshaped and upsampled to the same spatial size as X_t. By resampling the original feature map, fine-grained cues are located and emphasized, and are therefore more easily captured by subsequent convolution operations.
And the collaborative learning module 205 is configured to perform collaborative learning based on the global features and the local features to obtain a recognition result of the sign language video to be recognized.
As shown in FIG. 3, the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy losses, denoted Λ_global and Λ_local, respectively. To further promote the learning of the two branches, a mutual-promotion loss Λ_mu is used to constrain the relationship between them. The learning of the whole framework is supervised by the sum of the above loss functions:
Λ = Λ_global + Λ_local + Λ_mu
In the testing stage, the results of the global branch and the local branch are fused, and the category with the highest prediction probability is taken as the prediction result.
Specifically, the global enhancement module and the local enhancement module enhance key features in the sign language video from two complementary angles: one focuses on capturing long-range context-dependent information, while the other emphasizes discriminative fine-grained cues. However, the local enhancement module inevitably causes some loss of global information, and it is difficult for it to feed information back to the global enhancement module. To this end, the invention uses a collaborative learning module so that the two modules are optimized in a collaborative manner. Let the probability distributions predicted by the two branches be p_1 and p_2; the degree of match between the two branches is calculated with the Kullback-Leibler (KL) divergence as follows:
D_{KL}(p_2 \| p_1) = \sum_{m=1}^{M} p_2^{m} \log \frac{p_2^{m}}{p_1^{m}}

D_{KL}(p_1 \| p_2) = \sum_{m=1}^{M} p_1^{m} \log \frac{p_1^{m}}{p_2^{m}}
where M represents the total number of categories. The sum of the above two KL divergences is taken as the collaborative learning loss function:
Λ_mu = D_KL(p_2 || p_1) + D_KL(p_1 || p_2).
although these two branches emphasize global and local features, respectively, their goal is correct recognition of the same sign language video. By taking the distribution predicted by the other party as a reference, a connection between the two branches is established. The local branch can implicitly influence the sampling process. Meanwhile, the fine-grained clues emphasized after correction can be better classified and identified together with the global branch.
In summary, the invention takes global and local features into account simultaneously and adaptively enhances each of them; meanwhile, the learning of the two kinds of features promotes each other, so that the accuracy of sign language recognition is further improved.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A sign language recognition method for isolated words based on global-local feature enhancement is characterized by comprising the following steps:
acquiring a sign language video to be recognized, wherein the sign language video comprises isolated sign language words;
performing feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
capturing context information of the feature map to obtain global features;
capturing fine-grained information of the feature map to obtain local features;
performing collaborative learning based on the global features and the local features, supervised by cross-entropy loss functions and a mutual-promotion loss function, to obtain the recognition result of the sign language video to be recognized, wherein the mutual-promotion loss function Λ_mu is equal to D_KL(p_2 || p_1) + D_KL(p_1 || p_2), where p_1 and p_2 are the probability distributions predicted by the global branch and the local branch respectively, and D_KL(p_2 || p_1) and D_KL(p_1 || p_2) are the degrees of match between the global branch and the local branch, each calculated using the Kullback-Leibler (KL) divergence; wherein the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy loss functions, denoted Λ_global and Λ_local respectively; the collaborative learning supervised by the cross-entropy loss functions and the mutual-promotion loss function comprises: performing supervision and collaborative learning using the function Λ = Λ_global + Λ_local + Λ_mu;
the capturing the context information of the feature map to obtain the global features comprises:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
aggregating features from the features C by using the enhanced feature map E, and forming global features with the feature map X;
capturing fine-grained information of the feature map to obtain local features, wherein the capturing comprises:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing the integrals to obtain distribution functions of the saliency map about the X axis and the Y axis;
and performing adaptive sampling on the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
2. The method of claim 1, wherein the saliency map is generated from high-order global features via trilinear attention.
3. The method according to claim 1, wherein the performing collaborative learning based on the global features and the local features to obtain the recognition result of the sign language video to be recognized comprises:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
4. A system for isolated word sign language recognition based on global-local feature enhancement, comprising:
the acquisition module is used for acquiring a sign language video to be recognized, wherein the sign language video comprises isolated words of sign language;
the feature extraction module is used for performing feature extraction on the sign language video to be recognized through a shared convolution layer to obtain a feature map;
the global enhancement module is used for capturing the context information of the feature map to obtain global features;
the local enhancement module is used for capturing fine-grained information of the feature map to obtain local features;
a collaborative learning module, configured to perform collaborative learning based on the global features and the local features, supervised by cross-entropy loss functions and a mutual-promotion loss function, to obtain the recognition result of the sign language video to be recognized, wherein the mutual-promotion loss function is equal to D_KL(p_2 || p_1) + D_KL(p_1 || p_2), where p_1 and p_2 are the probability distributions predicted by the global branch and the local branch respectively, and D_KL(p_2 || p_1) and D_KL(p_1 || p_2) are the degrees of match between the global branch and the local branch, each calculated using the Kullback-Leibler (KL) divergence; wherein the outputs z_global and z_local of the global branch for capturing context information and the local branch for capturing fine-grained cues are both supervised with cross-entropy loss functions, denoted Λ_global and Λ_local respectively; the collaborative learning supervised by the cross-entropy loss functions and the mutual-promotion loss function comprises: performing supervision and collaborative learning using the function Λ = Λ_global + Λ_local + Λ_mu;
the global enhancement module is specifically configured to:
generating a feature A, a feature B and a feature C which have the same shape as the feature map through independent convolution layers for the feature map X;
defining an enhanced feature map E based on the features A and B;
aggregating features from the features C by using the enhanced feature map E, and forming global features with the feature map X;
the local enhancement module is specifically configured to:
calculating integrals of the saliency map along the X axis and the Y axis and normalizing the integrals to obtain distribution functions of the saliency map about the X axis and the Y axis;
and performing adaptive sampling on the feature map based on the inverse functions of the distribution functions to obtain the sampled local features.
5. The system of claim 4, wherein the saliency map is generated from high-order global features via trilinear attention.
6. The system of claim 4, wherein the collaborative learning module is specifically configured to:
and performing collaborative learning based on the global features and the local features, and taking the category with the highest prediction probability as the recognition result of the sign language video to be recognized.
CN202010513333.XA 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement Active CN111652164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010513333.XA CN111652164B (en) 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010513333.XA CN111652164B (en) 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement

Publications (2)

Publication Number Publication Date
CN111652164A CN111652164A (en) 2020-09-11
CN111652164B true CN111652164B (en) 2022-07-15

Family

ID=72347283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010513333.XA Active CN111652164B (en) 2020-06-08 2020-06-08 Isolated word sign language recognition method and system based on global-local feature enhancement

Country Status (1)

Country Link
CN (1) CN111652164B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861707A (en) * 2021-02-03 2021-05-28 重庆市风景园林科学研究院 Harmful organism visual identification method, device, equipment and readable storage medium
CN114898143B (en) * 2022-04-19 2024-07-05 天津大学 Global and local visual feature-based collaborative classification method, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420A (en) * 2019-07-18 2019-10-18 腾讯科技(深圳)有限公司 Sign Language Recognition Method, device, computer readable storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679042B2 (en) * 2018-10-09 2020-06-09 Irene Rogan Shaffer Method and apparatus to accurately interpret facial expressions in American Sign Language

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348420A (en) * 2019-07-18 2019-10-18 腾讯科技(深圳)有限公司 Sign Language Recognition Method, device, computer readable storage medium and computer equipment

Also Published As

Publication number Publication date
CN111652164A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
WO2015180368A1 (en) Variable factor decomposition method for semi-supervised speech features
CN111652164B (en) Isolated word sign language recognition method and system based on global-local feature enhancement
CN111108508B (en) Face emotion recognition method, intelligent device and computer readable storage medium
CN115311463B (en) Category-guided multi-scale decoupling marine remote sensing image text retrieval method and system
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Ahammad et al. Recognizing Bengali sign language gestures for digits in real time using convolutional neural network
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture
CN114821735A (en) Intelligent storage cabinet based on face recognition and voice recognition
Le et al. Multi visual and textual embedding on visual question answering for blind people
CN113936317A (en) Priori knowledge-based facial expression recognition method
Liu et al. Discriminative Feature Representation Based on Cascaded Attention Network with Adversarial Joint Loss for Speech Emotion Recognition.
CN111191035B (en) Method and device for recognizing lung cancer clinical database text entity
Gupta et al. Sign Language Converter Using Hand Gestures
CN111639537A (en) Face action unit identification method and device, electronic equipment and storage medium
CN113420783B (en) Intelligent man-machine interaction method and device based on image-text matching
CN111292741B (en) Intelligent voice interaction robot
CN114881668A (en) Multi-mode-based deception detection method
Monica et al. Recognition of medicine using cnn for visually impaired
CN114373212A (en) Face recognition model construction method, face recognition method and related equipment
Shane et al. Sign Language Detection Using Faster RCNN Resnet
CN112183213A (en) Facial expression recognition method based on Intra-Class Gap GAN
CN111554269A (en) Voice number taking method, system and storage medium
Pao et al. Audio-visual speech recognition with weighted KNN-based classification in mandarin database
Gedaragoda et al. “Hand Model”–A Static Sinhala Sign Language Translation Using Media-Pipe and SVM Compared with Hybrid Model of KNN, SVM and Random Forest Algorithms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant