CN115349142A - Image processing apparatus, image processing method, and computer-readable storage medium - Google Patents
- Publication number
- CN115349142A (application number CN202180023365.4A)
- Authority
- CN
- China
- Prior art keywords
- convolutional
- neural network
- point
- image processing
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
Abstract
The present disclosure relates to an image processing apparatus, an image processing method, and a computer-readable storage medium. An image processing apparatus according to the present disclosure includes a processing circuit configured to: divide a plurality of continuously input images into a plurality of image blocks; extract spatiotemporal features of each image block using a convolutional neural network model, where the convolutional neural network model includes a separable convolutional network and a point-by-point convolutional network, or includes a separable convolutional network and a hole convolutional network; and determine gestures included in the plurality of images according to the spatiotemporal features of the respective image blocks using a recurrent neural network model. With the image processing apparatus, the image processing method, and the computer-readable storage medium according to the present disclosure, dynamic gestures can be recognized quickly and accurately.
Description
The present application claims priority to Chinese patent application No. 202010407312.X, entitled "image processing apparatus, image processing method, and computer readable storage medium", filed with the Chinese Patent Office on May 14, 2020, which is incorporated herein by reference in its entirety.
Embodiments of the present disclosure generally relate to the field of image processing, and in particular, to an image processing apparatus, an image processing method, and a computer-readable storage medium. More particularly, embodiments of the present disclosure relate to an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing a gesture included in a plurality of images continuously input.
Dynamic gesture recognition refers to a technique for recognizing a dynamic gesture sequence composed of a plurality of continuously input image frames. Owing to the flexibility and convenience of gestures, dynamic gesture recognition has broad application prospects in environments such as human-computer interaction and AR (Augmented Reality)/VR (Virtual Reality).
Online dynamic gesture recognition is a technique for segmenting and recognizing a plurality of dynamic gestures in succession. Compared with offline dynamic gesture recognition, online dynamic gesture recognition faces two main challenges: distinguishing the starting frame and the ending frame of a gesture, and recognizing the gesture itself. In online dynamic gesture recognition, different gestures can be distinguished by selecting one or several key frames for each type of gesture, but this introduces considerable uncertainty because the key frames must be selected manually. Furthermore, when there are many kinds of gestures, it is difficult to select a suitable key frame for each type of gesture. Alternatively, adjacent image frames can be modeled with a hidden Markov model to distinguish different gestures. However, hidden Markov models have relatively weak expressive power, and therefore only a few types of gestures can be recognized.
Therefore, a technical solution is needed that can recognize dynamic gestures quickly and accurately.
Disclosure of Invention
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
An object of the present disclosure is to provide an image processing apparatus, an image processing method, and a computer-readable storage medium to quickly and accurately recognize a dynamic gesture.
According to an aspect of the present disclosure, there is provided an image processing apparatus comprising a processing circuit configured to: divide a plurality of continuously input images into a plurality of image blocks; extract spatiotemporal features of each image block using a convolutional neural network model, wherein the convolutional neural network model comprises a separable convolution (Separable Convolution) network and a point-by-point convolution (Pointwise Convolution) network, or comprises a separable convolution network and a hole convolution (Dilated Convolution) network; and determine gestures included in the plurality of images according to the spatiotemporal features of the respective image blocks using a Recurrent Neural Network (RNN) model.
According to another aspect of the present disclosure, there is provided an image processing method including: dividing a plurality of continuously input images into a plurality of image blocks; extracting spatiotemporal features of each image block using a convolutional neural network model, wherein the convolutional neural network model comprises a separable convolutional network and a point-by-point convolutional network, or comprises a separable convolutional network and a hole convolutional network; and determining gestures included in the plurality of images according to the spatiotemporal features of the respective image blocks using a recurrent neural network model.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium comprising executable computer instructions that, when executed by a computer, cause the computer to perform an image processing method according to the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program which, when executed by a computer, causes the computer to perform the image processing method according to the present disclosure.
With the image processing apparatus, the image processing method, and the computer-readable storage medium according to the present disclosure, spatiotemporal features of an image block may be extracted using a convolutional neural network that includes a separable convolutional network and a point-by-point convolutional network, or a separable convolutional network and a hole convolutional network, so that a gesture may be recognized by the recurrent neural network from the extracted spatiotemporal features. Owing to the use of the separable convolutional network together with the point-by-point or hole convolutional network, the amount of computation required for gesture recognition can be reduced, and dynamic gestures can be recognized quickly and accurately.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram showing a gesture included in a plurality of images in succession;
FIG. 2 is a block diagram showing an example of a configuration of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a process of extracting keypoints in an image according to an embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 7 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 9 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 10 is a block diagram illustrating an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
FIG. 11 is a diagram showing the structure of a recurrent neural network model;
FIG. 12 is a schematic diagram illustrating the structure of a recurrent neural network model according to an embodiment of the present disclosure;
FIG. 13 is a schematic diagram showing a structure of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 14 is a schematic diagram showing a structure of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 15 is a flowchart illustrating an image processing method according to an embodiment of the present disclosure; and
FIG. 16 is a block diagram illustrating an example of an electronic device that can implement an image processing apparatus according to the present disclosure.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. It is noted that throughout the several views, corresponding reference numerals indicate corresponding parts.
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
Example embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to those skilled in the art. Numerous specific details are set forth such as examples of specific components, devices, and methods to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that specific details need not be employed, that example embodiments may be embodied in many different forms and that neither should be construed to limit the scope of the disclosure. In certain example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
The description will be made in the following order:
1. a configuration example of an image processing apparatus;
2. an example of an image processing method;
3. application examples.
<1. Configuration example of image processing apparatus >
Fig. 1 is a schematic diagram showing a gesture included in a plurality of images in succession. As shown in FIG. 1, the upper diagram shows an example including a "double-click" gesture in the plurality of images, and the lower diagram shows an example including a "pinch" gesture in the plurality of images.
As described above, as the number of gesture types increases, it is difficult for existing gesture recognition techniques to recognize the various types of gestures quickly and accurately. Accordingly, it is desirable to provide an image processing apparatus, an image processing method, and a computer-readable storage medium capable of quickly and accurately recognizing various dynamic gestures.
Fig. 2 is a block diagram illustrating an example of the configuration of an image processing apparatus 200 according to an embodiment of the present disclosure. Here, the image processing apparatus 200 may recognize a gesture included in a plurality of continuously input images. The plurality of continuously input images may be, for example, a video, a moving image, or a group of still images input in rapid succession. Specifically, the image processing apparatus 200 may recognize a dynamic gesture in real time, that is, it may recognize the dynamic gesture online.
As shown in fig. 2, the image processing apparatus 200 may include a preprocessing unit 210, an extraction unit 220, and a determination unit 230.
Here, each unit of the image processing apparatus 200 may be included in the processing circuit. The image processing apparatus 200 may include one processing circuit or a plurality of processing circuits. Further, the processing circuitry may include various discrete functional units to perform various different functions and/or operations. It should be noted that these functional units may be physical entities or logical entities, and that units with different names may be implemented by the same physical entity.
According to an embodiment of the present disclosure, the preprocessing unit 210 may divide a plurality of images continuously input into a plurality of image blocks.
According to an embodiment of the present disclosure, the extraction unit 220 may extract spatiotemporal features of each image block using a convolutional neural network model. According to embodiments of the present disclosure, the convolutional neural network model may include a separable convolutional network and a point-by-point convolutional network. Alternatively, the convolutional neural network model may also include a separable convolutional network and a hole convolutional network.
According to an embodiment of the present disclosure, the determination unit 230 may determine the gestures included in the plurality of images according to spatiotemporal features of the respective image blocks using a recurrent neural network model.
As described above, according to the image processing apparatus 200 of the embodiment of the present disclosure, the spatiotemporal features of the image blocks may be extracted using a convolutional neural network model including a separable convolutional network and a point-by-point convolutional network, or including a separable convolutional network and a hole convolutional network, so that the gesture may be recognized by a recurrent neural network from the extracted spatiotemporal features. Owing to the use of the separable convolutional network together with the point-by-point or hole convolutional network, the amount of computation required for gesture recognition can be reduced, and dynamic gestures can be recognized quickly and accurately.
In the present disclosure, separable convolution is also referred to as depthwise separable convolution, which reduces the number of parameters required for the convolution calculation by decoupling the spatial dimension from the channel (depth) dimension. The calculation of a depthwise separable convolution is divided into two parts: first, a spatial convolution is performed on each channel (depth) separately and the outputs are concatenated; then, a unit (1×1) convolution kernel is applied across the channels to obtain the feature map.
In the present disclosure, point-by-point convolution uses a 1×1 convolution kernel, that is, a convolution kernel that traverses every point. The depth of the convolution kernel equals the number of channels of the image input to the point-by-point convolutional network.
In the present disclosure, hole convolution, also referred to as dilated (atrous) convolution, injects holes into the convolution kernel. Hole convolution has a parameter called the hole rate; a hole rate of d means that d-1 zeros are inserted between adjacent elements of the convolution kernel. Different hole rates yield different receptive fields. Thus, hole convolution can enlarge the receptive field and capture multi-scale context information.
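As a rough illustration of these three building blocks, the following sketch expresses them in PyTorch; the framework, the channel counts, and the kernel sizes are assumptions made only for illustration, since the disclosure does not specify an implementation.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable convolution: a per-channel spatial convolution
    (groups = in_ch) followed by a unit (1x1) convolution across channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def pointwise_conv(in_ch, out_ch):
    # Point-by-point convolution: a 1x1 kernel whose depth equals the number of input channels.
    return nn.Conv2d(in_ch, out_ch, kernel_size=1)

def hole_conv(in_ch, out_ch, kernel_size=3, hole_rate=2):
    # Hole (dilated) convolution: a hole rate of d inserts d-1 zeros between kernel
    # elements, enlarging the receptive field without adding parameters.
    return nn.Conv2d(in_ch, out_ch, kernel_size, dilation=hole_rate,
                     padding=hole_rate * (kernel_size // 2))

x = torch.randn(1, 16, 14, 8)            # example feature map: (batch, channels, H, W)
print(SeparableConv2d(16, 32)(x).shape)  # torch.Size([1, 32, 14, 8])
print(pointwise_conv(16, 32)(x).shape)   # torch.Size([1, 32, 14, 8])
print(hole_conv(16, 32)(x).shape)        # torch.Size([1, 32, 14, 8])
```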
According to an embodiment of the present disclosure, the input of the image processing apparatus is a plurality of images (or a plurality of frames of images) including a gesture. According to an embodiment of the present disclosure, the image may be any one of an RGB image and a depth image.
According to an embodiment of the present disclosure, the preprocessing unit 210 may divide a plurality of images input to the image processing apparatus 200 into a plurality of image blocks. Specifically, the preprocessing unit 210 may divide M continuously input images among the plurality of images input to the image processing apparatus 200 into one image block, M being an integer of 2 or more. That is, the preprocessing unit 210 may divide the plurality of images input to the image processing apparatus into a plurality of image blocks in units of M images. Here, each image block including M images can be regarded as one spatio-temporal unit. Preferably, M may be 4, 8, 16, 32, and so on. For example, when M is 8, the preprocessing unit 210 may divide 8 continuously input images among the plurality of images input to the image processing apparatus 200 into one image block, starting from an arbitrary position. For example, the preprocessing unit 210 may divide the 1st to 8th images among the plurality of images input to the image processing apparatus 200 into the 1st image block, the 9th to 16th images into the 2nd image block, and so on.
According to an embodiment of the present disclosure, the preprocessing unit 210 may also determine a feature of each of the divided image blocks, and may input the feature of each image block to the extraction unit 220.
According to an embodiment of the present disclosure, the preprocessing unit 210 may extract features of a plurality of key points of each of a plurality of images input to the image processing apparatus 200. Further, the preprocessing unit 210 may take the feature of each key point of each of the M images included in the image block as the feature of the image block.
Here, in the case of recognizing a gesture, the key point may be, for example, a joint point of a hand that makes the gesture. The present disclosure does not limit the number of key points included in each image. For example, the preprocessing unit 210 may extract features of X key points of each image, where X is an integer greater than or equal to 2. For example, in the case of X = 14, the preprocessing unit 210 may take, as the feature of the image block, the features of the 14 key points of each of the M images included in the image block. The image block then has a total of 14 × M key points.
Fig. 3 is a schematic diagram illustrating a process of extracting key points in an image according to an embodiment of the present disclosure. Fig. 3 shows three images among the images input to the image processing apparatus 200 in the upper diagram, and shows a process of performing the keypoint extraction on the three images in the lower diagram. As shown in fig. 3, 14 keypoints are extracted for each image.
According to an embodiment of the present disclosure, the feature of each keypoint may comprise a feature of multiple dimensions. Further, the feature of each keypoint may be a spatial feature of that keypoint. For example, the features of each keypoint may include the Y spatial features of that keypoint. Y is, for example, 3. That is, the feature of each keypoint may include three coordinate features of that keypoint in three-dimensional space.
As described above, according to an embodiment of the present disclosure, one image block includes M images, each image includes X keypoints, and each keypoint includes Y spatial features. Each image block may therefore comprise M × X × Y features. The preprocessing unit 210 may input the M × X × Y features included in each image block as the features of that image block to the convolutional neural network model in the extraction unit 220. Further, the preprocessing unit 210 may input the features of the respective image blocks to the extraction unit 220 sequentially in the order of the image blocks. That is, the features of a temporally preceding image block are input to the extraction unit 220 earlier than those of a temporally succeeding image block.
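The following sketch illustrates this preprocessing step under the example values given above (M = 8 images per block, X = 14 key points per image, Y = 3 spatial features per key point); it assumes the key points have already been extracted and uses NumPy purely for illustration.

```python
import numpy as np

M, X, Y = 8, 14, 3    # images per block, key points per image, spatial features per key point

def split_into_blocks(keypoints_per_frame):
    """keypoints_per_frame: list of (X, Y) arrays, one array of key-point
    coordinates per continuously input image."""
    blocks = []
    for start in range(0, len(keypoints_per_frame) - M + 1, M):
        block = np.stack(keypoints_per_frame[start:start + M])   # shape (M, X, Y)
        blocks.append(block)
    return blocks

frames = [np.random.rand(X, Y) for _ in range(24)]   # stand-in for extracted key points
blocks = split_into_blocks(frames)
print(len(blocks), blocks[0].shape)                  # 3 (8, 14, 3): 3 blocks of M x X x Y features
```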
According to an embodiment of the present disclosure, the extraction unit 220 may extract spatiotemporal features of each image block using a convolutional neural network model. The convolutional neural network model may include a separable convolutional network and a point-by-point convolutional network, or may include a separable convolutional network and a hole convolutional network.
According to an embodiment of the present disclosure, the convolutional neural network model in the extraction unit 220 may also include a fully connected network. Each node of the fully connected network is connected to all nodes of the previous network for integrating the extracted features of the previous network.
Fig. 4 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 4, the convolutional neural network model may include a separable convolutional network, a point-by-point convolutional network, or a hole convolutional network, and a fully-connected network.
According to an embodiment of the present disclosure, the convolutional neural network model may include N separable convolutional networks, N point-by-point convolutional networks or hole convolutional networks, and N fully-connected networks, where N is a positive integer. That is, the numbers of separable convolutional networks, point-by-point convolutional networks or hole convolutional networks, and fully-connected networks included in the convolutional neural network model are the same. In other words, from input to output the convolutional neural network model sequentially includes N groups, and each group, from input to output, sequentially includes a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, and a fully-connected network.
For convenience of illustration, the separable convolutional network may be labeled as A, the point-by-point convolutional network or the hole convolutional network as B, and the fully connected network as C; the order of the convolutional neural network model in the extraction unit 220 from input to output may then be A, B, C, or A, B, C, A, B, C, and so on.
Fig. 4 shows the case where N = 1, i.e., the convolutional neural network model comprises one group consisting of a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, and a fully connected network.
Fig. 5 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 5, the convolutional neural network model may include a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, a fully connected network, a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, and a fully connected network. That is, fig. 5 shows the case of N = 2, i.e., the convolutional neural network model includes two groups, each consisting of a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, and a fully connected network. The case of N greater than 2 is similar and is not repeated here.
According to an embodiment of the present disclosure, the convolutional neural network model in the extraction unit 220 may include a plurality of separable convolutional networks, one or more point-by-point convolutional networks or hole convolutional networks, and one fully-connected network.
According to embodiments of the present disclosure, the convolutional neural network model may include a plurality of separable convolutional networks, one or more point-by-point convolutional networks or hole convolutional networks, and one fully-connected network, wherein the number of separable convolutional networks is one more than the number of point-by-point convolutional networks or hole convolutional networks. For example, if the number of point-by-point convolutional networks or hole convolutional networks is V, where V is a positive integer, the number of separable convolutional networks is V + 1. From input to output, the convolutional neural network model may sequentially comprise V groups (each consisting of a separable convolutional network followed by a point-by-point convolutional network or a hole convolutional network), then a separable convolutional network, and then a fully-connected network. That is, in the structure before the fully connected network, the model starts with a separable convolutional network and ends with a separable convolutional network, with the point-by-point convolutional networks or hole convolutional networks placed in between.
For convenience of illustration, the separable convolutional network may be labeled as A, the point-by-point convolutional network or the hole convolutional network as B, and the fully connected network as C; the order of the convolutional neural network model in the extraction unit 220 from input to output may then be A, B, A, C, or A, B, A, B, …, A, B, A, C.
Fig. 6 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 6, the convolutional neural network model in the extraction unit 220 may include a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, a separable convolutional network, and a fully-connected network. That is, fig. 6 shows the case of V = 1. The case where V is greater than 1 is similar and is not repeated here.
According to embodiments of the present disclosure, the convolutional neural network model may include a plurality of separable convolutional networks, a plurality of point-by-point convolutional networks or hole convolutional networks, and one fully-connected network. The number of separable convolutional networks equals the number of point-by-point convolutional networks or hole convolutional networks, for example Z, where Z is an integer greater than or equal to 2. From input to output, the convolutional neural network model may sequentially comprise Z groups of a separable convolutional network and a point-by-point convolutional network or a hole convolutional network, followed by one fully connected network. Further, each of the Z groups, from input to output, may sequentially include a separable convolutional network and a point-by-point convolutional network or a hole convolutional network. That is, in the structure before the fully connected network, the model starts with a separable convolutional network, ends with a point-by-point convolutional network or a hole convolutional network, and alternates between the separable convolutional networks and the point-by-point or hole convolutional networks.
For convenience of illustration, the separable convolutional network may be labeled as A, the point-by-point convolutional network or the hole convolutional network as B, and the fully-connected network as C; the order of the convolutional neural network model in the extraction unit 220 from input to output may then be A, B, C, or A, B, …, A, B, C.
Fig. 7 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 7, the convolutional neural network model in the extraction unit 220 may include a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, a separable convolutional network, a point-by-point convolutional network or a hole convolutional network, and a fully-connected network. That is, fig. 7 shows the case of Z = 2. The case where Z is greater than 2 is similar and is not repeated here.
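The orderings described above (with the separable convolutional network labeled A, the point-by-point or hole convolutional network labeled B, and the fully-connected network labeled C) could be assembled, for example, as in the following sketch. PyTorch, the channel sizes, and the treatment of a block's M × X × Y features as a Y-channel 2-D map are all assumptions, and only patterns whose single fully-connected network comes last are handled.

```python
import torch
import torch.nn as nn

def make_model(pattern, ch=3, use_hole=False, out_dim=128):
    """Build a convolutional neural network model from a pattern string such as
    "ABC" (fig. 4) or "ABAC" (fig. 6). This sketch assumes the only
    fully-connected network ("C") is the last element of the pattern."""
    layers = []
    for sym in pattern:
        if sym == "A":      # separable convolutional network: depthwise + 1x1 channel mixing
            layers += [nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
                       nn.Conv2d(ch, ch, 1)]
        elif sym == "B":    # point-by-point convolutional network or hole convolutional network
            layers.append(nn.Conv2d(ch, ch, 3, padding=2, dilation=2) if use_hole
                          else nn.Conv2d(ch, ch, 1))
        elif sym == "C":    # fully-connected network
            layers += [nn.Flatten(), nn.LazyLinear(out_dim)]
    return nn.Sequential(*layers)

model = make_model("ABAC")               # the V = 1 case of fig. 6
x = torch.randn(1, 3, 8, 14)             # one image block: Y coords as channels, M frames, X key points (assumption)
print(model(x).shape)                    # torch.Size([1, 128])
```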
The structure of the convolutional neural network model in the extraction unit 220 is described above in an exemplary manner. Several specific examples of convolutional neural network models according to embodiments of the present disclosure will be described below.
According to an embodiment of the present disclosure, the step size of the separable convolutional network in the convolutional neural network model may be 1, and the point-by-point convolutional network (rather than the hole convolutional network) may be selected in the convolutional neural network model.
Fig. 8 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 8, the convolutional neural network model may include a separable convolutional network with a step size of 1, a point-by-point convolutional network, and a fully-connected network. Here, M × N denotes the size of a convolution kernel in the separable convolutional network, and P denotes the number of convolution kernels in the separable convolutional network. Preferably, M = N = 3. S × T denotes the size of the convolution kernel in the point-by-point convolutional network, and Q denotes the number of convolution kernels in the point-by-point convolutional network. Preferably, S = T = 1.
According to an embodiment of the present disclosure, in fig. 8, since the step size of the separable convolutional network is 1, local spatiotemporal information of an image block may be extracted. Here, the spatiotemporal information may include temporal information and spatial information. Since the features of the image block include spatial features of the respective key points, the extraction unit 220 may extract the spatial features of the image block. Since each image block includes a plurality of images that are temporally continuous, the extraction unit 220 may extract temporal features of the image block.
It is noted that, for ease of illustration, fig. 8 illustrates an example of a convolutional neural network model including a separable convolutional network, a point-by-point convolutional network, and a fully-connected network. However, fig. 8 can be arbitrarily modified according to the structure of the convolutional neural network model described in the foregoing.
According to an embodiment of the present disclosure, the step size of the separable convolutional network in the convolutional neural network model may be greater than 1, and the point-by-point convolutional network (rather than the hole convolutional network) may be selected in the convolutional neural network model.
Fig. 9 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 9, the convolutional neural network model may include a separable convolutional network with a step size greater than 1, a point-by-point convolutional network, and a fully-connected network. Here, M × N denotes the size of the convolution kernel in the separable convolutional network, and P denotes the number of convolution kernels in the separable convolutional network. Preferably, M = N = 3. S × T denotes the size of the convolution kernel in the point-by-point convolutional network, and Q denotes the number of convolution kernels in the point-by-point convolutional network. Preferably, S = T = 1.
According to an embodiment of the present disclosure, in fig. 9, since the step size of the separable convolutional network is greater than 1, medium-range spatiotemporal information of the image block may be extracted. The medium-range spatiotemporal information is spatiotemporal information intermediate between the local spatiotemporal information and the global spatiotemporal information, depending on the size of the step. Similarly, the spatiotemporal information may include temporal information and spatial information. Since the features of the image block include the spatial features of the respective key points, the extraction unit 220 may extract the spatial features of the image block. Since each image block includes a plurality of temporally consecutive images, the extraction unit 220 may extract the temporal features of the image block.
It is noted that, for ease of illustration, fig. 9 illustrates an example of a convolutional neural network model including a separable convolutional network, a point-by-point convolutional network, and a fully-connected network. However, fig. 9 can be arbitrarily modified according to the structure of the convolutional neural network model described in the foregoing.
According to an embodiment of the present disclosure, the step size of the separable convolutional network in the convolutional neural network model may be 1, and the hole convolutional network (rather than the point-by-point convolutional network) may be selected in the convolutional neural network model.
Fig. 10 is a block diagram illustrating an example of a structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in fig. 10, the convolutional neural network model may include a separable convolutional network with a step size of 1, a hole convolutional network, and a fully-connected network. Here, M × N denotes the size of a convolution kernel in the separable convolutional network, and P denotes the number of convolution kernels in the separable convolutional network. Preferably, M = N = 3. S × T denotes the size of the convolution kernel in the hole convolutional network, and Q denotes the number of convolution kernels in the hole convolutional network. Preferably, S = 5 and T = 3.
According to the embodiment of the present disclosure, in fig. 10, since the hole convolution network has a large receptive field, the global spatio-temporal information of the image block may be extracted. Similarly, the spatiotemporal information may include temporal information and spatial information. Since the features of the image block include spatial features of the respective key points, the extraction unit 220 may extract the spatial features of the image block. Since each image block includes a plurality of images that are temporally continuous, the extraction unit 220 may extract temporal features of the image block.
It is noted that, for ease of illustration, fig. 10 illustrates an example of a convolutional neural network model including a separable convolutional network, a hole convolutional network, and a fully connected network. However, fig. 10 can be arbitrarily modified according to the structure of the convolutional neural network model described in the foregoing.
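As a rough sketch of the three variants of figs. 8 to 10, the branches below use the preferred kernel sizes mentioned above (3×3 separable kernels, 1×1 point-by-point kernels, and a 5×3 hole-convolution kernel); the strides follow the text, while the channel counts, the hole rate, and the 128-dimensional fully-connected output are assumptions.

```python
import torch
import torch.nn as nn

def separable(ch, stride=1):
    # separable convolutional network: per-channel 3x3 convolution followed by 1x1 channel mixing
    return nn.Sequential(nn.Conv2d(ch, ch, 3, stride=stride, padding=1, groups=ch),
                         nn.Conv2d(ch, ch, 1))

local_branch = nn.Sequential(            # fig. 8: stride-1 separable + point-by-point -> local features
    separable(3, stride=1), nn.Conv2d(3, 16, 1), nn.Flatten(), nn.LazyLinear(128))

medium_branch = nn.Sequential(           # fig. 9: stride>1 separable + point-by-point -> medium-range features
    separable(3, stride=2), nn.Conv2d(3, 16, 1), nn.Flatten(), nn.LazyLinear(128))

global_branch = nn.Sequential(           # fig. 10: stride-1 separable + 5x3 hole convolution -> global features
    separable(3, stride=1), nn.Conv2d(3, 16, (5, 3), padding=(4, 2), dilation=2),
    nn.Flatten(), nn.LazyLinear(128))

x = torch.randn(1, 3, 8, 14)             # one image block laid out as a 2-D map (assumption)
print(local_branch(x).shape, medium_branch(x).shape, global_branch(x).shape)
```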
Various examples of convolutional neural network models in the extraction unit 220 according to embodiments of the present disclosure are described above. The above examples are merely exemplary, and the present disclosure is not limited to these structures. Determining unit 230 according to an embodiment of the present disclosure will be described below.
According to an embodiment of the present disclosure, the determination unit 230 may determine the gestures included in the plurality of images according to the spatiotemporal features of the respective image blocks output by the extraction unit 220 using a recurrent neural network model. Specifically, the determining unit 230 may determine (model) a temporal relationship between the respective image blocks according to spatiotemporal features of the respective image blocks output by the extracting unit 220, thereby outputting a state vector representing a gesture.
Fig. 11 is a schematic diagram showing the structure of the recurrent neural network model. Here, the recurrent neural network model shown in fig. 11 is a currently common recurrent neural network model. As shown in fig. 11, the output o_t of the recurrent neural network model at time t is related to the input x_t at time t and the output h_{t-1} at the previous time t-1. That is, in a recurrent neural network, a neuron receives not only information from other neurons but also information from itself, forming a network structure with a loop; such a network is therefore also referred to as a neural network with short-term memory.
According to an embodiment of the present disclosure, the recurrent neural network model may determine the output information at the current time from the input information at the current time, the proportional information of the output at the previous time, and the integral information of the output at the previous time and/or the differential information of the output at the previous time.
According to the embodiment of the present disclosure, the proportional information of the output at the previous time may be, for example, the output at the previous time itself, or information obtained by scaling the output at the previous time by a certain proportion.
According to an embodiment of the present disclosure, the integration information of the output at the previous time indicates information obtained by integrating the output at the previous time.
According to an embodiment of the present disclosure, the differential information of the output at the previous time indicates information obtained by performing a differential operation on the output at the previous time. For example, the differential information of the output at the previous time may include 1 st order to K th order differential information of the output at the previous time, that is, information obtained by performing 1 st order to K th order differential operation on the output at the previous time. Wherein K is an integer of 2 or more.
Fig. 12 is a schematic diagram illustrating a structure of a recurrent neural network model according to an embodiment of the present disclosure. In fig. 12, x_t denotes the input information at time t; o_t denotes the output information at time t, which is equal to h_t; h_{t-1} denotes the output information at time t-1 and also serves as the proportional information of the output information at time t-1; S_{t-1} denotes the integral information of the output information at time t-1; Δh_{t-1} denotes the 1st-order differential information of the output information at time t-1; and Δ^K h_{t-1} denotes the K-th-order differential information of the output information at time t-1.
According to an embodiment of the present disclosure, the integral information S_{t-1} of the output information at time t-1 may be calculated by accumulating the outputs up to time t-1, i.e., S_{t-1} = h_1 + h_2 + … + h_{t-1}.
According to an embodiment of the present disclosure, the 1st-order differential information of the output information at time t-1 may be calculated as the difference between successive outputs, i.e., Δh_{t-1} = h_{t-1} - h_{t-2}.
According to an embodiment of the present disclosure, the 2nd-order differential information of the output information at time t-1 may be calculated as the difference of the 1st-order differences, i.e., Δ²h_{t-1} = Δh_{t-1} - Δh_{t-2}.
In a similar manner, the K-th-order differential information of the output information at time t-1 can be calculated.
According to an embodiment of the present disclosure, the output information h_t at time t may be calculated according to the following formula:
h_t = σ(W_he · E_t + b_h)
where W_he denotes a state update matrix, σ is an activation function, including but not limited to the ReLU (Rectified Linear Unit) function, and b_h is a bias vector that can be set based on empirical values. E_t denotes the state expression, i.e., the memory of the recurrent neural network at time t, which combines the input information x_t at time t with the proportional information, the integral information, and the differential information of the output at time t-1.
As described above, in fig. 12, the recurrent neural network model can determine the output information at the current time by determining the state at the current time from the input information at the current time, the proportional information of the output at the previous time, and the integral information and differential information of the output at the previous time. Note that although fig. 12 shows an example in which the output information at the current time is determined from all of these quantities, the output information at the current time may also be determined from the input information at the current time, the proportional information of the output at the previous time, and only the integral information of the output at the previous time, or from the input information at the current time, the proportional information of the output at the previous time, and only the differential information of the output at the previous time.
As described above, according to the embodiment of the present disclosure, the recurrent neural network in the determination unit 230 can determine the output at the present time not only from the input information at the present time and the output at the previous time, but also from at least one of the integral information of the output at the previous time and the differential information of the output at the previous time. Here, since the proportional information of the output information focuses on the state of the current image block, the differential information of the output information focuses on the change of the state, and the integral information of the output information focuses on the accumulation of the state, the determination unit 230 according to the embodiment of the present disclosure can more comprehensively acquire the change and trend of the gesture on the time scale, thereby obtaining better recognition accuracy.
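A minimal sketch of a recurrent cell of this kind is shown below. It keeps the proportional, integral, and (first-order only) differential information of past outputs; the exact formula by which these terms are combined into the state E_t is not reproduced in this text, so the weighted sum used here is an assumption, as are the use of PyTorch and the layer sizes.

```python
import torch
import torch.nn as nn

class PIDRecurrentCell(nn.Module):
    """Recurrent cell using proportional (h_{t-1}), integral (S_{t-1}), and
    1st-order differential (h_{t-1} - h_{t-2}) information of past outputs.
    The way these terms are combined into the state E_t is an assumption."""
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.w_x = nn.Linear(in_dim, hidden_dim, bias=False)       # input term
        self.w_p = nn.Linear(hidden_dim, hidden_dim, bias=False)   # proportional term
        self.w_i = nn.Linear(hidden_dim, hidden_dim, bias=False)   # integral term
        self.w_d = nn.Linear(hidden_dim, hidden_dim, bias=False)   # differential term
        self.w_he = nn.Linear(hidden_dim, hidden_dim)              # state update matrix W_he and bias b_h
        self.hidden_dim = hidden_dim

    def forward(self, x_seq):                       # x_seq: (batch, num_blocks, feature_dim)
        h_prev = h_prev2 = s = torch.zeros(x_seq.size(0), self.hidden_dim)
        for t in range(x_seq.size(1)):
            e_t = (self.w_x(x_seq[:, t]) + self.w_p(h_prev)
                   + self.w_i(s) + self.w_d(h_prev - h_prev2))     # assumed form of E_t
            h_t = torch.relu(self.w_he(e_t))                       # h_t = sigma(W_he * E_t + b_h)
            s = s + h_t                                            # integral: running sum of outputs
            h_prev2, h_prev = h_prev, h_t
        return h_t                                                 # final 128-dimensional state vector

features = torch.randn(2, 5, 64)         # spatiotemporal features of 5 image blocks, 2 sequences
state = PIDRecurrentCell(64)(features)
print(state.shape)                       # torch.Size([2, 128])
```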
According to an embodiment of the present disclosure, the extracting unit 220 may obtain spatiotemporal features of each image block, and since the gesture may include a plurality of image blocks, the determining unit 230 may model a temporal relationship between different image blocks, so that the gesture may be accurately and rapidly recognized.
According to an embodiment of the present disclosure, as shown in fig. 2, the image processing apparatus 200 may further include a decision unit 240 for determining a final gesture according to an output of the determination unit 230.
According to an embodiment of the present disclosure, the output of the recurrent neural network in the determining unit 230 may be a 128-dimensional state vector, determined from the spatiotemporal features of the respective image blocks, that corresponds to different gestures. The decision unit 240 may include a classifier for mapping the state vector output by the determination unit 230 to a gesture.
According to an embodiment of the present disclosure, the extraction unit 220 may include a convolutional neural network model, and the determination unit 230 may include a recurrent neural network model, so that the decision unit 240 may determine the final gesture according to the output of the recurrent neural network model.
Fig. 13 is a schematic diagram illustrating a structure of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 13, the input of the image processing apparatus 200 sequentially passes through the convolutional neural network model in the extraction unit 220, the recurrent neural network model in the determination unit 230, and the classifier in the decision unit 240, thereby outputting the recognition result of the gesture.
According to an embodiment of the present disclosure, the extraction unit 220 may include a plurality of convolutional neural network models, and the determination unit 230 may include a plurality of recurrent neural network models, so that the decision unit 240 may determine a final gesture according to the output result of each of the plurality of recurrent neural network models. Here, the inputs of the plurality of convolutional neural network models are all the same, i.e., the plurality of images input to the image processing apparatus 200. That is, a state vector of the gesture is determined using each pair of convolutional neural network model and recurrent neural network model, and the classifier in the decision unit 240 may then determine the final gesture. For example, the classifier may average the state vectors output by the recurrent neural network models and then determine the final gesture.
Fig. 14 is a schematic diagram illustrating a structure of an image processing apparatus according to an embodiment of the present disclosure. As shown in fig. 14, the image processing apparatus 200 includes R convolutional neural network models, R recurrent neural network models, and a classifier, where R is an integer of 2 or more. Specifically, the input plurality of images are input to convolutional neural network model 1 and recurrent neural network model 1 to obtain a 1st set of 128-dimensional state vectors, the input plurality of images are input to convolutional neural network model 2 and recurrent neural network model 2 to obtain a 2nd set of 128-dimensional state vectors, …, and the input plurality of images are input to convolutional neural network model R and recurrent neural network model R to obtain an R-th set of 128-dimensional state vectors. The classifier combines the output results of the R recurrent neural network models to obtain the final gesture recognition result.
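The multi-stream arrangement of fig. 14 could be sketched as follows, where each stream stands for one pair of convolutional neural network model and recurrent neural network model; averaging the R state vectors before the classifier is one possible way of combining the outputs, and PyTorch and the dummy streams are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GestureRecognizer(nn.Module):
    """R parallel streams (each a CNN model followed by a recurrent model) whose
    128-dimensional state vectors are averaged and fed to a classifier."""
    def __init__(self, streams, num_gestures, state_dim=128):
        super().__init__()
        self.streams = nn.ModuleList(streams)
        self.classifier = nn.Linear(state_dim, num_gestures)

    def forward(self, blocks):
        states = [stream(blocks) for stream in self.streams]        # R state vectors
        return self.classifier(torch.stack(states).mean(dim=0))     # averaged, then classified

# Stand-in streams just to show the data flow; real streams would be the models described above.
dummy_streams = [nn.Sequential(nn.Flatten(), nn.LazyLinear(128)) for _ in range(3)]
logits = GestureRecognizer(dummy_streams, num_gestures=10)(torch.randn(2, 3, 8, 14))
print(logits.shape)                                                  # torch.Size([2, 10])
```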
As described above, according to embodiments of the present disclosure, gestures may be recognized using a plurality of sets of convolutional neural network models and recurrent neural network models, thereby making the recognized gestures more accurate.
As described above, a convolutional neural network model including a separable convolutional network with a step size of 1 and a point-by-point convolutional network may extract local spatiotemporal information of an image block, a convolutional neural network model including a separable convolutional network with a step size greater than 1 and a point-by-point convolutional network may extract medium-range spatiotemporal information, and a convolutional neural network model including a separable convolutional network with a step size of 1 and a hole convolutional network may extract global spatiotemporal information of an image block. Thus, according to embodiments of the present disclosure, the R convolutional neural network models may include convolutional neural network models capable of extracting spatiotemporal information of different scales. That is, the R convolutional neural network models may include at least two of the above three neural network models.
For example, in the case of R =2, a first convolutional neural network model of the R convolutional neural network models may include a separable convolutional network having a step size of 1 and a point-by-point convolutional network, and a second convolutional neural network model of the R convolutional neural network models may include a separable convolutional network having a step size of greater than 1 and a point-by-point convolutional network. In the case of R =2, a first convolutional neural network model of the R convolutional neural network models may include a separable convolutional network with a step size of 1 and a point-by-point convolutional network, and a second convolutional neural network model of the R convolutional neural network models may include a separable convolutional network with a step size of 1 and a hole convolutional network. In the case of R =2, a first convolutional neural network model of the R convolutional neural network models may include a separable convolutional network having a step size greater than 1 and a point-by-point convolutional network, and a second convolutional neural network model of the R convolutional neural network models may include a separable convolutional network having a step size of 1 and a hole convolutional network. In the case of R =3, a first convolutional neural network model of the R convolutional neural network models may include a separable convolutional network having a step size of 1 and a point-by-point convolutional network, a second convolutional neural network model of the R convolutional neural network models may include a separable convolutional network having a step size of greater than 1 and a point-by-point convolutional network, and a third convolutional neural network model of the R convolutional neural network models includes a separable convolutional network having a step size of 1 and a hole convolutional network.
As described above, according to an embodiment of the present disclosure, in the case where the extraction unit 220 includes a plurality of convolutional neural network models, the plurality of convolutional neural network models may extract spatiotemporal information of different scales of image blocks, and thus may simultaneously satisfy the requirement of rapidly and accurately recognizing a gesture.
According to the embodiment of the present disclosure, the training of the image processing apparatus 200 may be divided into two stages. In the first stage, the entire network may be pre-trained with manually labeled gestures and a cross-entropy loss function, training the entire network with only one gesture included in the plurality of images. In the second stage, the pre-trained network may be fine-tuned with expanded gestures (i.e., noise is added to the gestures on the time axis so that the length of the image sequence corresponding to a gesture increases or decreases) and a Connectionist Temporal Classification (CTC) loss function, so that the entire network is trained in the case where the plurality of images includes a plurality of gestures and the length of the image sequence of each gesture increases or decreases. According to the embodiment of the disclosure, after these two training stages, the image processing apparatus 200 can quickly and accurately recognize dynamic gestures.
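The two training stages described above might be organized as in the following sketch; the optimizer, data loaders, tensor shapes, and the use of PyTorch's built-in CTC loss are all assumptions, and the temporal expansion of gestures is only indicated by name.

```python
import torch
import torch.nn as nn

def train_two_stages(model, pretrain_loader, finetune_loader, blank_id=0):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stage 1: pre-train with manually labeled single-gesture samples and cross entropy.
    ce = nn.CrossEntropyLoss()
    for blocks, gesture_label in pretrain_loader:
        loss = ce(model(blocks), gesture_label)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Stage 2: fine-tune on temporally expanded multi-gesture sequences with the CTC loss.
    ctc = nn.CTCLoss(blank=blank_id)
    for blocks, targets, input_lengths, target_lengths in finetune_loader:
        # per-step class log-probabilities, assumed to have shape (T, batch, num_classes)
        log_probs = model(blocks).log_softmax(dim=-1)
        loss = ctc(log_probs, targets, input_lengths, target_lengths)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```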
As described above, according to the image processing apparatus 200 of the embodiment of the present disclosure, a plurality of input images may be divided into a plurality of image blocks, and the spatiotemporal features of the image blocks may be extracted using a separable convolutional network together with a point-by-point convolutional network or a hole convolutional network, thereby greatly reducing the amount of computation in the gesture recognition process. Further, in the case where the image processing apparatus 200 includes a plurality of convolutional neural network models, spatiotemporal features of different scales of the image blocks may be extracted, thereby ensuring both accuracy and speed of recognition. In addition, the spatiotemporal features of the respective image blocks are processed by a recurrent neural network that takes into account the proportional information, integral information, and/or differential information of the accumulated outputs, thereby making the recognition result more accurate. In summary, the image processing apparatus 200 according to the embodiment of the present disclosure can quickly and accurately recognize dynamic gestures.
<2. Example of image processing method>
Next, an image processing method performed by the image processing apparatus 200 according to an embodiment of the present disclosure will be described in detail.
Fig. 15 is a flowchart illustrating an image processing method performed by the image processing apparatus 200 according to an embodiment of the present disclosure.
As shown in fig. 15, in step S1510, a plurality of continuously input images are divided into a plurality of image blocks.
Next, in step S1520, spatiotemporal features of each image block are extracted using a convolutional neural network model, which includes a separable convolutional network and a point-by-point convolutional network, or includes a separable convolutional network and a hole convolutional network.
Next, in step S1530, gestures included in the plurality of images are determined from spatiotemporal features of the respective image blocks using a recurrent neural network model.
Preferably, dividing the plurality of continuously input images into the plurality of image blocks includes: dividing M continuously input images into one image block, M being an integer greater than or equal to 2, and extracting the spatiotemporal features of each image block using the convolutional neural network model includes: inputting the features of each key point of each of the M images into the convolutional neural network model as the features of the image block.
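As an illustration of this grouping (the 21 hand key points with (x, y) coordinates and the helper name `make_blocks` are assumptions of the sketch, not values fixed by the method), consecutive frames can simply be grouped and their per-frame key-point features stacked as the block's input:

```python
import numpy as np

def make_blocks(keypoints: np.ndarray, M: int = 4) -> np.ndarray:
    # keypoints: (num_frames, num_keypoints, feature_dim)
    # returns:   (num_blocks, M, num_keypoints, feature_dim)
    num_blocks = keypoints.shape[0] // M
    return keypoints[:num_blocks * M].reshape(num_blocks, M, *keypoints.shape[1:])

frames = np.random.rand(32, 21, 2)    # 32 frames, 21 hand key points, (x, y) per key point
blocks = make_blocks(frames, M=4)
print(blocks.shape)                    # (8, 4, 21, 2): 8 image blocks of M = 4 frames each
```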
Preferably, the convolutional neural network model further comprises a fully connected network.
Preferably, the convolutional neural network model comprises: a plurality of separable convolutional networks, one or more point-by-point convolutional networks or hole convolutional networks, and a fully connected network; or N separable convolutional networks, N point-by-point convolutional networks or hole convolutional networks, and N fully connected networks, wherein N is a positive integer.
Preferably, the image processing method further includes: determining a gesture included in the plurality of images using a plurality of convolutional neural network models and a plurality of recurrent neural network models, respectively; and determining a final gesture according to the output result of each recurrent neural network model.
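One straightforward way to combine the branch outputs, sketched below under the assumption that each recurrent model emits a per-gesture score vector (the averaging rule is an illustrative choice, not the only decision strategy the method covers), is to average the scores and take the highest-scoring class as the final gesture:

```python
import torch

branch_outputs = [torch.softmax(torch.randn(1, 10), dim=-1) for _ in range(3)]  # 3 models, 10 gestures
final_scores = torch.stack(branch_outputs).mean(dim=0)    # average the per-model scores
final_gesture = final_scores.argmax(dim=-1)               # index of the recognized gesture
print(final_gesture.item())
```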
Preferably, a first convolutional neural network model of the plurality of convolutional neural network models includes a separable convolutional network with a step size of 1 and a point-by-point convolutional network, a second convolutional neural network model of the plurality of convolutional neural network models includes a separable convolutional network with a step size greater than 1 and a point-by-point convolutional network, and a third convolutional neural network model of the plurality of convolutional neural network models includes a separable convolutional network with a step size of 1 and a hole convolutional network.
Preferably, determining the gesture included in the plurality of images using the recurrent neural network model comprises: the output information at the present time is determined based on the input information at the present time, the proportional information of the output at the previous time, and the integral information of the output at the previous time and/or the differential information of the output at the previous time.
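A hedged sketch of such a recurrent update follows (PyTorch assumed; the linear gain layers, the tanh nonlinearity, and the class name `PIDRecurrentCell` are illustrative assumptions, not the patent's exact recurrence): the current output is computed from the current input together with a proportional term (the previous output), an integral term (the accumulated previous outputs), and a differential term (the change between the two most recent outputs).

```python
import torch
import torch.nn as nn

class PIDRecurrentCell(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.in_proj = nn.Linear(input_dim, hidden_dim)
        self.kp = nn.Linear(hidden_dim, hidden_dim)   # proportional: previous output
        self.ki = nn.Linear(hidden_dim, hidden_dim)   # integral: accumulated outputs
        self.kd = nn.Linear(hidden_dim, hidden_dim)   # differential: output change

    def forward(self, x, prev_out, integral, prev_prev_out):
        diff = prev_out - prev_prev_out
        out = torch.tanh(self.in_proj(x) + self.kp(prev_out)
                         + self.ki(integral) + self.kd(diff))
        return out, integral + out                    # new output and updated integral

cell = PIDRecurrentCell(input_dim=32, hidden_dim=64)
h = hp = integ = torch.zeros(1, 64)
for x in torch.randn(8, 1, 32):                       # one feature vector per image block
    h_new, integ = cell(x, h, integ, hp)
    hp, h = h, h_new
```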
According to an embodiment of the present disclosure, the entity performing the above method may be the image processing apparatus 200 according to an embodiment of the present disclosure, and thus all of the embodiments described above with respect to the image processing apparatus 200 are applicable thereto.
<3. Application example>
The present disclosure may be applied to various scenarios. For example, the image processing apparatus 200 of the present disclosure may be used for gesture recognition, and in particular for recognizing online dynamic gestures. Furthermore, although the present disclosure is described using online dynamic gesture recognition as an example, the present disclosure is not limited thereto and may be applied to other scenarios involving the processing of time-series signals.
Fig. 16 is a block diagram illustrating an example of an electronic device 1600 that may implement the image processing apparatus 200 according to the present disclosure. The electronic device 1600 may be, for example, a user device, which may be implemented as a mobile terminal such as a smartphone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle-type mobile router, or a digital camera, or as an in-vehicle terminal.
The electronic device 1600 includes a processor 1601, memory 1602, a storage device 1603, a network interface 1604, and a bus 1606.
The processor 1601 may be, for example, a Central Processing Unit (CPU) or a Digital Signal Processor (DSP), and controls the functions of the electronic device 1600. The memory 1602 includes a Random Access Memory (RAM) and a Read Only Memory (ROM), and stores data and programs executed by the processor 1601. The storage device 1603 may include a storage medium such as a semiconductor memory and a hard disk.
The network interface 1604 is a wired communication interface for connecting the electronic device 1600 to a wired communication network 1605. The wired communication network 1605 may be a core network such as an Evolved Packet Core (EPC) or a Packet Data Network (PDN) such as the internet.
The bus 1606 connects the processor 1601, the memory 1602, the storage device 1603, and the network interface 1604 to each other. The bus 1606 may include two or more buses each having a different speed (such as a high-speed bus and a low-speed bus).
In the electronic device 1600 shown in fig. 16, the preprocessing unit 210, the extraction unit 220, the determining unit 230, and the deciding unit 240 described with reference to fig. 2 may be implemented by the processor 1601. For example, by executing instructions stored in the memory 1602 or the storage device 1603, the processor 1601 may perform the functions of dividing a plurality of continuously input images into a plurality of image blocks, extracting the spatiotemporal features of each image block using a convolutional neural network model, and determining a gesture included in the plurality of images using a recurrent neural network model.
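A compact, hedged sketch of the overall flow such a processor could execute is given below (PyTorch assumed; the stand-in Conv1d/GRU modules, the 42-dimensional key-point features, M = 4 frames per block, and 10 gesture classes are illustrative assumptions; in particular, the GRU is only a placeholder for the recurrent model described above):

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(nn.Conv1d(42, 64, kernel_size=3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1), nn.Flatten())      # per-block feature extractor
rnn = nn.GRU(input_size=64, hidden_size=32)                     # stand-in recurrent model
classifier = nn.Linear(32, 10)                                  # 10 gesture classes

def recognize(frames: torch.Tensor, M: int = 4) -> int:
    """frames: (num_frames, 42) per-frame key-point features from the preprocessing step."""
    usable = frames[: frames.shape[0] // M * M]
    blocks = usable.reshape(-1, M, frames.shape[-1])             # divide into image blocks
    feats = torch.stack([cnn(b.t().unsqueeze(0)).squeeze(0) for b in blocks])
    out, _ = rnn(feats.unsqueeze(1))                             # (num_blocks, 1, 32)
    return classifier(out[-1, 0]).argmax().item()                # recognized gesture index

print(recognize(torch.randn(32, 42)))
```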
The preferred embodiments of the present disclosure are described above with reference to the drawings, but the present disclosure is of course not limited to the above examples. Various changes and modifications within the scope of the appended claims may be made by those skilled in the art, and it should be understood that these changes and modifications naturally will fall within the technical scope of the present disclosure.
For example, the units shown in the functional block diagrams in the figures as dashed boxes each indicate that the functional unit is optional in the corresponding apparatus, and the respective optional functional units may be combined in an appropriate manner to implement the required functions.
For example, a plurality of functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, a plurality of functions implemented by a plurality of units in the above embodiments may be implemented by separate devices, respectively. In addition, one of the above functions may be implemented by a plurality of units. Needless to say, such a configuration is included in the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only the processing performed in time series in the described order but also the processing performed in parallel or individually without necessarily being performed in time series. Further, even in the steps processed in time series, needless to say, the order can be changed as appropriate.
Although the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are merely illustrative of the present disclosure and do not limit the present disclosure. Those skilled in the art may make various modifications and alterations to the above-described embodiments without departing from the spirit and scope of the disclosure. Accordingly, the scope of the disclosure is defined only by the appended claims and their equivalents.
Claims (15)
- An image processing apparatus comprising processing circuitry configured to: divide a plurality of continuously input images into a plurality of image blocks; extract the spatiotemporal features of each image block using a convolutional neural network model, wherein the convolutional neural network model comprises a separable convolutional network and a point-by-point convolutional network, or comprises a separable convolutional network and a hole convolutional network; and determine gestures included in the plurality of images from the spatiotemporal features of the respective image blocks using a recurrent neural network model.
- The image processing apparatus of claim 1, wherein the processing circuitry is further configured to: divide M continuously input images into one image block, wherein M is an integer greater than or equal to 2; and input the features of each key point of each of the M images into the convolutional neural network model as the features of the image block.
- The image processing apparatus of claim 1, wherein the convolutional neural network model further comprises a fully connected network.
- The image processing apparatus according to claim 3, wherein the convolutional neural network model includes: a plurality of separable convolutional networks, one or more point-by-point convolutional networks or hole convolutional networks, and a fully connected network; or N separable convolutional networks, N point-by-point convolutional networks or hole convolutional networks, and N fully connected networks, wherein N is a positive integer.
- The image processing apparatus of claim 1, wherein the processing circuitry is further configured to: determine a gesture included in the plurality of images using a plurality of convolutional neural network models and a plurality of recurrent neural network models, respectively; and determine a final gesture according to the output result of each recurrent neural network model.
- The image processing apparatus of claim 5, wherein a first convolutional neural network model of the plurality of convolutional neural network models comprises a separable convolutional network with a step size of 1 and a point-by-point convolutional network, a second convolutional neural network model of the plurality of convolutional neural network models comprises a separable convolutional network with a step size greater than 1 and a point-by-point convolutional network, and a third convolutional neural network model of the plurality of convolutional neural network models comprises a separable convolutional network with a step size of 1 and a hole convolutional network.
- The image processing apparatus according to claim 1, wherein the recurrent neural network model determines output information at the current time based on input information at the current time, proportional information of the output at a previous time, and integral information of the output at the previous time and/or differential information of the output at the previous time.
- An image processing method, comprising: dividing a plurality of continuously input images into a plurality of image blocks; extracting the spatiotemporal features of each image block using a convolutional neural network model, wherein the convolutional neural network model comprises a separable convolutional network and a point-by-point convolutional network, or comprises a separable convolutional network and a hole convolutional network; and determining gestures included in the plurality of images from the spatiotemporal features of the respective image blocks using a recurrent neural network model.
- The image processing method according to claim 8, wherein dividing the plurality of continuously input images into the plurality of image blocks comprises: dividing M continuously input images into one image block, M being an integer greater than or equal to 2, and wherein extracting the spatiotemporal features of each image block using the convolutional neural network model comprises: inputting the features of each key point of each of the M images into the convolutional neural network model as the features of the image block.
- The image processing method of claim 8, wherein the convolutional neural network model further comprises a fully connected network.
- The image processing method of claim 10, wherein the convolutional neural network model comprises: a plurality of separable convolutional networks, one or more point-by-point convolutional networks or hole convolutional networks, and a fully connected network; or N separable convolutional networks, N point-by-point convolutional networks or hole convolutional networks, and N fully connected networks, wherein N is a positive integer.
- The image processing method according to claim 8, wherein the image processing method further comprises: determining a gesture included in the plurality of images using a plurality of convolutional neural network models and a plurality of recurrent neural network models, respectively; and determining a final gesture according to the output result of each recurrent neural network model.
- The image processing method of claim 12, wherein a first convolutional neural network model of the plurality of convolutional neural network models comprises a separable convolutional network with a step size of 1 and a point-by-point convolutional network, a second convolutional neural network model of the plurality of convolutional neural network models comprises a separable convolutional network with a step size greater than 1 and a point-by-point convolutional network, and a third convolutional neural network model of the plurality of convolutional neural network models comprises a separable convolutional network with a step size of 1 and a hole convolutional network.
- The image processing method of claim 8, wherein determining the gesture included in the plurality of images using the recurrent neural network model comprises: determining the output information at the current time based on the input information at the current time, the proportional information of the output at the previous time, and the integral information of the output at the previous time and/or the differential information of the output at the previous time.
- A computer-readable storage medium comprising executable computer instructions that, when executed by a computer, cause the computer to perform the image processing method of any of claims 8-14.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010407312.XA CN113673280A (en) | 2020-05-14 | 2020-05-14 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN202010407312X | 2020-05-14 | ||
PCT/CN2021/092004 WO2021227933A1 (en) | 2020-05-14 | 2021-05-07 | Image processing apparatus, image processing method, and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115349142A true CN115349142A (en) | 2022-11-15 |
Family
ID=78526428
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010407312.XA Pending CN113673280A (en) | 2020-05-14 | 2020-05-14 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN202180023365.4A Pending CN115349142A (en) | 2020-05-14 | 2021-05-07 | Image processing apparatus, image processing method, and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (2) | CN113673280A (en) |
WO (1) | WO2021227933A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113888541B (en) * | 2021-12-07 | 2022-03-25 | 南方医科大学南方医院 | Image identification method, device and storage medium for laparoscopic surgery stage |
CN118711824B (en) * | 2024-08-30 | 2025-01-21 | 南昌大学第一附属医院 | A method for analyzing the recovery status of critically ill patients based on parallel feature extraction |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10157309B2 (en) * | 2016-01-14 | 2018-12-18 | Nvidia Corporation | Online detection and classification of dynamic gestures with recurrent convolutional neural networks |
CN106991372B (en) * | 2017-03-02 | 2020-08-28 | 北京工业大学 | Dynamic gesture recognition method based on mixed deep learning model |
CN107180226A (en) * | 2017-04-28 | 2017-09-19 | 华南理工大学 | A kind of dynamic gesture identification method based on combination neural net |
CN108846440B (en) * | 2018-06-20 | 2023-06-02 | 腾讯科技(深圳)有限公司 | Image processing method and device, computer readable medium and electronic equipment |
CN110472531B (en) * | 2019-07-29 | 2023-09-01 | 腾讯科技(深圳)有限公司 | Video processing method, device, electronic equipment and storage medium |
CN110889387A (en) * | 2019-12-02 | 2020-03-17 | 浙江工业大学 | A real-time dynamic gesture recognition method based on multi-track matching |
CN111160114B (en) * | 2019-12-10 | 2024-03-19 | 深圳数联天下智能科技有限公司 | Gesture recognition method, gesture recognition device, gesture recognition equipment and computer-readable storage medium |
CN112036261A (en) * | 2020-08-11 | 2020-12-04 | 海尔优家智能科技(北京)有限公司 | Gesture recognition method and device, storage medium and electronic device |
CN112507898B (en) * | 2020-12-14 | 2022-07-01 | 重庆邮电大学 | A Multimodal Dynamic Gesture Recognition Method Based on Lightweight 3D Residual Network and TCN |
- 2020-05-14 CN CN202010407312.XA patent/CN113673280A/en active Pending
- 2021-05-07 CN CN202180023365.4A patent/CN115349142A/en active Pending
- 2021-05-07 WO PCT/CN2021/092004 patent/WO2021227933A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN113673280A (en) | 2021-11-19 |
WO2021227933A1 (en) | 2021-11-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |