CN110263743B - Method and device for recognizing images - Google Patents

Method and device for recognizing images

Info

Publication number
CN110263743B
CN110263743B (application CN201910558852.5A)
Authority
CN
China
Prior art keywords
video
frames
palm
action
hand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910558852.5A
Other languages
Chinese (zh)
Other versions
CN110263743A (en)
Inventor
卢艺帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201910558852.5A priority Critical patent/CN110263743B/en
Publication of CN110263743A publication Critical patent/CN110263743A/en
Application granted granted Critical
Publication of CN110263743B publication Critical patent/CN110263743B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present disclosure disclose methods and apparatus for recognizing images. One embodiment of the method comprises the following steps: acquiring a target video; extracting two video frames containing a palm object from the target video; identifying, based on the two video frames, whether the action of the palm corresponding to the target video is a hand-engaging action; and, in response to determining that the action of the palm corresponding to the target video is a hand-engaging action, generating a recognition result indicating that the action of the palm corresponding to the target video is a hand-engaging action. With this embodiment, an electronic device can determine whether a video contains a palm object performing a hand-engaging action, and thereby identify richer human body posture information.

Description

Method and device for recognizing images
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method and apparatus for recognizing an image.
Background
Image recognition refers to a technique of processing, analyzing, and understanding an image with a computer to recognize various targets and objects.
In human-computer interaction scenarios, more and more researchers are devoting themselves to hand-based interaction technologies. Compared with other human body parts, the hand is free and flexible; it carries out a great deal of interaction work in users' daily lives, and the operations completed by the hand are countless. There is therefore a need in the art to identify semantic information in images or videos containing hand objects.
Disclosure of Invention
The present disclosure proposes methods and apparatus for identifying images.
In a first aspect, embodiments of the present disclosure provide a method for recognizing an image, the method comprising: acquiring a target video; extracting two video frames containing a palm object from the target video; identifying, based on the two video frames, whether the action of the palm corresponding to the target video is a hand-engaging action; and, in response to determining that the action of the palm corresponding to the target video is a hand-engaging action, generating a recognition result indicating that the action of the palm corresponding to the target video is a hand-engaging action.
In some embodiments, identifying whether the motion of the palm corresponding to the target video is a hand-engaging motion based on the two frames of video frames includes: and inputting the two frames of video frames into a pre-trained recognition model to determine whether the motion of the palm corresponding to the target video is a hand-engaging motion, wherein the recognition model is used for recognizing whether the motion of the palm corresponding to the video containing the two frames of video frames is the hand-engaging motion.
In some embodiments, identifying whether the motion of the palm corresponding to the target video is a hand-engaging motion based on the two frames of video frames includes: determining the projection coordinates of normal vectors of palms corresponding to the two frames of video frames respectively on a predetermined projection plane; based on the two determined projection coordinates, whether the motion of the palm corresponding to the target video is a motion of a hand is identified.
In some embodiments, the projection plane is an imaging plane of the video frames; and identifying, based on the two determined projection coordinates, whether the action of the palm corresponding to the target video is a hand-engaging action includes: in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in a preset first direction being less than or equal to a preset first threshold, and the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in a preset second direction being greater than or equal to a preset second threshold, recognizing the action of the palm corresponding to the target video as a hand-engaging action.
In some embodiments, the projection plane is an imaging plane of the video frames; and identifying, based on the two determined projection coordinates, whether the action of the palm corresponding to the target video is a hand-engaging action includes: in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset first direction being greater than the preset first threshold, or the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset second direction being less than the preset second threshold, recognizing the action of the palm corresponding to the target video as a non-hand-engaging action.
In some embodiments, extracting a two-frame video frame containing a palm object from a target video includes: two adjacent frames of video frames containing palm objects are extracted from the target video.
In some embodiments, the method further comprises: in response to generating the recognition result, for each video frame containing a palm object in the target video, performing image fusion on an image containing a target object and the video frame to obtain a fused image corresponding to the video frame; and generating a fused video, and presenting the fused video in place of the target video.
In a second aspect, embodiments of the present disclosure provide an apparatus for recognizing an image, the apparatus comprising: an acquisition unit configured to acquire a target video; an extraction unit configured to extract two frames of video frames containing palm objects from a target video; the identifying unit is configured to identify whether the action of the palm corresponding to the target video is a hand-engaging action or not based on the two frames of video frames; the first generation unit is configured to generate a recognition result for indicating that the action of the palm corresponding to the target video is the hand-engaging action in response to determining that the action of the palm corresponding to the target video is the hand-engaging action.
In some embodiments, the identification unit comprises: the input module is configured to input two frames of video frames into a pre-trained recognition model to determine whether the action of the palm corresponding to the target video is a hand-engaging action, wherein the recognition model is used for recognizing whether the action of the palm corresponding to the video containing the two frames of video frames is the hand-engaging action.
In some embodiments, the identification unit comprises: the determining module is configured to determine projection coordinates of normal vectors of palms corresponding to the two frames of video frames respectively on a predetermined projection plane; and the identifying module is configured to identify whether the action of the palm corresponding to the target video is a hand-engaging action based on the two determined projection coordinates.
In some embodiments, the projection plane is an imaging plane of the video frame; the identification module comprises: the first recognition sub-module is configured to recognize the motion of the palm corresponding to the target video as a hand-engaging motion in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two frames in the preset first direction being smaller than or equal to a preset first threshold value and the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two frames in the preset second direction being larger than or equal to a preset second threshold value.
In some embodiments, the projection plane is an imaging plane of the video frame; the identification module comprises: the second recognition sub-module is configured to recognize the motion of the palm corresponding to the target video as a non-hand-engaging motion in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in the preset first direction being greater than a preset first threshold or the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two frames of video frames in the preset second direction being less than a preset second threshold.
In some embodiments, the extraction unit comprises: and the extraction module is configured to extract two adjacent frames of video frames containing the palm object from the target video.
In some embodiments, the apparatus further comprises: the fusion unit is configured to respond to the generation of the identification result, and for a video frame containing a palm object in the target video, the image containing the target object is subjected to image fusion with the video frame to obtain a fused image corresponding to the video frame; and the second generation unit is configured to generate the fused video and to replace the target video with the fused video for presentation.
In a third aspect, embodiments of the present disclosure provide an electronic device for recognizing an image, comprising: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement a method as in any of the embodiments of the method for identifying images described above.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for identifying an image, having stored thereon a computer program which, when executed by a processor, implements a method as in any of the embodiments of the method for identifying an image described above.
According to the method and the apparatus for recognizing an image provided by the embodiments of the present disclosure, a target video is acquired; two video frames containing a palm object are extracted from the target video; whether the action of the palm corresponding to the target video is a hand-engaging action is identified based on the two video frames; and, in response to determining that it is, a recognition result indicating that the action of the palm corresponding to the target video is a hand-engaging action is generated. In this way, the electronic device can determine whether the video contains a palm object performing a hand-engaging action and thereby identify richer palm gesture information, which helps realize human-computer interaction based on gesture recognition.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method for identifying images according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method for identifying images according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a method for identifying images according to the present disclosure;
FIG. 5 is a schematic structural view of one embodiment of an apparatus for recognizing an image according to the present disclosure;
fig. 6 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of a method for recognizing an image or an apparatus for recognizing an image of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit data (e.g., video) or the like. Various client applications, such as video playing software, news information class applications, image processing class applications, web browser applications, shopping class applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a video capturing function, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., software or software modules for providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, such as a background server processing video (e.g., target video) transmitted by the terminal devices 101, 102, 103. The background server may perform processing such as recognition on received data such as video, and generate a processing result (e.g., recognition result). Optionally, the background server may also feed back the processing result to the terminal device. As an example, the server 105 may be a cloud server or a physical server.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should also be noted that, the method for identifying an image provided by the embodiment of the present disclosure may be performed by a server, may be performed by a terminal device, or may be performed by the server and the terminal device in cooperation with each other. Accordingly, each part (for example, each unit, sub-unit, module, sub-module) included in the image recognition apparatus may be disposed in the server, may be disposed in the terminal device, or may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the method for recognizing an image is run does not need to perform data transmission with other electronic devices in the course of performing the method, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which the method for recognizing an image is run.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for identifying images according to the present disclosure is shown. The method for recognizing an image includes the steps of:
in step 201, a target video is acquired.
In the present embodiment, an execution subject of the method for recognizing an image (e.g., a server or a terminal device shown in fig. 1) may acquire a target video from other electronic devices or locally through a wired connection or a wireless connection.
The target video may be a video to be subjected to palm motion recognition.
As an example, the target video may be a video obtained by photographing the palm of the user by the execution subject or an electronic device communicatively connected to the execution subject.
It will be appreciated that when the target video is a video obtained by photographing the palm of the user, a palm object may be included in some or all of the video frames in the target video. Wherein the palm object may be an image of the palm presented in a video frame (i.e., image).
Step 202, two frames of video frames containing palm objects are extracted from the target video.
In this embodiment, the execution subject may extract two frames of video frames including the palm object from the target video acquired in step 201.
In step 202, the number of video frames extracted from the target video by the execution subject may be 2 or any natural number greater than 2. Embodiments of the present disclosure are not limited in this regard.
It is understood that, when the number of video frames extracted from the target video by the execution body is any natural number greater than 2, two video frames containing the palm object are necessarily among the extracted frames. Therefore, schemes in which the execution body extracts more than two video frames from the target video also fall within the scope of the technical solution claimed in the embodiments of the present disclosure.
As an example, the above-described execution body may execute this step 202 in the following manner:
First, each video frame in the target video is recognized to determine whether it contains a palm object, and if it does, the video frame is marked.
Then, two video frames are randomly selected from the marked video frames; alternatively, one video frame is randomly selected from the marked video frames, and a later video frame separated from it by a predetermined number of frames (for example, 0, 1, 2, 3, etc.) is then selected from the video. In this way, two video frames containing the palm object are extracted from the target video.
In some optional implementations of this embodiment, the foregoing execution body may also execute step 202 in the following manner: two adjacent video frames containing the palm object are extracted from the target video. Here, the two adjacent video frames may be two frames between which no other video frame containing the palm object exists (although other video frames may lie between them), or may be two frames between which no video frame exists at all.
As an example, the execution subject may randomly extract two adjacent frames of video frames containing the palm object from the target video, or may extract two adjacent frames of video frames containing the palm object that occur first, that is, a first frame of video frames and a second frame of video frames containing the palm object from the target video.
It can be appreciated that, because hand motion is flexible and fast, extracting two adjacent video frames allows this optional implementation to better estimate the relative movement of the palm corresponding to the palm object between the two frames, so that the subsequent steps can more accurately identify whether the action of the palm is a hand-engaging action.
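As an illustrative sketch of this extraction step (not part of the disclosure itself), the following Python code reads a video with OpenCV and returns the first pair of adjacent frames that both contain a palm object; the `contains_palm` callable is a hypothetical stand-in for whatever palm detector an implementation might use.
```python
import cv2  # OpenCV, assumed available


def extract_adjacent_palm_frames(video_path, contains_palm):
    """Return the first two adjacent frames that both contain a palm object.

    `contains_palm` is a hypothetical callable frame -> bool standing in for
    whatever palm detector an implementation might use.
    """
    cap = cv2.VideoCapture(video_path)
    prev_frame, prev_has_palm = None, False
    try:
        while True:
            ok, frame = cap.read()
            if not ok:                       # end of video, no adjacent pair found
                return None
            has_palm = contains_palm(frame)  # mark frames containing a palm object
            if has_palm and prev_has_palm:   # two adjacent marked frames found
                return prev_frame, frame
            prev_frame, prev_has_palm = frame, has_palm
    finally:
        cap.release()
```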
Step 203, based on the two frames of video frames, it is identified whether the palm motion corresponding to the target video is a motion of a hand.
In this embodiment, the executing body may identify whether the motion of the palm corresponding to the target video acquired in step 201 is a motion of a hand, based on the two frames of video extracted in step 202. The palm corresponding to the target video may be a palm photographed in the process of obtaining the target video.
In some optional implementations of this embodiment, the foregoing execution body may execute the step 203 in the following manner:
and inputting the two frames of video frames into a pre-trained recognition model to determine whether the motion of the palm corresponding to the target video is a hand-engaging motion. The recognition model is used for recognizing whether the motion of the palm corresponding to the video containing the two inputted frames of video frames is a motion of the hand.
As an example, the identification model may be a two-dimensional table or database storing a plurality of two-frame video frames in association with information indicating whether or not an action of a palm corresponding to a video including each of the plurality of two-frame video frames is a hand-engaging action.
As yet another example, the recognition model may be a convolutional neural network model that is trained based on a training sample set using a machine learning algorithm. The training samples in the training sample set may include two frames of video frames, and information for indicating whether the motion of the palm corresponding to the video including the two frames of video frames is a hand-engaging motion.
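By way of illustration only, the sketch below shows what such a convolutional recognition model could look like in PyTorch, taking the two video frames stacked along the channel dimension and outputting the probability that the corresponding palm action is a hand-engaging action; the framework, architecture, and input layout are assumptions, not part of the disclosure.
```python
import torch
import torch.nn as nn


class TwoFrameRecognitionModel(nn.Module):
    """Illustrative recognition model: two RGB frames -> hand-engaging probability."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=3, stride=2, padding=1),  # 2 frames x 3 channels
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 1)

    def forward(self, frame_pair):                 # frame_pair: (N, 6, H, W)
        x = self.features(frame_pair).flatten(1)
        return torch.sigmoid(self.classifier(x))  # probability of a hand-engaging action


# Usage sketch: stack the two extracted frames along the channel axis.
# frames = torch.cat([frame_prev, frame_curr], dim=1)  # each (N, 3, H, W)
# is_engaging = TwoFrameRecognitionModel()(frames) > 0.5
```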
In some optional implementations of this embodiment, the foregoing execution body may also execute the step 203 in the following manner:
First, determine the projection coordinates, on a predetermined projection plane, of the normal vector of the palm corresponding to each of the two video frames.
The projection plane may be any predetermined plane in three-dimensional space, for example, any plane parallel to the imaging plane of the video frames (i.e., the image plane of an image acquisition device such as a camera). The projection coordinates may be characterized by a first element and a second element. The first element may be the coordinate value of the projection coordinates in a preset first direction, and the second element may be the coordinate value of the projection coordinates in a preset second direction. As an example, the projection coordinates may be expressed as "(x, y)", where x may be the first element and y may be the second element. The normal vector of the palm may be perpendicular to the plane in which the palm lies and point toward the palm-center side, and it may have a predetermined value.
It will be appreciated that, since the projection plane is a two-dimensional plane, the projection coordinates only need to be characterized by the first element and the second element. In some cases, however, the projection coordinates may also be characterized by more elements.
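To make the projection step concrete, here is a minimal sketch assuming the projection plane is the imaging plane and that it is spanned by two orthonormal direction vectors corresponding to the preset first and second directions; the example vectors are hypothetical values, not values given by the disclosure.
```python
import numpy as np


def project_normal(normal, first_dir, second_dir):
    """Project a 3-D palm normal vector onto the projection plane.

    The plane is assumed to be spanned by `first_dir` and `second_dir`
    (orthonormal 3-D vectors, e.g. the imaging plane's horizontal and vertical
    axes). Returns (x, y): the coordinate values along the first and second
    directions.
    """
    normal = np.asarray(normal, dtype=float)
    x = float(np.dot(normal, first_dir))   # first element of the projection coordinates
    y = float(np.dot(normal, second_dir))  # second element of the projection coordinates
    return x, y


# Example with hypothetical values: first direction parallel to the ground,
# second direction perpendicular to it.
# project_normal([0.0, 0.7, 0.7], first_dir=[1, 0, 0], second_dir=[0, 1, 0]) -> (0.0, 0.7)
```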
Second, based on the two determined projection coordinates, identify whether the action of the palm corresponding to the target video is a hand-engaging action.
Specifically, the above-described execution body may execute the second step in the following manner:
and inputting the two determined projection coordinates into a pre-trained classification model, and generating discrimination information for indicating whether the motion of the palm corresponding to the video corresponding to the two projection coordinates is a hand-engaging motion.
The classification model can be trained by the following steps:
first, a training sample set is acquired. The training samples in the training sample set comprise two projection coordinates and judging information for indicating whether the motion of the palm corresponding to the video corresponding to the two projection coordinates is a hand-engaging motion or not. The following relationship exists between "two projection coordinates" and "video" in the video corresponding to "two projection coordinates: and determining two projection coordinates of normal vectors of palms corresponding to two frames of video frames containing the palms in the video in a predetermined projection plane respectively, so as to obtain the two projection coordinates.
Then, using a machine learning algorithm, the two projection coordinates included in each training sample of the training sample set are used as input data of an initial model, and the discrimination information corresponding to the two input projection coordinates is used as expected output data of the initial model; whether the initial model satisfies a predetermined training end condition is determined, and if so, the initial model that satisfies the training end condition is taken as the classification model.
Wherein the initial model may include: convolutional layers, classifiers, pooling layers, and so forth. The training end condition may include, but is not limited to, at least one of: the training time exceeds a preset duration, the training times exceeds a preset number, and a function value of a predetermined loss function calculated based on actual output data and expected output data is smaller than a preset threshold. Here, the actual output data is data obtained by inputting input data into the initial model and calculating the initial model.
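A minimal sketch of such a training procedure is shown below, using PyTorch and a small fully connected classifier as illustrative choices (the disclosure only requires a machine learning algorithm, an initial model, and a training end condition); the sample layout and end-condition values are assumptions.
```python
import torch
import torch.nn as nn


def train_classification_model(coord_pairs, labels, max_epochs=100, loss_threshold=0.05):
    """Train an illustrative classifier on pairs of projection coordinates.

    coord_pairs: tensor of shape (N, 4) holding (x_prev, y_prev, x_curr, y_curr).
    labels:      tensor of shape (N, 1) with 1.0 for hand-engaging, 0.0 otherwise.
    Training stops when a preset epoch count or loss threshold is reached,
    mirroring the "training end condition" described above.
    """
    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(max_epochs):                     # end condition: number of iterations
        optimizer.zero_grad()
        loss = loss_fn(model(coord_pairs), labels)  # actual vs. expected output
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:            # end condition: loss below threshold
            break
    return model
```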
In some optional implementations of this embodiment, the projection plane is an imaging plane of the video frame. Thus, the execution body may execute the second step as follows:
In response to the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset first direction being less than or equal to a preset first threshold, and the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset second direction being greater than or equal to a preset second threshold, the action of the palm corresponding to the target video is recognized as a hand-engaging action.
In some optional implementations of this embodiment, the projection plane is an imaging plane of the video frame. Thus, the execution body may execute the second step as follows:
In response to the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset first direction being greater than the preset first threshold, or the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset second direction being less than the preset second threshold, the action of the palm corresponding to the target video is recognized as a non-hand-engaging action.
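Because the two conditions above are complementary, they can be expressed as a single check. The following sketch assumes the two projection coordinates have already been determined; the default threshold values are placeholders, not values from the disclosure.
```python
def is_hand_engaging(coord_a, coord_b, first_threshold=0.2, second_threshold=0.3):
    """Decide whether the palm action is a hand-engaging action.

    coord_a, coord_b: (x, y) projection coordinates of the two video frames,
    where x is the coordinate value along the preset first direction and y the
    value along the preset second direction. The threshold defaults are
    placeholders; real values would be tuned for the application.
    """
    dx = abs(coord_a[0] - coord_b[0])  # change along the first direction
    dy = abs(coord_a[1] - coord_b[1])  # change along the second direction
    return dx <= first_threshold and dy >= second_threshold


# Any pair failing this check is recognized as a non-hand-engaging action.
```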
In some optional implementations of this embodiment, the execution body may generate the projection coordinates corresponding to a video frame (e.g., either of the two extracted video frames) in the following manner:
and inputting the video frame into a pre-trained projection coordinate determination model to obtain the projection coordinate corresponding to the video frame. The projection coordinate determination model may be used to determine projection coordinates corresponding to an input video frame. The projection coordinates corresponding to the video frame are projection coordinates of a normal vector of a palm corresponding to a palm object in the video frame on a predetermined projection plane.
As an example, the projection coordinate determining model may be a model obtained by training based on a training sample set by using a machine learning algorithm, or may be a formula for obtaining projection coordinates by performing normal vector calculation on a plane where a palm corresponding to a palm object in a video frame is located and then projecting the obtained normal vector by using a geometric algorithm.
The training samples in the training sample set include video frames and projection coordinates corresponding to the video frames. The projection coordinates corresponding to each video frame in the training sample set can be obtained by manual annotation; or can be obtained by the following way:
first, a palm-rotated video is acquired. The coordinate values along the first direction in the projected coordinates of the palm normal vector in each video frame are the same, and the coordinate values along the second direction in the projected coordinates of the palm normal vector in each video frame are continuously changed. The projected coordinates are projected coordinates of the palm normal vector on a preset plane (e.g., an imaging plane). The video content of palm rotation can reflect the variation trend of the palm normal vector. Here, the video of the palm rotation may be photographed by various devices (e.g., the above-described execution subject or other electronic devices communicatively connected to the above-described execution subject). Wherein, in the course of palm rotation, can keep the rotation direction fixed, the palm does not shift in the direction along the pivot. The direction of the rotating shaft can be the direction of the arm. For example, the palm and the fingers are unfolded, so that the palm and the fingers are on the same plane, and the direction of the fingers is vertical. When the plane of the palm is perpendicular to the screen direction, the initial position of the palm is set, then the palm center rotates towards the screen direction, the plane of the palm is rotated to be parallel to the screen, the rotation is continued according to the original rotation direction until the plane of the palm is perpendicular to the screen direction again, and the palm does not displace in the vertical direction in the palm rotation process. The palm-rotated video contains a plurality of video frames. The coordinate values along the first direction in the projected coordinates of the palm normal vector in each video frame of the palm rotation video are the same, and the coordinate values along the second direction in the projected coordinates of the palm normal vector in the image of the video continuously change, wherein the continuously change can be understood as continuously increasing or continuously decreasing. The projected coordinates of the palm normal vector are coordinates of the palm normal vector in a preset plane, which may be an imaging plane. And obtaining the projection coordinates of the palm normal vector in each video frame along the second direction by adopting a similar method, wherein the coordinate values of the palm normal vector in the projection coordinates of the palm normal vector in each video frame along the first direction are the same, and the coordinate values of the palm normal vector in each video frame along the first direction are continuously changed. Here, the first direction and the second direction may be any directions. Alternatively, the line in which the first direction is located and the line in which the second direction is located may be perpendicular. As an example, the first direction may be a direction parallel to the ground in the projection plane, and the second direction may be a direction perpendicular to the ground in the projection plane.
Then, each video frame contained in the video and the projection coordinates corresponding to the video frame can be formed into a training sample, so that a training sample set is obtained.
It can be understood that it is difficult to accurately annotate the projection coordinates of video frames manually, which results in low accuracy of the trained model, and manual annotation is also laborious, so the training efficiency of the model is low. Therefore, training the projection-coordinate determination model with a training sample set obtained without manual labeling can improve the accuracy with which the model generates projection coordinates and improve the training efficiency of the model.
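As one possible, non-authoritative reading of this labeling scheme, the sketch below turns a palm-rotation video into (frame, projection-coordinate) training samples without manual annotation, assuming the first-direction value is a known constant over the clip and modelling the continuously changing second-direction value by linear interpolation between assumed start and end values.
```python
import cv2
import numpy as np


def label_rotation_video(video_path, constant_x=0.0, y_start=-1.0, y_end=1.0):
    """Build (frame, projection-coordinate) training samples from a palm-rotation video.

    Assumptions (not specified by the disclosure): the first-direction value is a
    known constant for the whole clip, and the second-direction value changes
    linearly from y_start to y_end across the frames, which is one simple way to
    realise "continuously increasing or decreasing".
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    ys = np.linspace(y_start, y_end, num=len(frames))   # continuously changing coordinate
    return [(frame, (constant_x, float(y))) for frame, y in zip(frames, ys)]
```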
In step 204, in response to determining that the motion of the palm corresponding to the target video is a hand-engaging motion, a recognition result is generated for indicating that the motion of the palm corresponding to the target video is a hand-engaging motion.
In this embodiment, in a case where it is determined that the motion of the palm corresponding to the target video is a hand-engaging motion, the execution body may generate a recognition result for indicating that the motion of the palm corresponding to the target video is the hand-engaging motion.
In some optional implementations of this embodiment, after performing step 204, the foregoing execution body may further perform the following steps:
First, in response to the recognition result being generated, for each video frame containing a palm object in the target video, an image containing a target object is fused with the video frame to obtain a fused image corresponding to the video frame. The target object may be any of various objects, such as text indicating an incoming call.
Then, generating a fused video, and adopting the fused video to replace the target video for presenting.
It should be noted that, before presenting the fused video, the execution subject may or may not present the target video, which is not limited herein.
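A minimal sketch of the fusion step is given below, assuming the image containing the target object is a small RGBA overlay whose colour channels are in the same order as the frame's, and using simple alpha blending; the blend rule and placement are illustrative choices rather than requirements of the disclosure.
```python
import numpy as np


def fuse_frame(frame, overlay_rgba, top_left=(0, 0)):
    """Alpha-blend an RGBA overlay containing the target object onto one video frame.

    frame:        HxWx3 uint8 video frame containing a palm object.
    overlay_rgba: hxwx4 uint8 image of the target object (alpha in channel 3);
                  assumed to fit inside the frame at the given position.
    """
    fused = frame.copy()
    h, w = overlay_rgba.shape[:2]
    y0, x0 = top_left
    region = fused[y0:y0 + h, x0:x0 + w].astype(np.float32)
    rgb = overlay_rgba[..., :3].astype(np.float32)
    alpha = overlay_rgba[..., 3:4].astype(np.float32) / 255.0
    fused[y0:y0 + h, x0:x0 + w] = (alpha * rgb + (1.0 - alpha) * region).astype(np.uint8)
    return fused


# The fused video is then produced by applying fuse_frame to every video frame
# that contains a palm object and re-encoding the frame sequence.
```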
In some optional implementations of this embodiment, after performing step 204, the execution body may further perform the following step: controlling a target device to move to the position where the execution body is located. The target device may be any of various electronic devices communicatively connected to the execution body, such as a mobile robot.
It can be appreciated that in this alternative implementation manner, when it is recognized that the palm motion corresponding to the target video is a hand-engaging motion, the target device is controlled to approach the execution body, so that man-machine interaction based on the hand-engaging motion recognition is achieved.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for recognizing an image according to the present embodiment. In the application scenario of fig. 3, the mobile phone 301 first acquires the target video 3011. Then, the mobile phone 301 extracts two video frames containing a palm object (video frames 30111 and 30112 in the figure) from the target video 3011. Next, the mobile phone 301 recognizes, based on the two video frames, whether the action of the palm corresponding to the target video 3011 is a hand-engaging action. Finally, in response to determining that the action of the palm corresponding to the target video 3011 is a hand-engaging action, a recognition result indicating this is generated. As an example, in the figure, the mobile phone 301 generates the recognition result "is a hand-engaging action" for the target video 3011.
In the prior art, there is generally no technical solution for recognizing whether a video contains a palm object performing a hand-engaging action. However, because the hand is freer and more flexible than other human body parts, it carries out a great deal of interaction work in users' daily lives, and the operations completed by the hand are countless. There is therefore a need in the art to identify semantic information of images or videos containing hand objects, for example, to identify whether the palm corresponding to an image or video is performing a hand-engaging action.
According to the method provided by the above embodiment of the present disclosure, a target video is acquired; two video frames containing a palm object are extracted from the target video; whether the action of the palm corresponding to the target video is a hand-engaging action is identified based on the two video frames; and, in response to determining that it is, a recognition result indicating that the action of the palm corresponding to the target video is a hand-engaging action is generated. In this way, the electronic device can determine whether the video contains a palm object performing a hand-engaging action and thereby identify richer palm gesture information, which helps realize human-computer interaction based on gesture recognition.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for identifying an image is shown. The process 400 of the method for recognizing an image comprises the steps of:
step 401, obtaining a target video. Thereafter, step 402 is performed.
In this embodiment, step 401 is substantially identical to step 201 in the corresponding embodiment of fig. 2, and will not be described herein.
Step 402, two frames of video frames containing palm objects are extracted from the target video. Thereafter, step 403 is performed.
In this embodiment, step 402 is substantially identical to step 202 in the corresponding embodiment of fig. 2, and will not be described herein.
Step 403, determining the projection coordinates of the normal vector of the palm corresponding to each of the two frames of video frames on the predetermined projection plane. Thereafter, step 404 is performed.
In this embodiment, the execution body may determine projection coordinates of normal vectors of palms corresponding to the two frames of video frames respectively on a predetermined projection plane. The projection plane is the imaging plane of the video frame.
Here, the execution body or the electronic device communicatively connected to the execution body may perform the step 403 in various manners.
As an example, the executing body may sequentially input the two frames of video frames extracted in the step 402 to a pre-trained palm normal vector determination model, sequentially obtain normal vectors of palms corresponding to each of the two frames of video frames, and sequentially determine projection coordinates of the obtained normal vectors on an imaging plane.
Here, the palm normal vector determination model described above may be used to determine the normal vector of the palm to which the video frame corresponds. As an example, the palm normal vector determination model may be a two-dimensional table or database in which video frames are stored in association with normal vectors of the palm corresponding to the video frames, or may be a convolutional neural network model obtained by training based on a training sample set by using a machine learning algorithm. The training samples in the training sample set comprise video frames and normal vectors of palms corresponding to the video frames.
As another example, the execution subject may sequentially input the two video frames extracted in step 402 to a pre-trained projection coordinate determination model to obtain projection coordinates corresponding to the video frames. The projection coordinate determination model may be used to determine projection coordinates corresponding to an input video frame. The projection coordinates corresponding to the video frame are the projection coordinates of the normal vector of the palm corresponding to the palm object in the video frame on the imaging plane.
As an example, the projection coordinate determining model may be a model obtained by training based on a training sample set by using a machine learning algorithm, or may be a formula for obtaining projection coordinates by performing normal vector calculation on a plane where a palm corresponding to a palm object in a video frame is located and then projecting the obtained normal vector by using a geometric algorithm.
The training samples in the training sample set include video frames and projection coordinates corresponding to the video frames. The projection coordinates corresponding to each video frame in the training sample set can be obtained by manual annotation; or can be obtained by the following way:
first, a palm-rotated video is acquired. The coordinate values along the first direction in the projected coordinates of the palm normal vector in each video frame are the same, and the coordinate values along the second direction in the projected coordinates of the palm normal vector in each video frame are continuously changed. The projected coordinates are projected coordinates of the palm normal vector on a preset plane (e.g., an imaging plane). The video content of palm rotation can reflect the variation trend of the palm normal vector. Here, the video of the palm rotation may be photographed by various devices (e.g., the above-described execution subject or other electronic devices communicatively connected to the above-described execution subject). Wherein, in the course of palm rotation, can keep the rotation direction fixed, the palm does not shift in the direction along the pivot. The direction of the rotating shaft can be the direction of the arm. For example, the palm and the fingers are unfolded, so that the palm and the fingers are on the same plane, and the direction of the fingers is vertical. When the plane of the palm is perpendicular to the screen direction, the initial position of the palm is set, then the palm center rotates towards the screen direction, the plane of the palm is rotated to be parallel to the screen, the rotation is continued according to the original rotation direction until the plane of the palm is perpendicular to the screen direction again, and the palm does not displace in the vertical direction in the palm rotation process. The palm-rotated video contains a plurality of video frames. The coordinate values along the first direction in the projected coordinates of the palm normal vector in each video frame of the palm rotation video are the same, and the coordinate values along the second direction in the projected coordinates of the palm normal vector in the image of the video continuously change, wherein the continuously change can be understood as continuously increasing or continuously decreasing. The projected coordinates of the palm normal vector are coordinates of the palm normal vector in a preset plane, which may be an imaging plane. And obtaining the projection coordinates of the palm normal vector in each video frame along the second direction by adopting a similar method, wherein the coordinate values of the palm normal vector in the projection coordinates of the palm normal vector in each video frame along the first direction are the same, and the coordinate values of the palm normal vector in each video frame along the first direction are continuously changed.
Then, each video frame contained in the video and the projection coordinates corresponding to the video frame can be formed into a training sample, so that a training sample set is obtained.
It can be understood that it is difficult to accurately annotate the projection coordinates of video frames manually, which results in low accuracy of the trained model, and manual annotation is also laborious, so the training efficiency of the model is low. Therefore, training the projection-coordinate determination model with a training sample set obtained without manual labeling can improve the accuracy with which the model generates projection coordinates and improve the training efficiency of the model.
Step 404, determining whether the projection coordinates of the two frames of video frames meet a predetermined recognition condition. If yes, go to step 405; if not, go to step 406.
In this embodiment, the execution body may determine whether the projection coordinates of the two video frames satisfy a predetermined recognition condition. The recognition condition is that the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset first direction is less than or equal to a preset first threshold, and the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset second direction is greater than or equal to a preset second threshold. The first threshold and the second threshold may be preset by a technician and may be equal or different.
Here, if the recognition condition is satisfied, the execution body may continue to execute step 405; if the recognition condition is not satisfied, the execution body may execute step 406.
Step 405, the motion of the palm corresponding to the target video is identified as a hand-engaging motion. Thereafter, step 407 is performed.
In this embodiment, in a case where "the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in the preset first direction is smaller than or equal to the preset first threshold value, and the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in the preset second direction is larger than or equal to the preset second threshold value" is satisfied, the execution subject may identify the motion of the palm corresponding to the target video as the hand-engaging motion.
Step 406, recognizing the action of the palm corresponding to the target video as a non-hand-engaging action.
In this embodiment, in the case where the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset first direction is greater than the preset first threshold, or the absolute value of the difference between the coordinate values of the projection coordinates of the two video frames in the preset second direction is less than the preset second threshold, the execution body may recognize the action of the palm corresponding to the target video as a non-hand-engaging action.
In this embodiment, the projection coordinates may be characterized by a first element and a second element. The first element may be the coordinate value of the projection coordinates in the preset first direction, and the second element may be the coordinate value of the projection coordinates in the preset second direction. Thus, when the following formula (i.e., the recognition condition) is satisfied, the action of the palm corresponding to the target video can be recognized as a hand-engaging action:
|d_x| ≤ T_1 and |d_y| ≥ T_2
where d_x is the difference between the first elements of the projection coordinates of the two video frames and d_y is the difference between the second elements. When the projection coordinates of the two video frames are (x_t, y_t) and (x_{t-1}, y_{t-1}), d_x = x_t - x_{t-1} and d_y = y_t - y_{t-1}; here x_t and y_t are the first and second elements of the projection coordinates of one video frame, x_{t-1} and y_{t-1} are the first and second elements of the projection coordinates of the other video frame, and T_1 and T_2 are the preset first and second thresholds.
In addition, when the above formula is not satisfied, the action of the palm corresponding to the target video may be recognized as a non-hand-engaging action.
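As a purely hypothetical numerical illustration (the values below are not taken from the disclosure): with T_1 = 0.2 and T_2 = 0.3, projection coordinates (x_{t-1}, y_{t-1}) = (0.10, 0.05) and (x_t, y_t) = (0.15, 0.60) give |d_x| = 0.05 ≤ 0.2 and |d_y| = 0.55 ≥ 0.3, so the action is recognized as a hand-engaging action; if instead (x_t, y_t) = (0.45, 0.10), then |d_x| = 0.35 > 0.2 and the action is recognized as a non-hand-engaging action.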
Step 407, generating a recognition result for indicating that the action of the palm corresponding to the target video is a hand-engaging action.
In this embodiment, the execution body may generate the recognition result for indicating that the motion of the palm corresponding to the target video is the motion of the hand.
It should be noted that, in addition to the above, the present embodiment may further include the same or similar features and effects as those of the embodiment corresponding to fig. 2, which are not described herein.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the process 400 of the method for recognizing an image in this embodiment highlights the step of recognizing, based on the projection coordinates of the palm normal vectors on the imaging plane, whether the action of the palm corresponding to the target video is a hand-engaging action. Therefore, the scheme described in this embodiment can identify more accurately and rapidly whether the action of the palm corresponding to the video is a hand-engaging action.
With further reference to fig. 5, as an implementation of the method shown in fig. 2 described above, the present disclosure provides an embodiment of an apparatus for recognizing an image, which corresponds to the method embodiment shown in fig. 2, and which may include the same or corresponding features as the method embodiment shown in fig. 2, in addition to the features described below, and produces the same or corresponding effects as the method embodiment shown in fig. 2. The device can be applied to various electronic equipment.
As shown in fig. 5, the apparatus 500 for recognizing an image of the present embodiment includes: an acquisition unit 501, an extraction unit 502, an identification unit 503, and a first generation unit 504. Wherein the acquisition unit 501 is configured to acquire a target video; the extracting unit 502 is configured to extract two-frame video frames containing palm objects from the target video; the identifying unit 503 is configured to identify whether the motion of the palm corresponding to the target video is a motion of a hand, based on the two frames of video frames; the first generation unit 504 is configured to generate, in response to determining that the action of the palm corresponding to the target video is a hand-engaging action, a recognition result for indicating that the action of the palm corresponding to the target video is a hand-engaging action.
In the present embodiment, the acquisition unit 501 of the apparatus 500 for recognizing an image may acquire a target video from other electronic devices or locally through a wired connection or a wireless connection. The target video may be a video to be subjected to palm motion recognition.
In the present embodiment, the extraction unit 502 may extract two frames of video frames including a palm object from the target video acquired by the acquisition unit 501.
In this embodiment, the identifying unit 503 may identify whether the motion of the palm corresponding to the target video acquired by the acquiring unit 501 is a motion of a hand, based on the two frames of video extracted by the extracting unit 502. The palm corresponding to the target video may be a palm photographed in the process of obtaining the target video.
In this embodiment, in response to determining that the motion of the palm corresponding to the target video acquired by the acquisition unit 501 is a hand-engaging motion, the first generation unit 504 may generate a recognition result for indicating that the motion of the palm corresponding to the target video is a hand-engaging motion.
In some optional implementations of this embodiment, the identifying unit 503 includes: the input module (shown in the figure) is configured to input two frames of video frames into a pre-trained recognition model to determine whether the motion of the palm corresponding to the target video is a hand-engaging motion, wherein the recognition model is used for recognizing whether the motion of the palm corresponding to the video containing the two frames of video frames is a hand-engaging motion.
In some optional implementations of the present embodiment, the identifying unit 503 includes: the determining module (shown in the figure) is configured to determine projection coordinates of normal vectors of palms corresponding to the two frames of video frames respectively on a predetermined projection plane; and an identification module (shown in the figure) is configured to identify whether the motion of the palm corresponding to the target video is a motion of a hand, based on the determined two projection coordinates.
In some optional implementations of this embodiment, the projection plane is the imaging plane of the video frames, and the identification module includes: a first recognition sub-module (not shown in the figure) configured to recognize the action of the palm corresponding to the target video as a hand-engaging action in response to the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two video frames in a preset first direction being less than or equal to a preset first threshold, and the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two video frames in a preset second direction being greater than or equal to a preset second threshold.
In some optional implementations of this embodiment, the projection plane is the imaging plane of the video frames, and the identification module includes: a second recognition sub-module (not shown in the figure) configured to recognize the action of the palm corresponding to the target video as a non-hand-engaging action in response to the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two video frames in the preset first direction being greater than a preset first threshold, or the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two video frames in the preset second direction being less than a preset second threshold.
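The two sub-modules above reduce to a simple threshold test on how the projected palm normal moves between the two frames. The sketch below assumes the palm normal of each frame is already available as a 3-D vector in camera coordinates, that the imaging-plane axes coincide with the preset first (horizontal) and second (vertical) directions, and that the threshold values are placeholders.

    import numpy as np


    def project_to_imaging_plane(normal_3d):
        # Keep the in-plane components of the normal vector and drop the depth
        # component: x along the preset first direction, y along the second.
        x, y, _depth = np.asarray(normal_3d, dtype=float)
        return np.array([x, y])


    def classify_palm_action(normal_first_frame, normal_second_frame, t1=0.2, t2=0.4):
        p1 = project_to_imaging_plane(normal_first_frame)
        p2 = project_to_imaging_plane(normal_second_frame)
        dx = abs(p1[0] - p2[0])  # change along the first (horizontal) direction
        dy = abs(p1[1] - p2[1])  # change along the second (vertical) direction
        if dx <= t1 and dy >= t2:
            return "hand-engaging"      # palm tilts mainly up/down between the two frames
        return "non-hand-engaging"      # too much sideways drift or too little tilt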
In some optional implementations of the present embodiment, the extraction unit 502 includes: an extraction module (not shown in the figure) configured to extract two adjacent video frames containing palm objects from the target video.
In some optional implementations of this embodiment, the apparatus 500 further includes: a fusion unit (not shown in the figure) configured to, in response to the recognition result being generated, perform, for a video frame containing a palm object in the target video, image fusion of an image containing a target object with that video frame to obtain a fused image corresponding to the video frame; and a second generation unit (not shown in the figure) configured to generate a fused video and present the fused video in place of the target video.
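One plausible reading of the fusion step is a per-pixel alpha blend of a prepared overlay image (for example, rendered text signalling a call) into each palm-containing frame. The sketch below assumes uint8 RGB frames, an overlay of the same size, and an illustrative alpha value; none of these details are specified in the patent.

    import numpy as np


    def fuse_overlay(frame, overlay, alpha=0.6):
        """Blend an overlay image into one video frame of the same (H, W, 3) shape."""
        fused = alpha * overlay.astype(np.float32) + (1.0 - alpha) * frame.astype(np.float32)
        return np.clip(fused, 0, 255).astype(np.uint8)


    def fuse_video(frames, frame_contains_palm, overlay, alpha=0.6):
        # Only frames that contain a palm object are fused, as described above;
        # the remaining frames are passed through unchanged.
        return [fuse_overlay(f, overlay, alpha) if has_palm else f
                for f, has_palm in zip(frames, frame_contains_palm)]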
In the apparatus for recognizing an image provided by the foregoing embodiment of the present disclosure, the acquisition unit 501 acquires a target video, the extraction unit 502 extracts two video frames containing palm objects from the target video, the identification unit 503 identifies, based on the two video frames, whether the action of the palm corresponding to the target video is a hand-engaging action, and, in response to determining that it is, the first generation unit 504 generates a recognition result indicating that the action of the palm corresponding to the target video is a hand-engaging action. The electronic device can thereby determine whether a video contains a palm object performing a hand-engaging action and recognize richer palm posture information, which helps to realize human-computer interaction based on gesture recognition.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The terminal device/server illustrated in fig. 6 is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 6 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 601.
It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a target video; extracting two frames of video frames containing palm objects from the target video; based on the two frames of video frames, identifying whether the action of the palm corresponding to the target video is a hand-engaging action; in response to determining that the action of the palm corresponding to the target video is a hand-engaging action, generating a recognition result for indicating that the action of the palm corresponding to the target video is a hand-engaging action.
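To make the stored-program description concrete, the following sketch drives those four steps over a video file. OpenCV is used only to read frames, and detect_palm and recognize_pair are assumed helper functions standing in for the palm-detection and identification steps; they are not defined in the patent.

    import cv2


    def recognize_video_file(path, detect_palm, recognize_pair):
        cap = cv2.VideoCapture(path)           # acquire the target video
        prev = None
        pair = None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if detect_palm(frame):             # frame contains a palm object
                if prev is not None:           # two adjacent palm frames found
                    pair = (prev, frame)
                    break
                prev = frame
            else:
                prev = None
        cap.release()
        if pair is None:
            return None
        if recognize_pair(*pair):              # is the palm action a hand-engaging action?
            return {"palm_action": "hand-engaging"}
        return None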
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, an extraction unit, an identification unit, and a first generation unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the acquisition unit may also be described as "a unit that acquires a target video".
The foregoing description presents only the preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the invention referred to in this disclosure is not limited to the specific combination of the features described above, and also covers other embodiments formed by combining the above features or their equivalents in any way without departing from the inventive concept, for example, embodiments formed by substituting the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).

Claims (8)

1. A method for identifying an image, comprising:
acquiring a target video;
extracting two adjacent frames of video frames containing palm objects from the target video;
based on the two frames of video frames, identifying whether the action of the palm corresponding to the target video is a hand-engaging action;
in response to determining that the action of the palm corresponding to the target video is a hand-engaging action, generating a recognition result for indicating that the action of the palm corresponding to the target video is a hand-engaging action;
controlling target equipment to move to a position where an execution main body of the method is located, wherein the target equipment is a mobile robot in communication connection with the execution main body;
wherein the identifying, based on the two frames of video frames, whether the action of the palm corresponding to the target video is a hand-engaging action comprises:
determining projection coordinates of normal vectors of palms corresponding to the two frames of video frames respectively on a predetermined projection plane, wherein the projection plane is an imaging plane of the video frames;
in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in a preset first direction being smaller than or equal to a preset first threshold value, and the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in a preset second direction being larger than or equal to a preset second threshold value, identifying the action of the palm corresponding to the target video as a hand-engaging action, wherein the first direction is a direction parallel to the ground in the projection plane, and the second direction is a direction perpendicular to the ground in the projection plane;
in response to the recognition result being generated, performing, for a video frame containing a palm object in the target video, image fusion of an image containing a target object with the video frame to obtain a fused image corresponding to the video frame, wherein the target object is a word for indicating a call;
generating a fused video, and replacing the target video with the fused video for presentation.
2. The method of claim 1, wherein the identifying, based on the two frames of video frames, whether the action of the palm corresponding to the target video is a hand-engaging action comprises:
and inputting the two frames of video frames into a pre-trained recognition model to determine whether the action of the palm corresponding to the target video is a hand-engaging action, wherein the recognition model is used for recognizing whether the action of the palm corresponding to the video containing the two frames of video frames is the hand-engaging action.
3. The method of claim 1, wherein the projection plane is an imaging plane of a video frame; and
the identifying, based on the two determined projection coordinates, whether the action of the palm corresponding to the target video is a hand-engaging action comprises:
and in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in the preset first direction being larger than the preset first threshold value, or the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in the preset second direction being smaller than the preset second threshold value, identifying the action of the palm corresponding to the target video as a non-hand-engaging action.
4. An apparatus for identifying an image, comprising:
An acquisition unit configured to acquire a target video;
an extraction unit configured to extract, from the target video, two adjacent frames of video frames containing palm objects;
an identification unit configured to identify, based on the two frames of video frames, whether the action of the palm corresponding to the target video is a hand-engaging action;
a first generation unit configured to generate, in response to determining that the action of the palm corresponding to the target video is a hand-engaging action, a recognition result for indicating that the action of the palm corresponding to the target video is a hand-engaging action;
a control unit configured to control a target device to move to a position where the apparatus is located, the target device being a mobile robot communicatively connected to the apparatus;
the identification unit includes:
the determining module is configured to determine the projection coordinates of the normal vector of the palm corresponding to each of the two frames of video frames on a predetermined projection plane, wherein the projection plane is an imaging plane of the video frames;
a first recognition sub-module configured to recognize the action of the palm corresponding to the target video as a hand-engaging action in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in a preset first direction being smaller than or equal to a preset first threshold value, and the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two frames of video frames in a preset second direction being larger than or equal to a preset second threshold value, wherein the first direction is a direction parallel to the ground in the projection plane, and the second direction is a direction perpendicular to the ground in the projection plane;
a fusion unit configured to, in response to the recognition result being generated, perform, for a video frame containing a palm object in the target video, image fusion of an image containing a target object with the video frame to obtain a fused image corresponding to the video frame, wherein the target object is a word for indicating a call;
and the second generation unit is configured to generate a fused video and replace the target video with the fused video for presentation.
5. The apparatus of claim 4, wherein the identification unit comprises:
the input module is configured to input the two frames of video frames into a pre-trained recognition model to determine whether the action of the palm corresponding to the target video is a hand-engaging action, wherein the recognition model is used for recognizing whether the action of the palm corresponding to the video containing the two frames of video frames is the hand-engaging action.
6. The apparatus of claim 4, wherein the projection plane is an imaging plane of a video frame; and
the identification unit includes:
a second recognition sub-module configured to recognize the action of the palm corresponding to the target video as a non-hand-engaging action in response to the absolute value of the difference between the coordinate values of the projection coordinates of the two frames of video frames in the preset first direction being greater than a preset first threshold, or the absolute value of the difference between the coordinate values of the projection coordinates corresponding to the two frames of video frames in the preset second direction being less than a preset second threshold.
7. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-3.
8. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-3.
CN201910558852.5A 2019-06-26 2019-06-26 Method and device for recognizing images Active CN110263743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910558852.5A CN110263743B (en) 2019-06-26 2019-06-26 Method and device for recognizing images

Publications (2)

Publication Number Publication Date
CN110263743A CN110263743A (en) 2019-09-20
CN110263743B true CN110263743B (en) 2023-10-13

Family

ID=67921630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910558852.5A Active CN110263743B (en) 2019-06-26 2019-06-26 Method and device for recognizing images

Country Status (1)

Country Link
CN (1) CN110263743B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143613B (en) * 2019-12-30 2024-02-06 携程计算机技术(上海)有限公司 Method, system, electronic device and storage medium for selecting video cover

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6944315B1 (en) * 2000-10-31 2005-09-13 Intel Corporation Method and apparatus for performing scale-invariant gesture recognition
CN105794191A (en) * 2013-12-17 2016-07-20 夏普株式会社 Recognition data transmission device
CN105809144A (en) * 2016-03-24 2016-07-27 重庆邮电大学 Gesture recognition system and method adopting action segmentation
CN107438804A (en) * 2016-10-19 2017-12-05 深圳市大疆创新科技有限公司 A kind of Wearable and UAS for being used to control unmanned plane
CN108345387A (en) * 2018-03-14 2018-07-31 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN108549490A (en) * 2018-05-03 2018-09-18 林潼 A kind of gesture identification interactive approach based on Leap Motion equipment
CN109455180A (en) * 2018-11-09 2019-03-12 百度在线网络技术(北京)有限公司 Method and apparatus for controlling unmanned vehicle
CN109597485A (en) * 2018-12-04 2019-04-09 山东大学 A kind of gesture interaction system and its working method based on two fingers angular domain feature
CN109635767A (en) * 2018-12-20 2019-04-16 北京字节跳动网络技术有限公司 A kind of training method, device, equipment and the storage medium of palm normal module
CN109889892A (en) * 2019-04-16 2019-06-14 北京字节跳动网络技术有限公司 Video effect adding method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152136B2 (en) * 2013-10-16 2018-12-11 Leap Motion, Inc. Velocity field interaction for free space gesture interface and control

Also Published As

Publication number Publication date
CN110263743A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN109584276B (en) Key point detection method, device, equipment and readable medium
EP3465620B1 (en) Shared experience with contextual augmentation
US11436863B2 (en) Method and apparatus for outputting data
CN108830235B (en) Method and apparatus for generating information
CN110188719B (en) Target tracking method and device
CN109993150B (en) Method and device for identifying age
CN109871800B (en) Human body posture estimation method and device and storage medium
CN110348419B (en) Method and device for photographing
CN110059623B (en) Method and apparatus for generating information
CN110458218B (en) Image classification method and device and classification network training method and device
JP7181375B2 (en) Target object motion recognition method, device and electronic device
WO2020211573A1 (en) Method and device for processing image
CN112051961A (en) Virtual interaction method and device, electronic equipment and computer readable storage medium
WO2022095674A1 (en) Method and apparatus for operating mobile device
US20140232748A1 (en) Device, method and computer readable recording medium for operating the same
CN111462238A (en) Attitude estimation optimization method and device and storage medium
CN110837332A (en) Face image deformation method and device, electronic equipment and computer readable medium
CN115937033A (en) Image generation method and device and electronic equipment
CN109829431B (en) Method and apparatus for generating information
CN112306235A (en) Gesture operation method, device, equipment and storage medium
CN110008926B (en) Method and device for identifying age
US20210158031A1 (en) Gesture Recognition Method, and Electronic Device and Storage Medium
CN112270242B (en) Track display method and device, readable medium and electronic equipment
CN111601129B (en) Control method, control device, terminal and storage medium
CN110263743B (en) Method and device for recognizing images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant