CN115499580B - Multi-mode fusion intelligent view finding method and device and image pickup equipment - Google Patents

Multi-mode fusion intelligent view finding method and device and image pickup equipment

Info

Publication number
CN115499580B
CN115499580B
Authority
CN
China
Prior art keywords
target
hand
mode
view finding
close
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210977301.4A
Other languages
Chinese (zh)
Other versions
CN115499580A (en)
Inventor
肖兵
陈瑞斌
邱俊锋
李正国
廖鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Shixi Technology Co Ltd
Original Assignee
Zhuhai Shixi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Shixi Technology Co Ltd filed Critical Zhuhai Shixi Technology Co Ltd
Priority to CN202210977301.4A priority Critical patent/CN115499580B/en
Publication of CN115499580A publication Critical patent/CN115499580A/en
Application granted granted Critical
Publication of CN115499580B publication Critical patent/CN115499580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • User Interface Of Digital Computer (AREA)
  • Studio Devices (AREA)

Abstract

The invention provides a multi-mode fusion intelligent view finding method and device and an image pickup apparatus. The method comprises the following steps: acquiring an original video image of a scene in which an image acquisition system is deployed, and identifying hand targets in the original video image; detecting a gesture control instruction triggered by a hand target; setting a view finding mode of the deployment scene based on the gesture control instruction; and determining a view finding target in the deployment scene according to the view finding mode, and outputting the region of interest associated with the view finding target in the original video image. With the invention, close-ups of people or objects can be achieved through simple gesture control, meeting the image output requirements of meeting or live-broadcast scenes.

Description

Multi-mode fusion intelligent view finding method and device and image pickup equipment
Technical Field
The invention relates to the technical field of intelligent control, in particular to a multi-mode fusion intelligent view finding method, device and camera equipment.
Background
With the development of image acquisition technology, intelligent view finding systems are evolving towards ultra-high definition, large field of view and greater intelligence. In the prior art, image processing is applied to the captured picture to realize portrait close-up: whether the scene contains a single person or multiple persons, the users in the picture are framed optimally and the redundant background around the portraits in the ultra-wide-angle view is reduced, which greatly improves the output effect of the scene image.
However, the existing intelligent view finding function generally performs a close-up of all users, that is, a multi-person close-up. A few products offer a single-person close-up mode, but the strategy is simple: close up the largest (or nearest) portrait in the scene, so other participants are never featured unless they move to the nearest position. In practice, a user may want the close-up target to be tracked continuously even when its position relative to other participants changes, and this strategy cannot satisfy that requirement. Moreover, the existing intelligent view finding function is relatively single-purpose and requires the user to adjust the close-up target manually, so the operation is cumbersome and cannot meet diversified user needs.
Disclosure of Invention
In view of the above problems, the present invention proposes a multi-mode fusion intelligent framing method, apparatus and image capturing device that overcomes or at least partially solves the above problems.
According to a first aspect of the present invention, there is provided a gesture-control-based multi-mode fusion intelligent view finding method, the method comprising:
acquiring an original video image of a scene in which an image acquisition system is deployed, and identifying a hand target in the original video image;
detecting a gesture control instruction triggered by the hand target;
setting a view finding mode of the deployment scene based on the gesture control instruction;
and determining a view finding target in the deployment scene according to the view finding mode, and outputting the region of interest associated with the view finding target in the original video image.
Optionally, the detecting of the gesture control instruction triggered by the hand target includes:
tracking the hand target and generating hand target tracking information of the hand target;
and identifying a corresponding gesture control instruction when judging that the hand target is in a gesture request state based on the hand target tracking information.
Optionally, the setting the view mode of the deployment scenario based on the gesture control instruction includes:
identifying the instruction type of the gesture control instruction, and acquiring preset framing mode switching logic corresponding to the instruction type;
setting a view finding mode of the deployment scene according to the view finding mode switching logic.
Optionally, the view finding mode is any one of a multi-person close-up mode, a single close-up mode, a panoramic mode and an object close-up mode.
Optionally, the determining a view finding target in the deployment scene according to the view finding mode, and outputting the region of interest associated with the view finding target in the original video image includes:
if the view finding mode is a multi-person close-up mode, identifying a plurality of first portrait targets corresponding to the gesture control instruction and taking the plurality of first portrait targets as view finding targets; and outputting the region of interest associated with the first portrait targets in the original video image so as to close up the plurality of first portrait targets;
if the view finding mode is a single close-up mode, identifying a second portrait target corresponding to the gesture control instruction and taking the second portrait target as the view finding target; and outputting the region of interest associated with the second portrait target in the original video image so as to close up the second portrait target;
if the view finding mode is a panoramic mode, taking the full scene of the deployment scene as the view finding target and outputting the original video image;
if the view finding mode is an object close-up mode, detecting an object target associated with the hand target, taking the hand target and/or the object target as the view finding target, and outputting the region of interest associated with the hand target and/or the object target in the original video image so as to close up the hand target and/or the object target.
Optionally, when the view finding mode is the single close-up mode and the region of interest associated with the second portrait target in the original video image is output to close up the second portrait target, the method further includes:
if a new hand target in the original video image is detected to trigger a gesture control instruction, identifying a third portrait target corresponding to the gesture control instruction;
and outputting the close-up image of the third portrait target to realize portrait switching in the single close-up mode.
Optionally, the outputting the region of interest associated with the hand target and/or the object target in the original video image to close up the hand target and/or the object target includes:
determining two hand targets associated with the object target and the hand detection frames of the two hand targets;
and taking the hand detection frames of the two hand targets as the region of interest, and cropping and scaling the original video image to close up the hand targets and/or the object target.
Optionally, the outputting the region of interest associated with the hand target and/or the object target in the original video image to close up the hand target and/or the object target includes:
determining two hand targets associated with the object target and the hand detection frames of the two hand targets;
determining an object detection frame of the object target;
and acquiring the common area of the hand detection frames of the two hand targets and the object detection frame, taking the common area as the region of interest, and cropping and scaling the original video image so as to close up the hand targets and/or the object target.
According to a second aspect of the present invention, there is provided a gesture-control-based multi-mode fusion intelligent view finding device, the device comprising:
an image acquisition module, used for acquiring an original video image of a scene in which the image acquisition system is deployed and identifying a hand target in the original video image;
a gesture control module, used for detecting a gesture control instruction triggered by the hand target;
a view finding mode control module, used for setting a view finding mode of the deployment scene based on the gesture control instruction;
and an intelligent view finding module, used for determining a view finding target in the deployment scene according to the view finding mode and outputting the region of interest associated with the view finding target in the original video image.
According to a third aspect of the present invention, there is provided a computer readable storage medium for storing program code for performing the method of any one of the first aspects.
According to a fourth aspect of the present invention, there is provided an image pickup apparatus including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of the first aspects according to instructions in the program code.
The invention provides a multi-mode fusion intelligent view finding method and device and an image pickup apparatus. A user can set and switch view finding modes through gesture-controlled interaction, and the system offers intelligent view finding in multiple modes such as multi-person close-up, single close-up, panoramic and object close-up, making the product experience more natural, convenient and intelligent. Especially in meeting-room conferences and live-broadcast scenes, close-ups of people or objects can be achieved through simple gesture control, meeting the image output requirements of such scenes. In addition, in smart-home application scenarios, a user can turn smart devices on or off, switch operation modes or control device functions through gestures, making the devices more intelligent and improving the user experience.
The foregoing description is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and implemented in accordance with the contents of the specification, and that the above and other objects, features and advantages of the invention may be more readily apparent, specific embodiments are set forth below.
The above, as well as additional objectives, advantages, and features of the present invention will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present invention when read in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a flow diagram of an intelligent framing method based on multi-mode fusion of gesture control according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of the effect of displaying an object with both hands according to an embodiment of the present invention;
FIG. 3 shows a schematic illustration of a close-up effect of an object according to an embodiment of the invention;
Fig. 4 shows a schematic diagram of a multi-mode fusion intelligent viewfinder according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides a gesture-control-based multi-mode fusion intelligent view finding method. As shown in fig. 1, the method provided by the embodiment of the invention may at least comprise the following steps S101 to S104.
S101, acquiring an original video image of a scene in which an image acquisition system is deployed, and identifying hand targets in the original video image. The image acquisition system of this embodiment may be equipped with one or more cameras, and the original video image may come from a video stream acquired by the image acquisition system in real time, or may be one of a sequence of video images acquired by the image acquisition system in advance.
Hand detection is performed on the acquired original video image to identify the hand targets in it. Optionally, the original video image may be input into a deep neural network model so that detection and identification of hands is carried out by the deep neural network. The deep neural network of this embodiment is a neural network model usable for multi-target detection and trained in advance to a converged state. Conventional object detection networks generally support multi-class detection, i.e. multiple types of objects can be detected in one inference as long as the training stage covers those types; typical examples include the YOLO series and SSD. This embodiment can therefore detect multiple target types with one deep neural network, i.e. several different kinds of targets, such as portraits and hands, can be detected and identified from the input image, as sketched below.
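A minimal sketch of this single-inference, multi-class detection step, assuming a generic detector. The `model.infer` interface, class names and score threshold are illustrative assumptions rather than the patent's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Detection:
    cls: str                                    # e.g. "hand", "head", "body"
    box: Tuple[float, float, float, float]      # (x, y, w, h) in original-image coordinates
    score: float

def detect_targets(frame: np.ndarray, model, score_thr: float = 0.5) -> List[Detection]:
    """Run one multi-class inference and keep confident detections."""
    raw = model.infer(frame)    # assumed interface: iterable of (class_name, box, score)
    return [Detection(c, tuple(b), s) for c, b, s in raw if s >= score_thr]

def split_hands_and_portraits(dets: List[Detection]):
    """Separate hand detections from portrait-type detections for the later steps."""
    hands = [d for d in dets if d.cls == "hand"]
    portraits = [d for d in dets if d.cls in ("face", "head", "head_shoulder", "body")]
    return hands, portraits
```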
S102, detecting a gesture control instruction triggered by a hand target. Each hand target identified in step S101 is tracked and monitored to detect whether it triggers a gesture control instruction. A gesture control instruction is a shortcut control means that directly controls the view finding mode the image acquisition system applies to the deployment scene. Every person in the deployment scene can trigger a gesture control instruction with his or her hand, different gestures can trigger different gesture control instructions, and the mapping can be set according to the requirements of different scene applications.
S103, setting a view finding mode of the deployment scene based on the gesture control instruction.
For the detected hand targets, whether a gesture control instruction is triggered can be monitored over the consecutive original video images collected, and once a gesture control instruction is determined to have been triggered, the view finding mode of the deployment scene is set in response to it. The view finding mode of this embodiment may be any one of a multi-person close-up mode, a single close-up mode, a panoramic mode and an object close-up mode.
S104, determining a view finding target in the deployment scene according to the view finding mode, and outputting the region of interest associated with the view finding target in the original video image.
After determining the view finding mode of the deployment scene, the corresponding view finding target in the view finding mode can be determined, and then the region of interest associated with the determined view finding target is output. The region of interest associated with the viewing target is an image region containing the viewing target in the original video image, and the region of interest can be determined according to a detection frame of the viewing target during target detection.
According to the method provided by the embodiment of the invention, the view finding mode can be set and switched through gesture-controlled interaction, and intelligent view finding is available in multiple modes such as multi-person close-up, single close-up, panoramic and object close-up, so that the product experience is more natural, convenient and intelligent. Especially in meeting-room conferences and live-broadcast scenes, close-ups of people or objects can be achieved through simple gesture control, meeting the image output requirements of those scenes. The method can also meet the view finding requirements of other scenes, such as product launch events, teaching scenes or other multi-person/multi-object scenes. Further, after step S104, when a specific gesture control instruction is detected, the current view finding mode may be exited to display the panoramic image.
Step S102 requires detecting the gesture control instruction triggered by a hand target. Specifically, each hand target identified in step S101 can be tracked and monitored, and whether it triggers a gesture control instruction is detected in time; this may include the following steps A1 to A2.
A1, tracking the hand target and generating hand target tracking information for it. The hand targets detected in step S101 are tracked to obtain the corresponding hand tracking information such as the ID and trajectory of each hand target. For hand target tracking, the SORT (Simple Online and Realtime Tracking) algorithm is preferred; it is a classical detection-based multi-target tracking algorithm with a small computational load and high running speed. Of course, other tracking algorithms such as DeepSORT may also be used if the application requires it and the computing resources allow.
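A minimal sketch of step A1. SORT itself uses a Kalman filter and Hungarian matching; the sketch below substitutes a greedy IoU association to stay short, so it is an assumption-laden simplification rather than the algorithm itself. The `HandTrack` fields (`skip_count`, `n_stable`) anticipate the stability analysis described later; all thresholds are illustrative.

```python
import itertools

_next_id = itertools.count(1)

def iou(a, b):
    """IoU of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

class HandTrack:
    def __init__(self, box):
        self.id = next(_next_id)
        self.box = box            # last associated detection box (x, y, w, h)
        self.skip_count = 0       # consecutive frames without an associated detection
        self.n_stable = 0         # consecutive stable frames, used in A2-2-4 below

def update_tracks(tracks, hand_boxes, match_thr=0.3, max_skip=5):
    """One tracking step: associate detections to tracks, spawn new tracks, retire lost ones."""
    unmatched = list(range(len(hand_boxes)))
    for t in tracks:
        best, best_iou = None, match_thr
        for i in unmatched:
            v = iou(t.box, hand_boxes[i])
            if v > best_iou:
                best, best_iou = i, v
        if best is not None:
            t.box, t.skip_count = hand_boxes[best], 0
            unmatched.remove(best)
        else:
            t.skip_count += 1
    tracks = [t for t in tracks if t.skip_count < max_skip]
    tracks.extend(HandTrack(hand_boxes[i]) for i in unmatched)
    return tracks
```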
A2, when the hand target is judged to be in the gesture request state based on the hand target tracking information, a corresponding gesture control instruction is recognized.
For each hand target, whether the hand is in a gesture request state can be monitored and analyzed, and when any hand target is judged to be in the gesture request state, the gesture type posed by that hand is identified. In a preferred embodiment of the invention, a second deep neural network model, different from the network used for hand target detection, classifies only the hands determined to be in the gesture request state; it is a classification model for recognizing gestures. For classification, the hand image is cropped from the original image according to the hand detection frame and then fed into the classification network, whose inference yields the corresponding gesture category, from which the corresponding gesture control instruction is determined.
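A hedged sketch of this crop-and-classify step; `gesture_net.predict` and the confidence threshold are assumed placeholders for the separate classification network described above, not an actual API.

```python
import numpy as np

def classify_gesture(frame: np.ndarray, hand_box, gesture_net, conf_thr: float = 0.6):
    """Crop the hand patch by its detection frame and classify it with a separate network."""
    x, y, w, h = [int(round(v)) for v in hand_box]
    x0, y0 = max(x, 0), max(y, 0)
    crop = frame[y0:y0 + h, x0:x0 + w]        # hand image cut out of the original frame
    if crop.size == 0:
        return None
    label, conf = gesture_net.predict(crop)   # assumed interface of the classification model
    return label if conf >= conf_thr else None
```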
Further, when judging whether the hand target is in the gesture request state based on the hand target tracking information, step A2 includes the following steps A2-1 to A2-4.
A2-1, acquiring the hand target detection frame corresponding to the hand target, and analyzing the stable state of the hand target according to the hand target detection frame. The stable state of the hand target in this embodiment mainly refers to a stable hand posture. When the deep neural network model detects a hand target in the original video image, the background generates a detection frame for the hand area at the identified position; this frame is something the background processing can use, or that can be seen when demonstrating the algorithm effect, but it is not normally displayed on the original video image. Optionally, in this embodiment any hand may be tracked, and whether the hand size and position are stable may be judged from the hand target detection frame.
Optionally, the hand posture stability determination computes the IoU between the current hand target detection frame B_current and the stable hand target detection frame B_stable: if the IoU exceeds a threshold T_hand_iou, the hand posture is considered stable; otherwise it is considered unstable, and the current detection frame B_current is used to update the stable detection frame B_stable.
In addition, the standard deviation of each detection frame parameter can be used as the metric: for any tracked hand target, the standard deviations of the detection frame parameters over the preceding and current frames (or over a window of historical frames) are computed, where the parameters may be (x, y, w, h), (cx, cy, w, h), (x1, y1, x2, y2) and the like; if every standard deviation is smaller than a preset threshold, the hand posture is considered stable, otherwise it is considered unstable.
Hand tracking is a form of multi-target tracking, which usually assigns each tracked target a lost-frame counter (skip_count) recording the number of consecutive frames in which the target was not successfully associated with a detection frame, and sets a unified maximum lost-frame threshold (T_max_skip_count); when skip_count ≥ T_max_skip_count, the target is considered to have disappeared and is removed from the tracking list. This embodiment additionally uses the number of lost frames of a hand target as a stability metric: a lost-frame threshold T_hand_skip is set, and for any tracked hand target, if its lost-frame count does not exceed T_hand_skip (preferably 1), the hand detection result is considered stable, otherwise it is considered unstable.
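A minimal sketch of the two stability criteria above, reusing the `iou` helper and `HandTrack` fields from the tracking sketch; the threshold values are illustrative assumptions, not taken from the patent.

```python
T_HAND_IOU = 0.7      # illustrative IoU threshold between current and stable boxes
T_HAND_SKIP = 1       # maximum tolerated lost-frame count

def hand_is_stable(track) -> bool:
    """Combine the posture criterion (IoU with the stable box) and the lost-frame criterion."""
    stable_box = getattr(track, "stable_box", None)
    if stable_box is not None and iou(track.box, stable_box) >= T_HAND_IOU:
        pose_stable = True
    else:
        track.stable_box = track.box     # refresh B_stable with B_current
        pose_stable = False
    return pose_stable and track.skip_count <= T_HAND_SKIP
```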
A2-2, determining a portrait target associated with the hand target.
For portrait target detection, the deep neural network model used for hand target detection can synchronously detect and identify the input original video image to obtain the reference portrait targets in it, and the detected portraits are then tracked with multi-target tracking. A portrait target can be at least one of a face, a head, a head-and-shoulders region, a full body, etc. Optionally, associating a portrait target with the hand target may include:
A2-2-1, determining the gesture candidate regions of all reference portrait targets.
A2-2-2, determining the portrait target associated with the hand target according to the positional relationship between the hand target detection frame and each gesture candidate region. The gesture candidate region is preferably a region around the upper body of the gesture initiator, and it should cover most of the area in which a hand may appear when the initiator, facing the human-computer interaction device, makes an interaction gesture naturally in front of the chest or beside the shoulders: for example, vertically above the abdomen and below the forehead, and horizontally between the two arms; on this basis it can be fine-tuned according to actual needs.
The gesture candidate region can be calculated from the portrait detection frame of the gesture initiator, and when the position of the gesture initiator changes the corresponding gesture candidate region is adjusted automatically. Taking a head target as an example, a vertical offset ratio S_y of the gesture candidate region relative to the head detection frame is preset, and the ratios of the gesture candidate region's width and height to those of the head detection frame are S_w and S_h respectively; that is, the size of the gesture candidate region is determined by scaling the detection frame in each direction. For any head detection frame (x_head, y_head, w_head, h_head), the parameters of the corresponding gesture candidate region (x_roi, y_roi, w_roi, h_roi) are calculated as follows:
w_roi = w_head * S_w
h_roi = h_head * S_h
x_roi = x_head + w_head / 2 - w_roi / 2
y_roi = y_head + h_head * S_y
wherein S_w, S_h and S_y may be set according to different requirements, which is not limited in this embodiment.
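A direct transcription of the four formulas above into code; the concrete ratio values are examples only, since the embodiment leaves them open.

```python
S_W, S_H, S_Y = 3.0, 1.8, 0.8     # illustrative width/height/vertical-offset ratios

def gesture_candidate_region(head_box):
    """Compute (x_roi, y_roi, w_roi, h_roi) from a head box (x_head, y_head, w_head, h_head)."""
    x_head, y_head, w_head, h_head = head_box
    w_roi = w_head * S_W
    h_roi = h_head * S_H
    x_roi = x_head + w_head / 2 - w_roi / 2
    y_roi = y_head + h_head * S_Y
    return (x_roi, y_roi, w_roi, h_roi)
```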
After the data association processing, if a hand is associated with a portrait target, the subsequent steps continue to be performed for that hand. Of course, it is also possible to respond only to gesture commands from specific portrait targets: that is, the portrait target associated with the hand target is determined, and the subsequent steps proceed only when the portrait target ID or other feature information shows that it belongs to a preset portrait target with gesture control authority.
A2-2-3, analyzing the stable state of the portrait posture of the portrait target.
Considering that gesture control is performed by the user in an intentional state (as opposed to the casual state when not interacting), and in order to ensure the reliability of gesture control and avoid unnecessary subsequent hand classification computation, the stability of the portrait also needs to be analyzed. Specifically, for any hand target associated with a portrait target, whether the corresponding portrait detection frame is stable is judged, and only if it is does processing continue for that hand target.
A2-2-4, judging whether the hand target is in a stable gesture request state by combining the stable state of the hand target and the stable state of the portrait target.
For any hand target, if it simultaneously meets the hand stability condition, is associated with a portrait target, and that portrait target is itself relatively stable, then the hand target is considered to be in an instantaneously stable state at the current frame (or moment). Further, for each hand target, the number of consecutive stable frames n_stable (or the stable duration t_stable) is counted, and a unified threshold N_stable on the number of consecutive stable frames (or a duration threshold T_stable) is preset; if n_stable ≥ N_stable (or t_stable ≥ T_stable), the hand target is considered to be in the gesture request (ready) state, otherwise it is in a non-gesture-request state. Note that the consecutive stable frame count n_stable (or duration t_stable) is reset and accumulated anew whenever any of the conditions in steps A2-2-1 to A2-2-3 is not satisfied.
This gesture stability determination only responds when the request is judged to be valid, which meets the product requirements of the corresponding scenes, saves computation and is quite efficient, and it also avoids degrading the user experience through momentary misjudgment of hand actions.
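A hedged sketch of step A2-2-4, building on `hand_is_stable` from the sketch above; `portrait_is_stable` is passed in as a callable so the portrait-side check stays abstract, and `N_STABLE` is an illustrative threshold.

```python
N_STABLE = 15     # illustrative: ~0.5 s of consecutive stable frames at 30 fps

def in_gesture_request_state(track, portrait, portrait_is_stable) -> bool:
    """Accumulate consecutive stable frames for one hand track and compare with N_STABLE."""
    ok = (hand_is_stable(track)
          and portrait is not None
          and portrait_is_stable(portrait))
    track.n_stable = track.n_stable + 1 if ok else 0   # reset whenever any condition fails
    return track.n_stable >= N_STABLE
```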
In an alternative embodiment of the present invention, the setting of the view mode of the deployment scenario in step S103 based on the gesture control instruction may include the following steps B1 to B2.
B1, identifying the instruction type of the gesture control instruction, and acquiring preset framing mode switching logic corresponding to the instruction type.
B2, setting a view finding mode of the deployment scene according to the view finding mode switching logic.
The view finding mode switching logic may include how the view finding mode is switched, how the view finding target is confirmed, and so on. For example, the switching logic may specify whether the current view finding mode must be confirmed before switching and whether an intermediate transition mode is needed: when switching from the panoramic mode to the single close-up mode, the system may first pass through the multi-person close-up mode and then continue to the single close-up mode, or it may switch directly from the panoramic mode to the single close-up mode. Of course, in practical applications the switching time points in the mode-switching process, the maximum switching frequency within a fixed period and the like may also be set, which is not limited in the embodiment of the invention. By matching different gesture types to different instruction types, the corresponding view finding mode switching logic can be preset for each instruction type according to user or scene requirements.
In practical applications, the gestures initiated by hand targets may include one-hand gestures and two-hand gestures. Common one-hand gestures include the numbers 1 to 9, "thumbs-up", "OK", "finger gun", "one-hand heart", "I love you", and so on. Two-hand gestures can be divided into contact gestures (the two hands touch, such as "two-hand heart", "hands arched", "palms together", etc.) and non-contact gestures. A contact two-hand gesture in principle needs to be detected as a separate target type and then further classified, while a non-contact two-hand gesture can be composed from the one-hand gesture types. How gestures are matched to gesture control instructions in the different view finding modes can be set according to different requirements; the embodiment of the invention does not limit this matching relationship. An illustrative mapping is sketched below.
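An illustrative sketch of mapping recognized gestures to view finding modes; the concrete gesture-instruction pairing and the switching constraints are product decisions that the patent leaves open, so every entry here is an assumption.

```python
GESTURE_TO_MODE = {
    "number_1": "single_close_up",
    "number_5": "multi_person_close_up",
    "thumb_up": "panorama",
    # the object close-up mode is entered via the two-hand object-display state described later
}

def apply_mode_switch(state: dict, gesture_label: str) -> dict:
    """Resolve a recognized gesture to a view finding mode and apply simple switching logic."""
    target = GESTURE_TO_MODE.get(gesture_label)
    if target is None or target == state.get("mode"):
        return state
    # hook for switching logic: an intermediate transition mode or a minimum dwell time
    # between switches could be enforced here, as the description suggests
    state["mode"] = target
    return state
```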
In step S104, the way the view finding target is determined and the way the close-up image is output also differ between view finding modes.
1. The view finding mode is a multi-person close-up mode.
A plurality of first portrait targets corresponding to the gesture control instruction are identified and taken as the view finding targets, and the region of interest associated with the first portrait targets in the original video image is output to close up the plurality of first portrait targets.
That is, if the current view finding mode is the multi-person close-up mode, the region of the image that contains the multiple first portrait targets may be used as the region of interest, achieving the multi-person close-up. The original video image can be cropped and scaled to obtain an image containing the multiple portrait targets and then output, as sketched below. Optionally, each of the multiple portrait targets can also be cropped and scaled individually, and the resulting portraits stitched together for display in a particular layout, for example side by side, in a four-grid or nine-grid arrangement, or overlaid on the original image. In this embodiment, determining the portrait targets associated with each hand target may follow the association method between hand targets and portrait targets described in A2-2 above, which is not repeated here.
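A minimal sketch of the multi-person close-up output, assuming OpenCV for the crop-and-scale step; aspect-ratio preservation and letterboxing of a real product are omitted, and the output size is an example.

```python
import cv2
import numpy as np

def union_box(boxes):
    """Smallest box covering all portrait detection boxes (x, y, w, h)."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    return (x1, y1, x2 - x1, y2 - y1)

def close_up(frame: np.ndarray, roi, out_size=(1280, 720)) -> np.ndarray:
    """Crop the region of interest from the original frame and scale it to the output size."""
    h_img, w_img = frame.shape[:2]
    x, y, w, h = [int(round(v)) for v in roi]
    x, y = max(x, 0), max(y, 0)
    crop = frame[y:min(y + h, h_img), x:min(x + w, w_img)]
    return cv2.resize(crop, out_size)
```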
2. The view finding mode is a single close-up mode.
A second portrait target corresponding to the gesture control instruction is identified and taken as the view finding target, and the region of interest associated with the second portrait target in the original video image is output so as to close up the second portrait target. If the current view finding mode is the single close-up mode, the portrait target ID of the single-close-up gesture initiator can be recorded in the first frame after the mode switch; once the corresponding portrait detection frame is determined, the image inside it is cropped and scaled, and in the first and all subsequent frames only the single portrait detection frame corresponding to that portrait ID is cropped and scaled to realize the close-up display.
Further, if a new hand target in the original video image is detected to trigger a gesture control instruction, a third portrait target corresponding to that instruction is identified, and a close-up image of the third portrait target is output to realize portrait switching in the single close-up mode. That is, in the single close-up mode, when another participant requests a single close-up through a gesture, the portrait ID of the new participant is recorded and mapped to the corresponding single-person detection frame, so that the close-up subject in the single close-up mode is switched.
In this embodiment, identifying the second portrait target corresponding to the triggered gesture control instruction includes: performing portrait detection on the original video image to identify the reference portrait targets in it; determining the gesture candidate region of every reference portrait target; acquiring the hand target detection frame corresponding to the hand target; and determining the portrait target associated with the hand target according to the positional relationship between the hand target detection frame and each gesture candidate region, taking that portrait target as the second portrait target corresponding to the gesture control instruction. If the hand target detection frame is contained in any gesture candidate region, the reference portrait target corresponding to that gesture candidate region is taken as the portrait target associated with the hand target; or, if the overlap proportion between the hand target detection frame and any gesture candidate region exceeds a preset value, the reference portrait target corresponding to that gesture candidate region is taken as the portrait target associated with the hand target, as sketched below. The specific way of determining the portrait target associated with the hand target may otherwise follow the method described in A2-2 above and is not repeated here.
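A hedged sketch of the association rule just described, reusing `gesture_candidate_region` from the earlier sketch; the `head_box` attribute on portrait tracks and the overlap threshold are illustrative assumptions.

```python
OVERLAP_THR = 0.8     # illustrative preset value for the overlap proportion

def contains(outer, inner):
    """True if the inner (x, y, w, h) box lies entirely within the outer box."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ix >= ox and iy >= oy and ix + iw <= ox + ow and iy + ih <= oy + oh

def overlap_ratio(region, hand_box):
    """Fraction of the hand box area that falls inside the gesture candidate region."""
    rx, ry, rw, rh = region
    hx, hy, hw, hh = hand_box
    iw = max(0.0, min(rx + rw, hx + hw) - max(rx, hx))
    ih = max(0.0, min(ry + rh, hy + hh) - max(ry, hy))
    return (iw * ih) / (hw * hh) if hw * hh > 0 else 0.0

def associate_hand_to_portrait(hand_box, portraits):
    """Return the portrait whose candidate region contains (or sufficiently overlaps) the hand."""
    for p in portraits:
        region = gesture_candidate_region(p.head_box)
        if contains(region, hand_box) or overlap_ratio(region, hand_box) >= OVERLAP_THR:
            return p
    return None
```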
3. The view finding mode is a panoramic mode.
The full scene of the deployment scene is taken as the view finding target, and the original video image is output.
If the current view finding mode is the panoramic mode, the whole original video image can be used directly as the region of interest and output for display, without confirming any portrait detection frame. Optionally, entering and exiting the intelligent view finding process may themselves be handled intelligently: when a specific gesture control instruction is detected the intelligent view finding process is entered automatically, and otherwise it is exited automatically. When the view finding mode is recognized as the panoramic mode, the intelligent view finding process can exit automatically and the original video image is output directly; when a switch to another view finding mode is needed, the intelligent view finding process can be started again automatically.
4. The view mode is an object close-up mode.
An object target associated with the hand targets is detected, the hand targets and/or the object target are taken as the view finding target, and the region of interest associated with the hand targets and/or the object target in the original video image is output so as to close up the hand targets and/or the object target. In an object close-up, either the object target alone or the hand targets together with the object target can be featured.
In one aspect, the method for closing up the hand targets and/or the object target in this embodiment may include: determining two hand targets associated with the object target and the hand detection frames of the two hand targets; and taking the hand detection frames of the two hand targets as the region of interest, and cropping and scaling the original video image to close up the hand targets and/or the object target.
In this embodiment, if the current view finding mode is the object close-up mode, whether the user is in a state of displaying an object with both hands is determined from the hand targets and the corresponding hand tracking information, as shown in fig. 2. If so, the hand IDs of the two hands are recorded, the region spanned by their detection frames is taken as the region of interest, and the image covering the two hand detection frames is cropped and scaled according to the two IDs, so that the hands holding the object are automatically given a close-up and, naturally, the object is featured at the same time, as shown in fig. 3. A minimal sketch of this region computation follows.
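A minimal sketch of the two-hand region of interest; the relative margin, which echoes the optional scaling of the common area mentioned below, is an assumption rather than a value from the patent.

```python
MARGIN = 0.15     # illustrative expansion so the held object stays fully in frame

def object_close_up_roi(hand_box_a, hand_box_b, margin: float = MARGIN):
    """Region spanned by the two hand detection frames, expanded by a relative margin."""
    x1 = min(hand_box_a[0], hand_box_b[0])
    y1 = min(hand_box_a[1], hand_box_b[1])
    x2 = max(hand_box_a[0] + hand_box_a[2], hand_box_b[0] + hand_box_b[2])
    y2 = max(hand_box_a[1] + hand_box_a[3], hand_box_b[1] + hand_box_b[3])
    w, h = x2 - x1, y2 - y1
    return (x1 - w * margin, y1 - h * margin, w * (1 + 2 * margin), h * (1 + 2 * margin))
```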
On the other hand, the method of this embodiment may further include: determining two hand targets associated with the object target and the hand detection frames of the two hand targets; determining an object detection frame of the object target; and acquiring the common area of the hand detection frames of the two hand targets and the object detection frame, taking the common area as the region of interest, and cropping and scaling the original video image so as to close up the hand targets and/or the object target.
Alternatively, the two hand detection frames may be replaced, as the region of interest, by the common hand-and-object region spanned by both hands. Furthermore, the two-hand detection frames or the common hand-and-object region can be scaled as needed before being handed to intelligent view finding, so that the close-up magnification of the object is adjusted without modifying the internal logic and parameters of the intelligent view finding module, thereby preserving its compatibility. The various view finding modes described above can be switched among freely; the user only needs to make the corresponding gesture.
For example, while a handheld article is being displayed it may be desirable to close up the object, i.e. to change from a portrait close-up to an object close-up. Of course, in some scenarios the user may need to leave the close-up temporarily and enter the panoramic mode in order to see the full picture. In practical applications the area of the object detection frame of the object target can also be used as the region of interest, i.e. only the object target is featured; the choice depends on the application scene and on the type and size of the object target, and is not limited by the embodiment of the invention.
It should be noted that accurately determining the user's intention to display an object is very challenging, whether dynamic or static gesture recognition is used. For this reason, the embodiment of the invention defines a specific object close-up scenario: when the user wants an object close-up, he or she holds the object in front of the chest for a period of time (e.g. 2-3 s), as shown in fig. 2. Accordingly, the decision strategy for judging that the user is in the object-showing state is: based on the gesture detection, judge whether two hands are associated with one portrait; if so, further check that the corresponding gesture request state has remained stable for n_stable consecutive frames (or for a time period t_stable).
Further, optionally, after entering the object close-up mode the user does not need to keep the hands and the object in a fixed pose and position: the object can be held while moving, or the hands can shift to more natural poses, such as pointing at the object while presenting it. As long as the two corresponding hand targets can still be detected and tracked, the close-up image continues to follow the hands, i.e. the object remains in close-up tracking. Accordingly, when the user puts the object down, the hands drop naturally, and the object close-up is exited as soon as at least one of the two tracked hand targets is lost.
To close up an object, besides determining the user's intention, the area in which the object lies must also be determined. The conventional way to obtain the object area is object detection. In reality, however, the objects users display vary widely in category and shape and appear against endlessly varied backgrounds, so achieving a reliable and stable object detection effect is extremely challenging. For the specific and fairly common case of a user holding an object in both hands, the object lies between the two hands, and hand detection is easy to make robust, so the displayed object can be neatly contained in the region spanned by the two hands, or in the common hand-and-object region they form (see the common area of the hands and the object shown in fig. 2). This scheme is simple and easy to implement: it meets the product requirement, reuses existing modules and techniques to save development cost, and carries no concern about computational cost.
Based on the same inventive concept, the embodiment of the present invention further provides a gesture-control-based multi-mode fusion intelligent view finding device, as shown in fig. 4. The device of this embodiment may include:
The image acquisition module 410 is configured to acquire an original video image of a scene in which the image acquisition system is deployed, and identify hand targets in the original video image;
the gesture control module 420 is configured to detect a gesture control instruction triggered by a hand target;
the view finding mode control module 430 is configured to set a view finding mode of the deployment scene based on the gesture control instruction;
the intelligent view finding module 440 is configured to determine a view finding target in the deployment scene according to the view finding mode, and output the region of interest associated with the view finding target in the original video image.
In an alternative embodiment of the present invention, gesture control module 420 may also be configured to:
tracking the hand target and generating hand target tracking information of the hand target;
and identifying a corresponding gesture control instruction when the hand target is judged to be in the gesture request state based on the hand target tracking information.
In an alternative embodiment of the present invention, the viewfinder mode control module 430 may also be configured to:
identifying an instruction type of a gesture control instruction, and acquiring preset framing mode switching logic corresponding to the instruction type;
setting a view finding mode of the deployment scene according to the view finding mode switching logic.
Optionally, the view mode is any one of a multi-person close-up mode, a single close-up mode, a panoramic mode, and an object close-up mode.
In an alternative embodiment of the present invention, the smart viewfinder module 440 may also be used to:
if the view finding mode is a multi-person close-up mode, identify a plurality of first portrait targets corresponding to the gesture control instruction and take the plurality of first portrait targets as view finding targets, and output the region of interest associated with the first portrait targets in the original video image so as to close up the plurality of first portrait targets;
if the view finding mode is a single close-up mode, identify a second portrait target corresponding to the gesture control instruction and take the second portrait target as the view finding target, and output the region of interest associated with the second portrait target in the original video image so as to close up the second portrait target;
if the view finding mode is a panoramic mode, take the full scene of the deployment scene as the view finding target and output the original video image;
if the view finding mode is an object close-up mode, detect an object target associated with the hand target, take the hand target and/or the object target as the view finding target, and output the region of interest associated with the hand target and/or the object target in the original video image so as to close up the hand target and/or the object target.
In an alternative embodiment of the present invention, the smart viewfinder module 440 may also be used to:
if a new hand target in the original video image is detected to trigger a gesture control instruction, identify a third portrait target corresponding to the gesture control instruction;
and outputting a close-up image of the third portrait target to realize portrait switching in a single close-up mode.
In an alternative embodiment of the present invention, the smart viewfinder module 440 may also be used to:
determining two hand targets associated with the object target and a hand detection frame of the two hand targets;
and take the hand detection frames of the two hand targets as the region of interest, and crop and scale the original video image to close up the hand targets and/or the object target.
In an alternative embodiment of the present invention, the smart viewfinder module 440 may also be used to:
determining two hand targets associated with the object target and a hand detection frame of the two hand targets;
determining an object detection frame of an object target;
and acquire the common area of the hand detection frames of the two hand targets and the object detection frame, take the common area as the region of interest, and crop and scale the original video image so as to close up the hand targets and/or the object target.
The embodiment of the invention also provides a computer readable storage medium for storing program code for executing the method described in the above embodiment.
The embodiment of the invention also provides an image pickup apparatus, which comprises a processor and a memory: the memory is used for storing program code and transmitting it to the processor; the processor is configured to execute the method of the above embodiment according to the instructions in the program code. Of course, the image pickup apparatus further includes optical components for image capture, such as a lens and optical filters, as well as other common basic parts of an image pickup apparatus such as a housing.
It will be clear to those skilled in the art that the specific working processes of the above-described systems, devices, modules and units may refer to the corresponding processes in the foregoing method embodiments, and for brevity, the description is omitted here.
In addition, each functional unit in the embodiments of the present invention may be physically independent, two or more functional units may be integrated together, or all functional units may be integrated in one processing unit. The integrated functional units may be implemented in hardware or in software or firmware.
Those of ordinary skill in the art will appreciate that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product stored in a storage medium, which includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB disk, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a personal computer, a server, or a computing device such as a network device) associated with program instructions, where the program instructions may be stored on a computer-readable storage medium, and where the program instructions, when executed by a processor of the computing device, perform all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all technical features thereof can be replaced by others within the spirit and principle of the present invention; such modifications and substitutions do not depart from the scope of the invention.

Claims (10)

1. An intelligent framing method based on multi-mode fusion of gesture control is characterized by comprising the following steps:
acquiring an original video image of a scene in which an image acquisition system is deployed, and identifying a hand target in the original video image;
detecting a gesture control instruction triggered by the hand target, comprising: tracking the hand target and generating hand target tracking information of the hand target; and, when judging that the hand target is in a gesture request state based on the hand target tracking information, identifying a corresponding gesture control instruction;
setting a view finding mode of the deployment scene based on the gesture control instruction; the view finding mode is any one of a multi-person close-up mode, a single close-up mode, a panoramic mode and an object close-up mode;
determining a view finding target in the deployment scene according to the view finding mode, and outputting the region of interest associated with the view finding target in the original video image;
wherein determining that the hand target is in a gesture request state based on the hand target tracking information comprises:
acquiring a hand target detection frame corresponding to a hand target, and analyzing the stable state of the hand target according to the hand target detection frame;
determining a portrait target associated with the hand target;
analyzing the stable state of the portrait target;
and judging whether the hand target is in a stable gesture request state or not by combining the stable state of the hand target and the stable state of the portrait target.
2. The method of claim 1, wherein the setting a viewfinder mode of the deployment scenario based on the gesture control instruction comprises:
identifying the instruction type of the gesture control instruction, and acquiring preset framing mode switching logic corresponding to the instruction type;
setting a view finding mode of the deployment scene according to the view finding mode switching logic.
3. The method of claim 1, wherein determining a view finding target in the deployment scene according to the view finding mode, and outputting the region of interest associated with the view finding target in the original video image comprises:
if the view finding mode is a multi-person close-up mode, identifying a plurality of first portrait targets corresponding to the gesture control instruction, and taking the plurality of first portrait targets as view finding targets; and outputting the region of interest associated with the first portrait targets in the original video image so as to close up the plurality of first portrait targets;
if the view finding mode is a single close-up mode, identifying a second portrait target corresponding to the gesture control instruction, and taking the second portrait target as the view finding target; and outputting the region of interest associated with the second portrait target in the original video image so as to close up the second portrait target;
if the view finding mode is a panoramic mode, taking the full scene of the deployment scene as the view finding target, and outputting the original video image;
if the view finding mode is an object close-up mode, detecting an object target associated with the hand target, taking the hand target and/or the object target as the view finding target, and outputting the region of interest associated with the hand target and/or the object target in the original video image so as to close up the hand target and/or the object target.
4. The method of claim 3, wherein, when the view finding mode is the single close-up mode and the region of interest associated with the second portrait target in the original video image is output to close up the second portrait target, the method further comprises:
if a new hand target in the original video image is detected to trigger a gesture control instruction, identifying a third portrait target corresponding to the gesture control instruction;
and outputting the close-up image of the third portrait target to realize portrait switching in a single close-up mode.
5. The method of claim 3, wherein said outputting the region of interest associated with the hand target and/or the object target in the original video image to close up the hand target and/or the object target comprises:
determining two hand targets associated with the object target and the hand detection frames of the two hand targets;
and taking the hand detection frames of the two hand targets as the region of interest, and cropping and scaling the original video image to close up the hand targets and/or the object target.
6. The method of claim 3, wherein said outputting the region of interest associated with the hand target and/or the object target in the original video image to close up the hand target and/or the object target comprises:
determining two hand targets associated with the object target and the hand detection frames of the two hand targets;
determining an object detection frame of the object target;
and acquiring the common area of the hand detection frames of the two hand targets and the object detection frame, taking the common area as the region of interest, and cropping and scaling the original video image so as to close up the hand targets and/or the object target.
7. A gesture-controlled multi-mode fusion intelligent view finding device, characterized in that the device comprises:
an image acquisition module, used for acquiring an original video image of the scene where the image acquisition system is deployed, and identifying a hand target in the original video image;
a gesture control module, used for detecting that the hand target triggers a gesture control instruction, comprising: tracking the hand target and generating hand target tracking information of the hand target; and, when it is judged based on the hand target tracking information that the hand target is in a gesture request state, identifying the corresponding gesture control instruction; wherein judging that the hand target is in the gesture request state based on the hand target tracking information comprises: acquiring a hand detection frame corresponding to the hand target, and analyzing the stable state of the hand target according to the hand detection frame; determining a portrait target associated with the hand target; analyzing the stable state of the portrait target; and judging whether the hand target is in a stable gesture request state by combining the stable state of the hand target and the stable state of the portrait target;
a view finding mode control module, used for setting a view finding mode of the deployment scene based on the gesture control instruction, the view finding mode being any one of a multi-person close-up mode, a single close-up mode, a panoramic mode and an object close-up mode;
and an intelligent view finding module, used for determining a view finding target in the deployment scene according to the view finding mode, and outputting a region of interest associated with the view finding target in the original video image.
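An illustrative sketch (not part of the claims) of the stability check performed by the gesture control module of claim 7: a hand target only counts as a stable gesture request when both its own detection frame and the frame of its associated portrait target have stayed nearly still over recent frames. The window length and displacement threshold are assumptions:

```python
# Illustrative only: joint stability check of the hand target and its associated
# portrait target over a sliding window of detection frames.
from collections import deque

def _is_stable(history, max_center_shift=10.0):
    """history holds (x, y, w, h) boxes for the last N frames."""
    if len(history) < history.maxlen:
        return False
    centers = [(x + w / 2.0, y + h / 2.0) for x, y, w, h in history]
    xs, ys = zip(*centers)
    return (max(xs) - min(xs) <= max_center_shift and
            max(ys) - min(ys) <= max_center_shift)

class GestureRequestJudge:
    def __init__(self, window=8):
        self.hand_history = deque(maxlen=window)
        self.portrait_history = deque(maxlen=window)

    def update(self, hand_box, portrait_box) -> bool:
        """Feed per-frame detection frames; True when the gesture request state is stable."""
        self.hand_history.append(hand_box)
        self.portrait_history.append(portrait_box)
        return _is_stable(self.hand_history) and _is_stable(self.portrait_history)
```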
8. An image pickup apparatus, characterized in that the image pickup apparatus includes a processor and a memory:
the memory is used for storing program code and transmitting the program code to the processor;
the processor is configured to perform the method of any one of claims 1-6 according to instructions in the program code.
9. A computer readable storage medium for storing program code for performing the method of any one of claims 1-6.
10. A gesture-controlled multi-mode fusion intelligent view finding system, characterized by comprising the intelligent view finding device as claimed in claim 7.
CN202210977301.4A 2022-08-15 2022-08-15 Multi-mode fusion intelligent view finding method and device and image pickup equipment Active CN115499580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210977301.4A CN115499580B (en) 2022-08-15 2022-08-15 Multi-mode fusion intelligent view finding method and device and image pickup equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210977301.4A CN115499580B (en) 2022-08-15 2022-08-15 Multi-mode fusion intelligent view finding method and device and image pickup equipment

Publications (2)

Publication Number Publication Date
CN115499580A CN115499580A (en) 2022-12-20
CN115499580B true CN115499580B (en) 2023-09-19

Family

ID=84466743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210977301.4A Active CN115499580B (en) 2022-08-15 2022-08-15 Multi-mode fusion intelligent view finding method and device and image pickup equipment

Country Status (1)

Country Link
CN (1) CN115499580B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167230A (en) * 2011-12-17 2013-06-19 富泰华工业(深圳)有限公司 Electronic equipment and method controlling shooting according to gestures thereof
CN104333748A (en) * 2014-11-28 2015-02-04 广东欧珀移动通信有限公司 Method, device and terminal for obtaining image main object
WO2015139562A1 (en) * 2014-03-19 2015-09-24 华为技术有限公司 Method for implementing video conference, synthesis device, and system
CN105657272A (en) * 2016-02-04 2016-06-08 上海卓易科技股份有限公司 Terminal equipment and photographing method thereof
CN107517351A (en) * 2017-10-18 2017-12-26 广东小天才科技有限公司 The switching method and equipment of a kind of exposal model
WO2018214078A1 (en) * 2017-05-24 2018-11-29 深圳市大疆创新科技有限公司 Photographing control method and device
CN110337622A (en) * 2018-08-31 2019-10-15 深圳市大疆创新科技有限公司 Vertical tranquilizer control method, vertical tranquilizer and image acquisition equipment
CN110839128A (en) * 2018-08-16 2020-02-25 杭州海康威视数字技术股份有限公司 Photographing behavior detection method and device and storage medium
CN111580652A (en) * 2020-05-06 2020-08-25 Oppo广东移动通信有限公司 Control method and device for video playing, augmented reality equipment and storage medium
WO2021169686A1 (en) * 2020-02-26 2021-09-02 Oppo广东移动通信有限公司 Photo capture control method and apparatus and computer readable storage medium
TW202203149A (en) * 2020-07-10 2022-01-16 美商高通公司 Panorama centering user interface for camera
CN114025082A (en) * 2021-10-20 2022-02-08 上海微创医疗机器人(集团)股份有限公司 Image exposure imaging method, imaging device and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9465461B2 (en) * 2013-01-08 2016-10-11 Leap Motion, Inc. Object detection and tracking with audio and optical signals

Also Published As

Publication number Publication date
CN115499580A (en) 2022-12-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant