CN113630556A - Focusing method, focusing device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113630556A
Authority
CN
China
Prior art keywords: preset, target, determining, image, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111128515.6A
Other languages
Chinese (zh)
Inventor
孔祥晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202111128515.6A
Publication of CN113630556A
Legal status: Pending (current)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60: Control of cameras or camera modules
    • H04N23/61: Control of cameras or camera modules based on recognised objects
    • H04N23/611: Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • H04N23/67: Focus control based on electronic image sensor signals
    • H04N23/695: Control of camera direction for changing a field of view, e.g. pan, tilt or based on tracking of objects

Abstract

The present disclosure provides a focusing method, apparatus, electronic device, and computer-readable storage medium. First, a target video of a target area is acquired, preset objects in the target video are identified, and target objects performing a preset behavior are screened from the preset objects; then, the position of the target object is determined based on the target video, and the pose of the camera device is controlled so that the camera device focuses on the position of the target object. By locating through image recognition and focusing on the target object, the method can accurately determine the target object performing the preset behavior; compared with manually locating and focusing on the target object, it is markedly more efficient, and compared with locating and focusing on the target object through sound, it is markedly more accurate.

Description

Focusing method, focusing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing and computer vision technologies, and in particular, to a focusing method, an apparatus, an electronic device, and a storage medium.
Background
With the popularization of video conferencing, video calls and video-based social applications, quickly locating and focusing on the current speaker has become a technical problem to be solved in multi-person scenes.
At present, the orientation and focal length of a camera can be controlled by manual focusing so that the speaker is centered in the camera's field of view (FOV), but this approach is inefficient and impractical to perform repeatedly. Alternatively, the speaker can be located by sound-source estimation: the direction of the sound source is judged, and a focusing operation is then performed toward the identified direction; in practice, however, it is difficult to locate and focus on the sound source accurately.
Disclosure of Invention
The embodiment of the disclosure at least provides a focusing method, a focusing device, an electronic device and a computer readable storage medium.
In a first aspect, an embodiment of the present disclosure provides a focusing method, including:
acquiring a target video in a target area;
identifying preset objects in the target video, and screening target objects for executing preset behaviors from the preset objects;
and determining the position of the target object based on the target video, and controlling the pose of the camera device to focus the camera device on the position of the target object.
In this aspect, locating and focusing on the target object through image recognition allows the target object executing the preset behavior, for example the object that is speaking, to be determined accurately; this is markedly more efficient than manually locating and focusing on the target object, and markedly more accurate than locating and focusing on the target object through sound.
In a possible implementation manner, the identifying preset objects in the target video and screening target objects performing preset behaviors from the preset objects includes:
identifying a preset object in the target video, and determining the number of times the preset object executes a preset action within a preset time period;
and screening target objects for executing preset behaviors from the preset objects based on the times and a first preset threshold value.
In this embodiment, the preset action and the number of times it is executed indicate whether an object performs the preset behavior; for example, with mouth opening or closing as the preset action and speaking as the preset behavior, whether an object is speaking can be accurately determined from the number of times its mouth opens or closes.
In a possible implementation manner, the determining the number of times that the preset object performs the preset action within the preset time period includes:
acquiring at least one target image from the target video;
respectively determining first feature point information of a first preset part of the preset object in each target image;
and determining the number of times the first preset part executes the preset action within a preset time period based on the first feature point information respectively corresponding to the plurality of target images.
In this embodiment, the first preset part corresponds to the preset action and is the part that executes it; for example, when the preset action is opening or closing, the first preset part may be the mouth. The number of times the first preset part executes the preset action within the preset time period can therefore be determined more accurately from the first feature point information of the first preset part.
In a possible embodiment, the first preset part comprises the mouth; the preset action comprises opening or closing; the first feature point information comprises mouth key point information;
the determining, based on the first feature point information respectively corresponding to the plurality of target images, a number of times that the first preset portion executes a preset action within a preset time period includes:
for each target image, determining first distance information between two target key points of the mouth based on the mouth key point information corresponding to the target image;
and determining the number of times the mouth opens or closes within a preset time period based on the first distance information respectively corresponding to each target image and a second preset threshold.
In this embodiment, the first distance information between two specific key points of the mouth, namely the two target key points, differs significantly between the open state and the closed state, and the corresponding distance falls within a characteristic range in each state; the number of times the mouth opens or closes within the preset time period can therefore be determined more accurately from the determined first distance information.
In a possible implementation manner, before the determining the number of times the mouth opens or closes within a preset time period based on the first distance information respectively corresponding to each target image and the second preset threshold, the method further includes a step of determining the second preset threshold:
determining second feature point information of a second preset part of the preset object based on the target video;
determining second distance information between the preset object and a device for shooting the target video based on the second feature point information;
determining the second preset threshold based on the second distance information.
In this embodiment, the distance between the device capturing the target video and the preset object directly affects the determined first distance information, and only when the second preset threshold matches that distance can the number of times the mouth opens or closes within the preset time period be determined accurately based on the first distance information and the second preset threshold. For example, when the device capturing the target video is close to the preset object, the distance values of the determined first distance information are larger, and the second preset threshold must be set correspondingly larger, otherwise the count will be determined incorrectly. In this embodiment, the second distance information between the preset object and the device capturing the target video can be determined accurately based on the second feature point information of the second preset part, and the second preset threshold can then be determined accurately on that basis, which improves the accuracy of the determined count.
In one possible embodiment, the second preset part comprises the face; the second feature point information includes face key point information.
According to this embodiment, the second distance information between the preset object and the device capturing the target video can be determined accurately from the detected face key point information.
In a possible implementation manner, the identifying a preset object in the target video and determining the number of times that the preset object performs a preset action within a preset time period includes:
determining a sub-video corresponding to a second preset part of each preset object based on the target video;
for each preset object, determining whether the preset object in each sub-image in the sub-video executes a preset action or not based on the sub-video corresponding to the preset object, and obtaining a recognition result;
and determining the number of times the preset object executes the preset action within a preset time period based on the recognition result.
This embodiment can be realized with a trained model, such as an action recognition model obtained through multiple rounds of iterative training on a large number of sample images. Such a model has high detection precision, so whether the preset object executes the preset action, and in turn the number of times it does so within a preset time period, can be determined accurately. In addition, this embodiment extracts a sub-video covering only part of the image area from the target video for detection instead of processing the whole image directly, which effectively reduces the amount of data to be processed and improves detection efficiency.
In a possible implementation manner, after the identifying the preset object in the target video, the method further includes:
setting identity identification information for each preset object based on the target video, and determining the position of each preset object in the image of the target video;
aiming at each preset object, establishing a mapping relation between the identity identification information of the preset object and the position of the preset object in the image to which the preset object belongs;
the determining the position of the target object based on the target video comprises:
determining the position of the target object in the image based on the identification information of the target object and the mapping relation;
and determining the position of the target object in the physical world based on the position of the target object in the image.
According to this embodiment, the position of the target object in the image to which it belongs can be determined accurately based on the pre-established mapping between identity identification information and position; the position of the target object in the physical world can then be determined accurately from its position in that image, combined with the intrinsic and extrinsic parameters of the device capturing the target video and a pre-prepared map.
In a possible embodiment, the controlling the pose of the camera to focus the camera on the position of the target object includes:
determining orientation information of the camera device based on the determined position of the target object in the physical world;
based on the orientation information, controlling a pose of an imaging device to focus the imaging device on the position of the target object.
In this embodiment, once the position of the target object in the physical world is determined, the orientation information of the imaging device can be determined more accurately, and using that orientation information the imaging device can be adjusted to focus on the target object more precisely.
In a second aspect, embodiments of the present disclosure provide a focusing apparatus, including:
the image acquisition module is used for acquiring a target video in a target area;
the object positioning module is used for identifying preset objects in the target video and screening target objects for executing preset behaviors from the preset objects;
and the focusing module is used for determining the position of the target object based on the target video and controlling the pose of the camera device so as to focus the camera device on the position of the target object.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps in the first aspect or any of its possible implementation manners.
For the description of the effects of the focusing apparatus, the electronic device, and the computer-readable storage medium, reference is made to the description of the focusing method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It is appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art may derive additional related drawings from them without inventive effort.
Fig. 1 illustrates a flow chart of a focusing method provided by an embodiment of the present disclosure;
FIG. 2 illustrates a schematic diagram of key points of a mouth provided by an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a focusing apparatus provided by an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments, as generally described and illustrated in the figures, can be arranged and designed in a wide variety of configurations. Thus, the following detailed description of the embodiments is not intended to limit the scope of the claimed disclosure but merely represents selected embodiments. All other embodiments obtained by a person skilled in the art from the embodiments of the disclosure without creative effort fall within its protection scope.
Research shows that in the prior art the current speaker is generally located and focused on in two ways. First, the orientation and focal length of the camera are controlled by manual focusing so that the speaker is centered in the camera's field of view (FOV), but this is inefficient and impractical to perform repeatedly. Second, the speaker is located by the direction of the sound, that is, the sound direction is judged and a focusing operation is then performed toward the identified direction, but this method has large errors and judges the sound direction inaccurately.
In view of the above drawbacks, the present disclosure provides a focusing method, apparatus, electronic device, and computer-readable storage medium. First, a target video of a target area is acquired; then, preset objects in the target video are identified and target objects performing a preset behavior are screened from them; finally, the position of the target object is determined based on the target video, and the pose of the camera device is controlled so that it focuses on the position of the target object. By locating through image recognition and focusing on the target object, the method can accurately determine the target object performing the preset behavior, for example the object that is speaking; this is markedly more efficient than manually locating and focusing on the target object, and markedly more accurate than locating and focusing on the target object through sound.
The focusing method provided by the embodiments of the present disclosure is described below, taking as an example an execution subject that is a device with computing capability.
As shown in fig. 1, the focusing method provided by the present disclosure may include the steps of:
and S110, acquiring a target video in the target area.
The target area differs between application scenarios; for example, in a meeting scenario the target area may be a meeting room, and the target video is the video captured in the meeting room.
The device capturing the target video may be the imaging device, described below, that performs the focusing operation, or any device, adjustable or not, that can capture video within the target area.
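Illustratively, and not as part of the disclosed method, the acquisition step can be realized with a standard capture loop; the sketch below uses OpenCV in Python, and the source index and clip length are assumptions of this sketch.

```python
import cv2

def acquire_target_video(source=0, num_frames=150):
    """Grab a short clip (the 'target video') from a camera or file.

    `source` may be a device index or a video file path; `num_frames`
    bounds the clip length. Both values are illustrative choices.
    """
    cap = cv2.VideoCapture(source)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:          # stream ended or device unavailable
            break
        frames.append(frame)
    cap.release()
    return frames
```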
S120, identifying preset objects in the target video, and screening target objects executing preset behaviors from the preset objects.
A preset object is an object to be checked for execution of the preset behavior; before detection, at least some of the preset objects in the target video need to be identified based on the target video. After the preset objects are identified, a trained model can be used to determine whether each preset object executes the preset behavior, so that the target objects executing the preset behavior can be screened from the preset objects. Naturally, the target objects may also be screened in other ways; for example, the preset action corresponding to the preset behavior may be determined first, the preset objects executing that preset action are then identified through the target video, and the target objects are screened from among them.
In a conference scenario, the preset objects may be the persons participating in the conference, and the target object may be the person who is speaking. When screening for speaking persons, the target video is used to screen out those whose mouths keep opening or closing.
S130, determining the position of the target object based on the target video, and controlling the pose of the camera so that the camera focuses on the position of the target object.
When determining the position of the target object, any image of the target video that contains the target object may be used: first, the position of the target object in that image is determined; then, its position in the physical world is determined by combining the intrinsic and extrinsic parameters of the device capturing the target video with a pre-prepared map; finally, based on the determined physical-world position, the camera can be focused on the target object more accurately.
The image capturing device and the device for capturing the target video may be the same device or different devices, and the disclosure is not limited thereto.
This embodiment can accurately determine the target object executing the preset behavior, such as the object that is speaking, by locating and focusing on the target object through image recognition; this is markedly more efficient than manually locating and focusing on the target object, and markedly more accurate than locating and focusing on the target object through sound.
In some embodiments, the identifying of the preset objects in the target video and the screening of the target objects performing the preset actions from the preset objects may be implemented by the following steps:
firstly, a preset object in the target video is identified, and the times of executing a preset action by the preset object in a preset time period are determined. And then, screening target objects for executing preset behaviors from the preset objects based on the times and a first preset threshold value.
Illustratively, a preset object whose number of preset-action executions within the preset time period is greater than or equal to the first preset threshold is taken as a target object.
For example, when recognizing the preset objects and/or detecting the number of times a preset object executes the preset action within the preset time period, a trained model may be used to detect whether the preset object executes the preset action, and the count within the preset time period is then determined in combination with information such as the capture time of each image. The model is trained through multiple iterations on a large number of sample images containing different preset objects. The trained model has high detection precision: it can accurately identify whether a preset object executes the preset action, and the number of executions within the preset time period can then be determined accurately.
Alternatively, the part that executes the preset action may be determined first; that part is then detected by means of image feature detection, and the number of times the preset object executes the preset action within the preset time period is determined from the detection results.
The preset action and the number of times it is executed indicate whether an object performs the preset behavior; for example, with mouth opening or closing as the preset action and speaking as the preset behavior, whether an object is speaking can be accurately determined from the number of times its mouth opens or closes.
For example, in a conference scenario, the preset action may be set as the action of opening or closing the mouth, the first preset threshold may be set to 3, and the preset time period may be set to 5 ms; that is, if the mouth of a certain preset object is detected to open or close 3 or more times within 5 ms, that preset object is considered to be speaking and is the target object.
According to the embodiment, whether the object executes the preset behavior can be accurately determined through the execution times of the preset action and the first preset threshold.
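Illustratively, a minimal sketch of this screening rule, assuming the per-object counting described above is available elsewhere (all names here are illustrative, not from the disclosure):

```python
def screen_target_objects(action_counts, first_threshold=3):
    """Return the IDs of preset objects whose preset-action count
    within the preset time period reaches the first preset threshold.

    `action_counts` maps an object ID to the number of detected
    preset actions (e.g. mouth open/close events) in the window.
    """
    return [obj_id for obj_id, count in action_counts.items()
            if count >= first_threshold]
```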
In some embodiments, the number of times the preset object performs the preset action within the preset time period may be determined by:
firstly, acquiring at least one target image from the target video; then, respectively determining first feature point information of a first preset part of the preset object in each target image; and finally, determining the times of executing preset actions of the first preset part in a preset time period based on the first characteristic point information respectively corresponding to the plurality of target images.
The first preset part is the part with which the preset object performs the preset action; for example, when the preset action is opening or closing the mouth, the first preset part is the mouth, and the first feature point information then includes mouth key point information.
The first feature point information may be determined with a pre-trained model: for example, a trained face key point model determines the face key point information of a preset object, and the mouth key point information is then derived from it. Different face key point models detect different numbers of face key points and therefore yield different face key point information; illustratively, some models detect 106 face key points and others detect 240. The more key points mark the mouth, the more accurate the annotation and hence the first feature point information.
The first preset part corresponds to the preset action and is the part that executes it; for example, when the preset action is opening or closing, this part may be the mouth. The number of times the first preset part executes the preset action within the preset time period can therefore be determined more accurately from the first feature point information of the first preset part.
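For illustration, a sketch of extracting mouth key points from a full set of face key points; the index range is an assumption tied to one possible 106-point landmark layout, not something specified by this disclosure:

```python
import numpy as np

# Assumed indices of the mouth region within a 106-point face landmark
# layout; the exact indices depend on the landmark model actually used.
MOUTH_INDICES = list(range(84, 104))

def mouth_keypoints(face_landmarks: np.ndarray) -> np.ndarray:
    """Slice the mouth key points (the first feature point information)
    out of face key points given as an (N, 2) array of pixel coords."""
    return face_landmarks[MOUTH_INDICES]
```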
For example, determining the number of times the first preset part executes the preset action within the preset time period may be implemented as follows: first, for each target image, first distance information between two target key points of the mouth is determined based on the mouth key point information corresponding to that target image; then, the number of times the mouth opens or closes within the preset time period is determined based on the first distance information respectively corresponding to each target image and a second preset threshold.
The distance between two specific key points of the mouth, namely the target key points, varies with the action or state of the mouth and falls within a characteristic range in the open state and in the closed state respectively; the state of the mouth can therefore be determined from the first distance information between the two target key points, and the number of times the mouth opens or closes within the preset time period can be determined from the mouth states in consecutive frames.
Illustratively, as shown in fig. 2, points 98 and 102 may be selected as the target key points; when the distance between point 98 and point 102 exceeds the second preset threshold, it may be determined that the preset object has performed one mouth-opening action.
The first distance information between the two target key points of the mouth can be determined accurately from the determined mouth key point information, and accurate first distance information helps improve the accuracy of the determined count.
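A minimal sketch of this counting step, assuming per-frame landmark arrays are available and using the point indices 98 and 102 of fig. 2; the pixel threshold value and the state-transition counting rule are assumptions of this sketch:

```python
import numpy as np

def count_mouth_events(landmark_seq, p_top=98, p_bottom=102,
                       second_threshold=20.0):
    """Count mouth open/close events over a sequence of target images.

    `landmark_seq` is a list of (N, 2) key point arrays, one per frame;
    `p_top` and `p_bottom` are the two target key points of fig. 2;
    `second_threshold` is the second preset threshold in pixels.
    """
    events = 0
    was_open = False
    for pts in landmark_seq:
        dist = float(np.linalg.norm(pts[p_top] - pts[p_bottom]))
        is_open = dist > second_threshold
        if is_open != was_open:   # one opening or one closing
            events += 1
        was_open = is_open
    return events
```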
The distance between the device capturing the target video and the preset object directly affects the determined first distance information, and only when the second preset threshold matches that distance can the number of times the mouth opens or closes within the preset time period be determined accurately based on the first distance information and the second preset threshold. For example, when the device capturing the target video is close to the preset object, the distance values of the determined first distance information are larger, and the second preset threshold must be set correspondingly larger, otherwise the count will be determined incorrectly. Illustratively, the second preset threshold may be determined by the following steps:
step one, second feature point information of a second preset part of the preset object is determined based on the target video.
In a specific implementation, an image is selected from the target video and detected, and the second feature point information of the second preset part of the preset object is determined. Here the second feature point information may be the face key point information of the preset object's face.
Of course, the second preset portion is not limited in the present disclosure, and the second preset portion may be the same as the first preset portion, or may be another portion of the preset object, for example, a leg portion of the preset object.
Step two, second distance information between the preset object and the device capturing the target video is determined based on the second feature point information.
Illustratively, second distance information between a preset object and a device capturing the target video is determined using the facial keypoint information.
Step three, the second preset threshold is determined based on the second distance information.
For example, the larger the distance corresponding to the second distance information, the smaller the second preset threshold is set, and the smaller the distance, the larger the threshold. When the distance corresponding to the second distance information is determined to be 3 m, the second preset threshold may be set to the length of 20 pixels in an image of the target video; when the distance is determined to be 1 m, it may be set to the length of 40 pixels.
In this embodiment, the second distance information between the preset object and the device capturing the target video can be determined accurately based on the second feature point information of the second preset part, and the second preset threshold can then be determined accurately on that basis, which improves the accuracy of the determined count.
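A sketch of deriving the second preset threshold from subject distance, interpolating between the two example pairs given above (1 m to 40 px, 3 m to 20 px); the interpolation rule itself is an assumption, since a real system would calibrate this curve:

```python
import numpy as np

# Calibration pairs taken from the examples in the text:
# (subject distance in metres, second preset threshold in pixels).
CALIBRATION_DISTANCES_M = [1.0, 3.0]
CALIBRATION_THRESHOLDS_PX = [40.0, 20.0]

def second_preset_threshold(subject_distance_m: float) -> float:
    """Interpolate the pixel threshold from the subject's distance to
    the capture device; values outside the range are clamped."""
    return float(np.interp(subject_distance_m,
                           CALIBRATION_DISTANCES_M,
                           CALIBRATION_THRESHOLDS_PX))
```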
In some embodiments, the following steps may be further utilized to identify a preset object in the target video, and determine the number of times that the preset object performs a preset action within a preset time period:
firstly, determining a sub-video corresponding to a second preset part of each preset object based on the target video; then, for each preset object, determining whether the preset object in each sub-image in the sub-video executes a preset action based on the sub-video corresponding to the preset object to obtain a recognition result, exemplarily, inputting the sub-video corresponding to the preset object into a trained action recognition model, and determining whether the preset object in each sub-image in the sub-video executes the preset action to obtain the recognition result; and finally, determining the times of executing the preset action of the preset object in a preset time period based on the identification result.
The sub-video includes a plurality of sub-images, which may be obtained by cropping the image region of the second preset part from the images of the target video. Illustratively, each sub-image corresponds to the face of a preset object.
The action recognition model is obtained through multiple rounds of iterative training on a large number of sample images and therefore has high detection precision; whether the preset object executes the preset action, and in turn the number of times it does so within the preset time period, can thus be determined accurately. In addition, this embodiment extracts a sub-video covering only part of the image area from the target video for detection instead of processing the whole image directly, which effectively reduces the amount of data to be processed and improves detection efficiency.
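A sketch of this per-object sub-video path, under the assumption that a trained action-recognition model exposing a `predict(image) -> bool` interface is available; the face-box handling and model interface are illustrative, not mandated by the disclosure:

```python
def count_actions_in_subvideo(frames, face_box, model):
    """Crop the second preset part (here, a face box) from every frame
    to form the sub-video, classify each sub-image with the trained
    action-recognition model, and count positive detections.
    """
    x, y, w, h = face_box
    count = 0
    for frame in frames:
        sub_image = frame[y:y + h, x:x + w]   # crop instead of full frame
        if model.predict(sub_image):          # True = preset action seen
            count += 1
    return count
```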
In some embodiments, during or after the identification of the preset objects in the target video, identity identification information is set for each preset object, and the position of each preset object in the image to which it belongs is determined; then, for each preset object, a mapping relation is established between the identity identification information of the preset object and its position in the image to which it belongs.
When determining the position of the target object based on the target video, the position of the target object in the image to which it belongs is first determined based on the identity identification information of the target object and the mapping relation; then, based on the position of the target object in that image, the position of the target object in the physical world is determined.
The image to which the target object belongs may be any image of the target video that contains the target object.
The position of the target object in the image to which it belongs can be determined accurately based on the pre-established mapping relation between identity identification information and position; the position of the target object in the physical world can then be determined accurately from its position in that image, combined with the intrinsic and extrinsic parameters of the device capturing the target video and a pre-prepared map.
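For illustration, a sketch of the lookup-then-project step. The mapping content is invented for the example, and a floor-plane homography stands in for the combination of intrinsic/extrinsic parameters and a pre-prepared map, which is an assumption of this sketch:

```python
import numpy as np

# Mapping built while identifying the preset objects:
# identity identification info -> (u, v) pixel position in the image.
id_to_pixel = {"person_01": (640, 360), "person_02": (1020, 410)}

def world_position(obj_id: str, H_floor: np.ndarray) -> np.ndarray:
    """Look up the target object's pixel position via the mapping,
    then project it into map (physical-world) coordinates.

    `H_floor` is a 3x3 image-to-floor-plane homography assumed to be
    obtained from camera calibration against the pre-prepared map.
    """
    u, v = id_to_pixel[obj_id]
    p = H_floor @ np.array([u, v, 1.0])
    return p[:2] / p[2]   # (x, y) on the map
```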
After the position of the target object in the physical world is determined, the orientation information of the image pickup device is determined based on that position, and the pose of the image pickup device is then controlled based on the orientation information so that the image pickup device focuses on the position of the target object.
For example, when adjusting the camera to focus on the target object based on the orientation information, the current initial orientation information of the camera may first be acquired; the rotation angle through which the camera needs to turn is then determined based on the determined orientation information and the initial orientation information; the camera is driven to rotate by a driving device; and the camera's focus is then adjusted based on the distance between the camera and the target object.
Once the position of the target object in the physical world is determined, the orientation information of the imaging device can be determined more accurately, and using that orientation information the imaging device can be adjusted to focus on the target object more precisely.
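A sketch of the rotation-angle computation described above, assuming camera and target positions in a shared world frame; the axis conventions and degree units are assumptions of this sketch:

```python
import numpy as np

def required_rotation(target_xyz, camera_xyz,
                      current_pan_deg=0.0, current_tilt_deg=0.0):
    """Compute the pan/tilt rotation the camera needs so that it faces
    the target object's position in the physical world.
    """
    dx, dy, dz = np.subtract(target_xyz, camera_xyz)
    desired_pan = np.degrees(np.arctan2(dy, dx))                 # azimuth
    desired_tilt = np.degrees(np.arctan2(dz, np.hypot(dx, dy)))  # elevation
    return desired_pan - current_pan_deg, desired_tilt - current_tilt_deg
```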
This embodiment improves focusing efficiency: the target object executing the preset behavior can be identified in real time, and focusing can be completed at 30 fps or higher. It also improves the accuracy of target object screening; compared with methods such as sound-based recognition, its accuracy is very high and can exceed 90%.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, an embodiment of the present disclosure further provides a focusing apparatus corresponding to the focusing method. Since the principle by which the apparatus solves the problem is similar to that of the focusing method described above, the implementation of the apparatus may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 3, there is shown a schematic structural diagram of a focusing apparatus according to an embodiment of the present disclosure, the apparatus includes:
an image obtaining module 301, configured to obtain a target video in a target area.
An object positioning module 302, configured to identify preset objects in the target video, and screen a target object that executes a preset behavior from the preset objects.
A focusing module 303, configured to determine a position of the target object based on the target video, and control a pose of an image capturing apparatus to focus the image capturing apparatus on the position of the target object.
In some embodiments, the object positioning module 302, when identifying preset objects in the target video and screening target objects performing preset behaviors from the preset objects, is configured to:
identifying a preset object in the target video, and determining the number of times the preset object executes a preset action within a preset time period;
and screening target objects for executing preset behaviors from the preset objects based on the times and a first preset threshold value.
In some embodiments, the object positioning module 302, when determining the number of times the preset object performs the preset action within the preset time period, is configured to:
acquiring at least one target image from the target video;
respectively determining first feature point information of a first preset part of the preset object in each target image;
and determining the number of times the first preset part executes the preset action within a preset time period based on the first feature point information respectively corresponding to the plurality of target images.
In some embodiments, the first preset part comprises the mouth; the preset action comprises opening or closing; the first feature point information comprises mouth key point information;
the object positioning module 302 is configured to, when determining, based on the first feature point information respectively corresponding to the plurality of target images, the number of times that the first preset portion executes a preset action within a preset time period,:
for each target image, determining first distance information between two target key points of the mouth based on the mouth key point information corresponding to the target image;
and determining the number of times the mouth opens or closes within a preset time period based on the first distance information respectively corresponding to each target image and a second preset threshold.
In some embodiments, the object positioning module 302 is further configured to determine a second preset threshold before determining the number of times the mouth opens or closes within a preset time period based on the first distance information respectively corresponding to each target image and the second preset threshold:
determining second feature point information of a second preset part of the preset object based on the target video;
determining second distance information between the preset object and a device for shooting the target video based on the second feature point information;
determining the second preset threshold based on the second distance information.
In some embodiments, the second preset part comprises the face; the second feature point information includes face key point information.
In some embodiments, the object positioning module 302, when identifying a preset object in the target video and determining the number of times the preset object executes a preset action within a preset time period, is configured to:
determining a sub-video corresponding to a second preset part of each preset object based on the target video;
for each preset object, determining whether the preset object in each sub-image in the sub-video executes a preset action or not based on the sub-video corresponding to the preset object, and obtaining a recognition result;
and determining the number of times the preset object executes the preset action within a preset time period based on the recognition result.
In some embodiments, after the identification of the preset objects in the target video, the object positioning module 302 is further configured to:
setting identity identification information for each preset object based on the target video, and determining the position of each preset object in the image of the target video;
aiming at each preset object, establishing a mapping relation between the identity identification information of the preset object and the position of the preset object in the image to which the preset object belongs;
The focusing module 303, when determining the position of the target object based on the target video, is configured to:
determining the position of the target object in the image based on the identification information of the target object and the mapping relation;
and determining the position of the target object in the physical world based on the position of the target object in the image.
In some embodiments, the focusing module 303, when controlling the pose of the camera to focus the camera on the position of the target object, is configured to:
determining orientation information of the camera device based on the determined position of the target object in the physical world;
based on the orientation information, controlling a pose of an imaging device to focus the imaging device on the position of the target object.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Referring to fig. 4, the electronic device 400 provided in the embodiment of the present disclosure includes a processor 41, a memory 42 and a bus 43. The memory 42 stores execution instructions and includes an internal memory 421 and an external memory 422; the internal memory 421 temporarily stores operation data for the processor 41 and data exchanged with the external memory 422, such as a hard disk, and the processor 41 exchanges data with the external memory 422 through the internal memory 421. When the electronic device 400 runs, the processor 41 communicates with the memory 42 through the bus 43, so that the processor 41 executes the following instructions:
acquiring a target video in a target area; identifying preset objects in the target video, and screening target objects for executing preset behaviors from the preset objects; and determining the position of the target object based on the target video, and controlling the pose of the camera device to focus the camera device on the position of the target object.
Embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the focusing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the focusing method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the steps of the focusing method described in the above method embodiments, to which reference may be made and which are not repeated here. The computer program product may be implemented in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a software development kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a logical division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through communication interfaces, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope of the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by it. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A focusing method, comprising:
acquiring a target video in a target area;
identifying preset objects in the target video, and screening target objects for executing preset behaviors from the preset objects;
and determining the position of the target object based on the target video, and controlling the pose of the camera device to focus the camera device on the position of the target object.
2. The focusing method of claim 1, wherein the identifying preset objects in the target video and screening target objects performing preset behaviors from the preset objects comprises:
identifying a preset object in the target video, and determining the number of times the preset object executes a preset action within a preset time period;
and screening target objects for executing preset behaviors from the preset objects based on the times and a first preset threshold value.
3. The focusing method according to claim 2, wherein the determining the number of times the preset object performs the preset action within the preset time period comprises:
acquiring at least one target image from the target video;
respectively determining first feature point information of a first preset part of the preset object in each target image;
and determining the number of times the first preset part executes the preset action within a preset time period based on the first feature point information respectively corresponding to the plurality of target images.
4. The focusing method according to claim 3, wherein the first preset part comprises the mouth; the preset action comprises opening or closing; and the first feature point information comprises mouth key point information;
the determining, based on the first feature point information respectively corresponding to the plurality of target images, a number of times that the first preset portion executes a preset action within a preset time period includes:
for each target image, determining first distance information between two target key points of the mouth based on the mouth key point information corresponding to the target image;
and determining the number of times the mouth opens or closes within a preset time period based on the first distance information respectively corresponding to each target image and a second preset threshold.
5. The focusing method according to claim 4, further comprising, before the determining the number of times the mouth opens or closes within a preset time period based on the first distance information respectively corresponding to each target image and the second preset threshold, a step of determining the second preset threshold:
determining second feature point information of a second preset part of the preset object based on the target video;
determining second distance information between the preset object and a device for shooting the target video based on the second feature point information;
determining the second preset threshold based on the second distance information.
6. The focusing method according to any one of claims 2 to 5, wherein the identifying a preset object in the target video and determining the number of times that the preset object performs a preset action within a preset time period comprises:
determining a sub-video corresponding to a second preset part of each preset object based on the target video;
for each preset object, determining whether the preset object in each sub-image in the sub-video executes a preset action or not based on the sub-video corresponding to the preset object, and obtaining a recognition result;
and determining the number of times the preset object executes the preset action within a preset time period based on the recognition result.
7. The focusing method according to any one of claims 1 to 6, further comprising, after the identifying a preset object in the target video:
setting identity identification information for each preset object based on the target video, and determining the position of each preset object in the image of the target video;
aiming at each preset object, establishing a mapping relation between the identity identification information of the preset object and the position of the preset object in the image to which the preset object belongs;
the determining the position of the target object based on the target video comprises:
determining the position of the target object in the image based on the identification information of the target object and the mapping relation;
and determining the position of the target object in the physical world based on the position of the target object in the image.
8. The focusing method according to any one of claims 1 to 7, wherein the controlling the pose of an image pickup device to focus the image pickup device on the position of the target object comprises:
determining orientation information of the camera device based on the determined position of the target object in the physical world;
based on the orientation information, controlling a pose of an imaging device to focus the imaging device on the position of the target object.
9. A focusing apparatus, comprising:
the image acquisition module is used for acquiring a target video in a target area;
the object positioning module is used for identifying preset objects in the target video and screening target objects for executing preset behaviors from the preset objects;
and the focusing module is used for determining the position of the target object based on the target video and controlling the pose of the camera device so as to focus the camera device on the position of the target object.
10. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the focusing method of any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the focusing method as claimed in any one of the claims 1 to 8.
Application CN202111128515.6A; Priority Date: 2021-09-26; Filing Date: 2021-09-26; Title: Focusing method, focusing device, electronic equipment and storage medium; Status: Pending; Publication: CN113630556A (en)

Priority Applications (1)

Application Number: CN202111128515.6A; Priority Date: 2021-09-26; Filing Date: 2021-09-26; Title: Focusing method, focusing device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202111128515.6A; Priority Date: 2021-09-26; Filing Date: 2021-09-26; Title: Focusing method, focusing device, electronic equipment and storage medium

Publications (1)

Publication Number: CN113630556A; Publication Date: 2021-11-09

Family

ID: 78390645

Family Applications (1)

Application Number: CN202111128515.6A; Priority/Filing Date: 2021-09-26; Title: Focusing method, focusing device, electronic equipment and storage medium; Status: Pending

Country Status (1)

Country Link
CN (1) CN113630556A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103562821A (en) * 2011-04-28 2014-02-05 Nec软件系统科技有限公司 Information processing device, information processing method, and recording medium
CN105915798A (en) * 2016-06-02 2016-08-31 北京小米移动软件有限公司 Camera control method in video conference and control device thereof
CN109492506A (en) * 2017-09-13 2019-03-19 华为技术有限公司 Image processing method, device and system
CN111614928A (en) * 2020-04-28 2020-09-01 深圳市鸿合创新信息技术有限责任公司 Positioning method, terminal device and conference system
CN111753769A (en) * 2020-06-29 2020-10-09 歌尔科技有限公司 Terminal audio acquisition control method, electronic equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114594892A (en) * 2022-01-29 2022-06-07 深圳壹秘科技有限公司 Remote interaction method, remote interaction device and computer storage medium
CN114594892B (en) * 2022-01-29 2023-11-24 深圳壹秘科技有限公司 Remote interaction method, remote interaction device, and computer storage medium

Similar Documents

Publication Title
CN107748869B (en) 3D face identity authentication method and device
CN107609383B (en) 3D face identity authentication method and device
CN108229369B (en) Image shooting method and device, storage medium and electronic equipment
CN107886032B (en) Terminal device, smart phone, authentication method and system based on face recognition
CN110443110B (en) Face recognition method, device, terminal and storage medium based on multipath camera shooting
CN107633165B (en) 3D face identity authentication method and device
CN108197586B (en) Face recognition method and device
US10275672B2 (en) Method and apparatus for authenticating liveness face, and computer program product thereof
WO2019137131A1 (en) Image processing method, apparatus, storage medium, and electronic device
CN108280418A (en) The deception recognition methods of face image and device
CN106204435A (en) Image processing method and device
CN106295499B (en) Age estimation method and device
GB2560340A (en) Verification method and system
WO2020000912A1 (en) Behavior detection method and apparatus, and electronic device and storage medium
CN105528078B (en) The method and device of controlling electronic devices
CN107787463B (en) The capture of optimization focusing storehouse
CN105654033A (en) Face image verification method and device
CN105678242A (en) Focusing method and apparatus in the mode of holding certificate in hands
CN109934275A (en) Image processing method and device, electronic equipment and storage medium
CN110675426A (en) Human body tracking method, device, equipment and storage medium
CN110941992B (en) Smile expression detection method and device, computer equipment and storage medium
CN113630556A (en) Focusing method, focusing device, electronic equipment and storage medium
CN105827972A (en) Photographing control method and system based on intelligent terminal
TW201421101A (en) Method for automatically focusing on specific movable object, photographic apparatus including automatic focus function, and computer readable storage media for storing automatic focus function program
CN110012216A (en) Information acquisition method and device, intelligent terminal

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination