WO2022160748A1

WO2022160748A1 - Video processing method and apparatus

Info

Publication number: WO2022160748A1
Application number: PCT/CN2021/120411
Authority: WO
Inventors: 陈文明; 邓高锋; 张世明; 吕周谨; 倪世坤
Original assignee: 深圳壹秘科技有限公司
Priority date: 2021-01-29
Filing date: 2021-09-24
Publication date: 2022-08-04
Also published as: CN112907617A; CN112907617B

Abstract

A video processing method and apparatus. The method comprises: acquiring a sensor frame captured by a video sensor, the sensor frame being an image box of an entire frame captured by the video sensor; detecting target boxes in the sensor frame, wherein the target boxes are human body image boxes and/or image boxes comprising a human body in the sensor frame; determining a field of view box according to the target boxes, wherein the field of view box is an image box comprising all the target boxes; determining all target boxes that can determine boundaries of the field of view box, and determining whether all the target boxes that can determine the boundaries of the field of view box are stationary; and when it is determined that all the target boxes that can determine the boundaries of the field of view box are stationary, outputting the field of view box. According to the solution, automatic and real-time tracking of participants in a conference hall can be implemented.

Description

A video processing method and device thereof

technical field

The invention relates to the technical field of video processing, and in particular, to the technical field of video processing for portrait tracking.

Background technique

With the rapid development of science and technology today, the conference mode in which people communicate remotely on the network through audio and video terminals has become very common. The video images of one conference site are acquired through cameras, transmitted from the video image to the other conference site, and displayed on the display device of the other conference site.

However, if the participants only occupy a part of the venue space when participating in the conference, the camera device of the venue needs to automatically track and focus the participants. The empty space occupies the screen and makes the screen of the participants smaller. In this way, it is not conducive to the exchanges between the participants on both sides.

In the existing audio and video calling products, some use the motor to control the auto focus, but these products sometimes make mistakes, such as: focus on the foreground or background instead of the subject, or lock on other things If the light is dim, it will also have a great impact on the autofocus. Moreover, auto-focusing takes a certain amount of time, the time delay is relatively large, and the real-time performance is relatively weak. Other products use hardware hinges to control the steering of the lens, such as: adding connected sensors, alarms, pan-tilts, and lens controllers to achieve search and target locking, but in conference scenarios, if the lens controller is used to Control the direction of the camera, so that the participants will always pay attention to the direction of the camera so that the video shot in the venue will have the best effect, which is not conducive to the progress of the conference.

SUMMARY OF THE INVENTION

The present application provides a video processing method and device that can automatically track participants in a conference venue.

This application provides the following technical solutions:

In one aspect, a video processing method is provided, comprising: acquiring a sensor frame captured by a video sensor, where the sensor frame is an image frame of an entire frame captured by a video sensor; detecting a target frame in the sensor frame, the The target frame is a human body image frame and/or an image frame containing a human body in the sensor frame; a visual field frame is determined according to the target frame; wherein, the visual field frame is an image frame including all the target frames; The target frame of the boundary of the visual field frame, and determine whether all the target frames that can determine the boundary of the visual field frame are stationary; when it is determined that all the target frames that can determine the boundary of the visual field frame are stationary , output the field of view frame.

In another aspect, a video processing device is provided, comprising: a video acquisition unit for acquiring a sensor frame captured by a video sensor, where the sensor frame is an image frame of an entire frame captured by the video sensor; a humanoid capture unit for A target frame in the sensor frame is detected, and the target frame is a human body image frame and/or an image frame containing a human body in the sensor frame; a video detection unit is used to determine a field of view frame according to the target frame; determining the target frame of the boundary of the field of view frame, and determining whether all the target frames that can determine the boundary of the field of view frame are still; wherein, the field of view frame is an image frame including all the target frames; The image processing unit outputs the view frame when it is determined that all the target frames that can determine the boundary of the view frame are stationary.

The beneficial effect of the present application is that a complete image is acquired through the sensor, and the human body in the sensor frame is detected to determine the image range that needs to be displayed to the user, that is, the field of view frame. When it is determined that all the people in the venue who have an influence on the output target frame are in a stationary state, the visual field frame is output and displayed. Because real-time monitoring is required for each sensor frame, the position changes of the participants at the venue can be captured in real time. When the movement of the target frame affects the boundary of the field of view frame, according to the solution of the present application, then The new field of view will be recalculated and output, thus enabling automatic, real-time tracking of participants in the venue.

Description of drawings

FIG. 1 is a system architecture diagram of an application of an embodiment of the present application.

FIG. 2 is a flowchart of a video processing method according to Embodiment 1 of the present application.

FIG. 3 is a flowchart of specific steps of determining a field of view frame according to a target frame in Embodiment 1 of the present application.

FIG. 4 is a schematic diagram of extending up and down all target frames in Embodiment 1 of the present application.

FIG. 5 is a schematic diagram of a field of view frame in Embodiment 1 of the present application.

FIG. 6 is a flowchart of determining a target frame that can determine the boundary of the field of view frame in Embodiment 1 of the present application.

FIG. 7 is a schematic diagram of cropping a sensor frame to obtain a field of view frame in Embodiment 1 of the present application.

FIG. 8 is a schematic diagram of smoothing a video image in Embodiment 1 of the present application.

FIG. 9 is a schematic block diagram of a video processing apparatus according to Embodiment 2 of the present application.

FIG. 10 is a schematic structural diagram of a video processing apparatus according to Embodiment 3 of the present application.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described herein are only used to explain the present application, but not to limit the present application. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that a thorough and complete understanding of the present disclosure is provided.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein in the specification of the application are for the purpose of describing particular embodiments only, and are not intended to limit the application.

It should be understood that the terms "system" or "network" are often used interchangeably herein. The term "and/or" in this article is only an association relationship to describe the associated objects, indicating that there can be three kinds of relationships, for example, A and/or B, it can mean that A exists alone, A and B exist at the same time, and A and B exist independently B these three cases. In addition, the character "/" in this document generally indicates that the related objects are an "or" relationship.

The embodiments of the present application may be applied to various camera devices or systems, for example, a camera device, a network camera device, and a conference terminal of an audio-video conference, and the specific device or system is not limited by the embodiments of the present application.

Please refer to FIG. 1 , which shows a system architecture diagram 100 applied by an embodiment of the present application. The system architecture 100 includes: a camera device 110 , a main processing device 120 and a display device 130 . The camera device 110 , the main processing device 120 , and the display device 130 may be communicatively connected through one of electrical connection, network connection, communication connection, and the like. The camera device 110 includes a video sensor for acquiring sensor frames. After the main processing device 120 processes the sensor frames, the field of view frame is sent to the display device 130 for display.

The camera device 110, the main processing device 120 and the display device 130 may be three mutually independent hardware entities; alternatively, the camera device 110 and the main processing device 120 may be set in the same hardware entity, for example, in a camera device In addition to including a video sensor, it also includes a device for processing video images; alternatively, the main processing device 120 and the display device 130 may be set in the same hardware entity, for example, in the display device 130, in addition to including The display also includes a device for processing video images. The camera device 110 sends the acquired field of view frame to the display device 130, and the display device 130 processes the field of view frame before displaying it on the display. Specifically, the camera device 110 may be a camera, the display device 130 may be a display, a projector, a computer screen, etc., and the main processing device 120 may be a processing device built into the camera device 110 and the display device 130, or an independent processing device The processing device, such as a computer or other electronic device, such as a mobile intelligent electronic device, which can communicate with the device 110 and the display device 130, respectively.

In the conference scene, the meeting place is fixed. In small and medium-sized conference venues, the camera can use a high-definition wide-angle lens to obtain images of the entire venue. As a result, cameras can capture every participant in real time. Hereinafter, in this application, the image frame of the entire frame captured by the video sensor is referred to as the sensor frame, and the human body image frame and/or the image frame containing the human body in the sensor frame is referred to as the target frame, which will include all the above The image frame of the target frame is called the field of view frame. The technical solutions of the present application will be described below through specific embodiments.

Embodiment 1

Please refer to FIG. 2 , which shows a video processing method provided in Embodiment 1 of the present application. The method can be applied to the camera device 110 with video processing capability, can be applied to the display device 130 with video processing capability, and can also be applied to the independent main processing device 120 . The video processing method includes:

S210: Acquire a sensor frame captured by a video sensor, where the sensor frame is an image frame of an entire frame captured by the video sensor; optionally, acquire a sensor frame captured by a high-definition wide-angle camera, for example, the lens part of the camera adopts 4K The lens (5 million pixels or more) and a wide-angle lens, so that when there are more participants in a multi-person conference scene, it can also ensure that all participants are included in the view of the lens. In the range, it can also ensure the clarity of the video; the sensor in the camera mainly converts the optical signal received by the lens into an electrical signal, and then the electrical signal (ie the video signal) is transmitted to the real-time image frame. main processing device 120;

S220, detect a target frame in the sensor frame, where the target frame is a human body image frame and/or an image frame including a human body in the sensor frame; optionally, a method for detecting a human body includes but is not limited to face detection , upper body detection, lower body detection, human body pose estimation (SPPE, DensePose) and other methods; it should be noted that the human body referred to in this application may include the entire body of a person, or may refer to a part of the entire body, such as a face or upper body;

S230, determining a field of view frame according to the target frame; wherein, the field of view frame is an image frame including all the target frames;

S240, determine all the target frames that can determine the boundary of the field of view frame, and determine whether all the target frames that can determine the boundary of the field of view frame are static;

S250, when it is determined that all the target frames that can determine the boundary of the view frame are stationary, output the view frame. Optionally, after the output, the field of view frame may be directly displayed on the device running the method, or the field of view frame may be output to other display devices to display the field of view frame by means of wireless or limited transmission.

Referring to FIG. 3, optionally, S230, determining a visual field frame according to the target frame, including:

S231, expand the height of a certain proportion to the upper and lower parts of all the target frames;

Please refer to Fig. 4, expand the height of a certain proportion to the upper and lower of all target frames, such as e*H, e is the scale coefficient, and H is the height of the corresponding target frame;

S232, determine a minimum frame that can contain all the expanded target frames, which is the field of view frame;

Referring to Figure 5, draw a minimum frame View _O that can include all the expanded target frames.

Optionally, on the basis of the above steps S231 and S232, the range that needs to be displayed to the user is determined. However, the field of view frame at this time may not conform to the displayed size, or does not conform to the displayed length ratio, etc. Require. You can further adjust the visual field frame. Therefore, in S230, the visual field frame is determined according to the target frame, and may further include the following adjustment mode 1 and/or adjustment mode 2.

Please continue to view Figure 3, adjustment method 1: adjust the size of the field of view frame. That is, step S230 further includes:

S233, if the coordinates of the four vertices of the view frame exceed the maximum boundary coordinates of the view frame, replace the coordinates of the four vertices of the view frame with the maximum boundary coordinates; and/or,

S234, if the height value of the view frame is smaller than the minimum height value of the view frame, adjust the height value of the view frame to the minimum height value of the view frame; and/or,

S235 , if the width value of the view frame is smaller than the minimum width value of the view frame, adjust the width value of the view frame to the minimum width value of the view frame.

For example, the maximum value of the preset visual field frame is View _max and the minimum width and height are W _min and H _min respectively. Among them, View _max is generally predefined as the size of the sensor original image, W _min , H _min are set according to the local area of the sensor original image that needs to be enlarged, and the smaller W _min and H _min are set, the larger the local area that can be enlarged Small. Then, the coordinates of the visual field frame cannot exceed View _max , and the width/height values cannot be less than W _min /H _min , and the coordinates of the minimum frame View _O exceeding the boundary or insufficient are corrected. The field of view frame obtained after coordinate correction is marked as View _F .

The specific amendment rules are as follows:

The coordinates of the 4 points of the View _O box must be within the range of View _max coordinates, and the coordinates beyond the maximum boundary are replaced by the maximum boundary coordinates.

The width/height values of View _O must be greater than or equal to W _min /H _min . If the width/height of View _O is less than W _min /H _min , the width/height of View _O is supplemented to W _min /H _min .

Optionally, step S234 specifically includes: supplementing one-half of the difference between the minimum height value of the field of view frame and the height value of the field of view frame to the upper and lower boundaries of the field of view frame. If the upper boundary or lower boundary of the visual field frame exceeds the maximum boundary of the visual field frame, the coordinates beyond the maximum boundary are replaced by the maximum boundary coordinates, and the numerical value beyond the maximum boundary is supplemented to the opposite boundary.

Optionally, step S235 specifically includes: supplementing half of the difference between the minimum width value of the field of view frame and the width value of the field of view frame to the left and right boundaries of the field of view frame, if all subsequent additions are added. The left boundary or the right boundary of the visual field frame exceeds the maximum boundary of the visual field frame, then the coordinates beyond the maximum boundary are replaced with the coordinates of the maximum boundary, and the numerical value beyond the maximum boundary is added to the opposite side. boundary.

Please continue to view Figure 3, adjustment method 2: adjust the aspect ratio of the field of view frame. That is, step S230 further includes:

S236: Adjust the width value and/or the height value of the field of view frame according to the aspect ratio of the current video resolution. The field of view frame obtained after adjustment in step S236 is marked as View. In a preferred embodiment, the field of view frame is the field of view frame that is output and displayed to the user.

In the specific embodiment of the present application, the above-mentioned adjustment method 1 and adjustment method 2 of the field of view frame can be used either, or both adjustment methods can be used, first adjust the size with adjustment method 1, and then adjust with adjustment method 2 Aspect ratio.

The above steps S231 to S236 are summarized into a visual frame calculation function:

Among them, Rect ^ti is the set of target frames detected at time ti.

Referring to FIG. 6, optionally, in S240, determining the target frame that can determine the boundary of the field of view frame specifically includes:

S2411, calculating and obtaining a first field of view frame according to all target frames;

S2412, delete a described target frame;

S2413, calculating and obtaining a second field of view frame according to the remaining target frame;

S2414, when the first view frame and the second view frame are not equal, determine that the deleted target frame is the target frame that can determine the boundary of the view frame. The so-called first view frame is equal to the second view frame, and the boundary coordinates of the first view frame and the second view frame are the same or similar; the so-called first view frame and the second view frame are not equal, and the first At least one of the boundary coordinates of the view frame is different from the boundary coordinates of the second view frame.

Specifically, the following will be described in conjunction with the view frame calculation function, how S2411 to S2414 determine whether a target can determine the boundary of the view frame:

j∈1,2...n ^ti

Among them, a frame rect _j is removed from the target frame Rect ^ti monitored at time ti to obtain a new set

by

As a basis, a field of view frame is calculated

if

It means that the target frame rect _j will not affect the calculation result of the field of view frame, otherwise if

It means that the target frame rect _j will determine the boundary coordinates of the view frame. Take all the target frames in Rect ^ti that will determine the judgment of the view frame, and get DecisionRect ^ti . DecisionRect ^ti is the set of target boxes that can determine the boundary of the view frame View at time ti. Optionally, in S240, it is determined that all the target frames that can determine the boundary of the field of view frame are static, specifically including:

S242, if the motion factor of each target frame that can determine the boundary of the field of view frame within a preset time interval is smaller than a preset threshold, then determine that all the target frames that can determine the boundary of the field of view frame are all to a static state.

In the specific embodiment of the present application, the manner of determining the motion factor Factor ₁₂ of the target frame is as follows:

After the detection unit receives a sensor frame transmitted by the sensor, it will perform real-time detection on the sensor frame. First, the human body is detected, and the target frame containing the human body is framed, which is called target frame 1 here. Assuming that the upper left corner of the sensor frame is the coordinate origin (0,0), the coordinate of the center point C1 of the target frame 1 is calculated as (x1 ,y1), width W1, height H1, and the result will be saved.

Next, after the detection unit receives the next sensor frame transmitted by the sensor, it also performs real-time detection on the next sensor frame. Use the same method to frame the target frame 2 containing the human body, and save the coordinates of the center point C2 of the target frame 2 (x2, y2), the length W2, and the height H2.

Then follow the steps (1) to (5) below to calculate the motion factor:

(1) Calculate the square of the Euclidean distance of the center point: L _c =(x ₂ -x ₁ ) ² +(y ₂ -y ₁ ) ²

(2) Calculate the area of the target frame 1 S ₁ =W ₁ *H ₁

(3) Calculate the area of target frame 2 S ₂ =W ₂ *H ₂

(4) Considering that the size of target frame 1 and target frame 2 are different, calculate the absolute value of the product of the width difference value and the height difference value

M=|(W ₁ -W ₂ )*(H ₁ -H ₂ )|

(5) Calculate the motion factors of target frame 1 and target frame 2 Facter ₁₂ =(L _c +M)/(S ₁ +S ₂ ).

It should be noted that, in the specific implementation of the present application, it is only necessary to detect a human body, and it is not necessary to determine which specific person it is according to the image, but it can be determined within a limited time according to the target frame of the human body. The distance moved within the range to determine whether it is the same person.

The above is to calculate the motion factor Factor ₁₂ between two sensor frames (which can be the current frame and the previous frame, or the current frame and the next frame). When it is determined that the motion factor of T1 within a certain period of time is within a preset threshold range (such as less than or equal to the threshold), the target frame is determined to be static; when it is determined that the motion factor of T1 within a certain period of time exceeds (such as greater than) the target frame When the threshold value is reached, the target frame is determined to be in motion state. Among them, the threshold value of the motion factor can be taken as 0.5, which is an empirical value, which will be different under different conditions. The value of T1 ranges from 0 seconds to 10 seconds. If you need to focus on the person who is currently moving, as long as T1 is small enough.

Optionally, the image may also be cropped and/or scaled according to the field of view frame. Therefore, referring to FIG. 7 , S250 may specifically include:

S251, when it is determined that all the target frames that can determine the boundary of the field of view frame are stationary, crop and/or scale the sensor frame according to the field of view frame View, and output the cropped and/or scaled frame The field of view frame View _out . Optionally, the sensor frame is cropped and scaled by invoking an ISP (Image Signal Processor, image signal processor) chip.

As shown in Figure 7, crop the sensor frame according to the coordinates of the field of view frame View, and then scale the cropped field of view frame to the size of the current video output resolution (such as 1080P, 720P), and finally output the image that the user sees View _out . Using the ISP chip to process the clipping and scaling process can save about 50% of the CPU compared to the software algorithm processing, which greatly improves the chip performance.

Optionally, since the current visual field frame and the visual field frame of the video calculated in the steps of S230 and S240, there will be a certain difference in the coordinates, therefore, the output video image can also be smoothed. Therefore, S250 specifically Also includes:

S252, when it is determined that all the target frames that can determine the boundary of the field of view frame are stationary, calculate the difference coordinates between the target field of view frame and the current field of view frame;

S253, according to the preset maximum moving step size of the field of view frame of each frame of the image, calculate the number of moving steps from the current field of view frame to the target field of view frame;

S254: Update the field of view frame frame by frame according to the number of moving steps until the target field of view frame is reached.

Referring to FIG. 8, the above video image smoothing process is illustrated by an example. Assuming that the current field of view frame is View _cur , after the calculations in S231 to 236 , it is obtained that the target field of view frame is View _dst . The required moving distance between the current view frame and the target view frame is: View _dist =View _dst -View _cur .

In order to make the user see a smooth image, the field of view frame of each frame of image moves according to a fixed step size to avoid moving too fast. Assuming that the maximum step size of the coordinate value of the view frame is step _max , and the coordinate difference between the current view frame and the target view frame is View _dist = (x ₀ , y ₀ , x ₁ , y ₁ ), then the number of moving steps is:

MoveNum=max{x ₀ , y ₀ , x ₁ , y ₁ }/step _max .

Follow the steps below to update View _cur frame by frame until it reaches the target view frame View _dist :

While View _cur ≠View _dst :

View _step =View _dist /MoveNum

View _cur ←View _cur +View _step

That is, when the coordinates of View _cur and View _dst do not coincide, View _cur is moved, and the view frame updated each time is View _step until the current view frame View _cur reaches the target view frame View _dst .

The cropping and/or scaling processing and the video image smoothing processing in the above S2502 may be used together in practical applications, for example, the cropping and/or scaling processing is performed first, and then the video image smoothing processing is performed.

In the specific embodiment 1 of the present application, the image of the entire conference site is acquired by the sensor, and the human body in the sensor frame is detected to determine the image range that needs to be displayed to the user. And according to comparing the position change of the same target frame in each sensor frame, it is determined whether the picture frame is in a static state. When it is determined that all the people in the venue who have an influence on the output target frame have been in a still state, the visual field frame of the picture including all the human bodies is output and displayed. Since real-time monitoring is required for each sensor frame, even after the participants are seated, for some reason, the positions of the participants have changed. Later, it became very loose, or all the participants moved from the middle of the venue to one side of the venue, that is, the position space occupied by the participants in the venue changed, then according to the specific implementation of the present application The described video processing method can capture this change in real time, and after the participants are seated again, a new field of view frame is recalculated, output and displayed for the user to watch. Because the above method does not need to control the rotation of the camera or refocus, it just recalculates the sensor frame captured by the sensor to obtain a new field of view, and outputs and displays it to the user. Perform automatic, real-time tracking. Also, the apparatus using the method can thus also be a plug-and-play device.

Embodiment 2

Please refer to FIG. 9 , a video processing apparatus 300 provided in Embodiment 2 of the present application, the video processing apparatus includes:

The video acquisition unit 310 is configured to acquire the sensor frame captured by the video sensor, where the sensor frame is the image frame of the entire frame captured by the video sensor; optionally, the video acquisition unit 310 acquires the sensor frame captured by the high-definition wide-angle camera ;

A humanoid capture unit 320, configured to detect a target frame in the sensor frame, where the target frame is a human body image frame and/or an image frame containing a human body in the sensor frame;

The video detection unit 330 is configured to determine a visual field frame according to the target frame; determine all the target frames that can determine the boundary of the visual field frame, and determine whether all the target frames that can determine the boundary of the visual field frame are all Still; wherein, the field of view frame is an image frame including all the target frames;

The image processing unit 340 outputs the view frame when it is determined that all the target frames that can determine the boundary of the view frame are stationary.

Optionally, the video detection unit 330 is specifically configured to, when it is determined that the target frames are static, expand the heights of all the target frames up and down by a certain proportion; The smallest frame included in the target frame is the field of view frame. Please refer to the content of S231 in Embodiment 1 for the specific manner of expanding the height of the target frame at the top and bottom of the target frame by a certain proportion, which will not be repeated here.

Optionally, the video detection unit 330 is further configured to replace the four vertex coordinates of the view frame with the maximum boundary coordinates if the coordinates of the four vertices of the view frame exceed the maximum boundary coordinates of the view frame. and/or, if the height value of the field of view frame is less than the minimum height value of the field of view frame, then adjust the height value of the field of view frame to the minimum height value of the field of view frame; and/or, if the field of view If the width value of the frame is smaller than the minimum width value of the view frame, then adjust the width value of the view frame to the minimum width value of the view frame.

Optionally, the video detection unit 330 is specifically used for:

If the height value of the view frame is smaller than the minimum height value of the view frame, add half of the difference between the minimum width value of the view frame and the width value of the view frame to the view frame If the left or right boundary of the visual field frame after supplementation exceeds the maximum boundary of the visual field frame, the coordinates that exceed the maximum boundary will be replaced by the coordinates of the maximum boundary, and the coordinates that exceed the maximum boundary will be replaced at the same time. The value of the maximum boundary is supplemented to the opposite boundary; and/or,

If the width value of the view frame is smaller than the minimum width value of the view frame, add half of the difference between the minimum height value of the view frame and the height value of the view frame to the view frame If the upper and lower boundaries of the field of view frame after supplementation exceed the maximum boundary of the field of view frame, the coordinates that exceed the maximum boundary will be replaced by the maximum boundary coordinates, and the maximum boundary will be exceeded at the same time. The value of is added to the opposite boundary.

Optionally, the video detection unit 330 is further configured to adjust the width value and/or the height value of the field of view frame according to the aspect ratio of the current video resolution.

For a specific example of adjusting the field of view frame in the second embodiment, please refer to the detailed descriptions in S231 to 236 in the first embodiment, which will not be repeated here.

Optionally, the video detection unit 330, configured to determine the target frame that can determine the boundary of the field of view frame, includes:

The video detection unit 330 is specifically configured to calculate and obtain a first field of view frame according to all target frames; delete one of the target frames; calculate and obtain a second field of view frame according to the remaining target frames; When it is not equal to the second field of view frame, it is determined that the target frame to be deleted is the target frame that can determine the boundary of the field of view frame. Specifically, how the detection unit 330 determines by calculation whether a certain target frame is a target frame that can determine the boundary of the field of view frame, please refer to the description in Embodiment 1, which will not be repeated here.

Optionally, wherein the video detection unit 330 is configured to determine that all the target frames that can determine the boundary of the field of view frame are static, including:

The video detection unit 330 is specifically configured to determine all the target frames that can determine the boundary of the field of view when the motion factors within the preset time interval are all smaller than a preset threshold. The bounding boxes of the target boxes are in a stationary state. Specifically, how the video detection unit 330 determines whether a certain target frame is in a static state through calculation, please refer to the specific description in Embodiment 1, which will not be repeated here.

Optionally, the image processing unit 340 is specifically configured to, when it is determined that all the target frames that can determine the boundary of the field of view frame are stationary, crop and/or crop the sensor frame according to the field of view frame. or zoom, and output the field of view. specific,

Optionally, the image processing unit 340 is specifically configured to, when it is determined that all the target frames that can determine the boundary of the field of view frame are stationary, crop and/or crop the sensor frame according to the field of view frame. or zoom; and calculate the difference coordinates between the target field of view frame and the current field of view frame; according to the preset maximum movement step size of the field of view frame of each frame of image, calculate the movement from the current field of view frame to the target field of view The number of moving steps of the frame; the field of view frame is updated frame by frame according to the number of moving steps until the target field of view frame is reached. Specifically, how the image processing unit 340 gradually updates the current field of view frame until reaching the target field of view frame, please refer to the examples in S252 to S254 in the first embodiment, which will not be repeated here.

The video processing device 300 is a camera device with a built-in video processing function, such as the combination of the camera device 110 and the main processing device 120 in FIG. 1 ; it can also be a display device (such as a computer or an intelligent electronic device) with a built-in video processing function. , such as the combination of the main processing device 120 and the display device 130 in FIG. 1 ; it can also be an electronic device independent of hardware. Not limited in this application.

For the non-detailed parts in the second embodiment, please refer to the same or corresponding parts in the above-mentioned first embodiment, which will not be repeated here.

Embodiment 3

Please refer to FIG. 10 , which is a schematic structural diagram of a video processing apparatus 400 according to Embodiment 3 of the present application. The video processing apparatus 400 includes: a processor 410 , a memory 420 and a communication interface 430 . The processor 410, the memory 420 and the communication interface 430 are connected to each other through a bus system.

The processor 410 may be an independent component, or may be a collective term for multiple processing components. For example, it may be a CPU, an ASIC, or one or more integrated circuits configured to implement the above method, such as at least one microprocessor DSP, or at least one programmable gate FPGA, etc. The memory 420 is a computer-readable storage medium on which programs executable on the processor 410 are stored.

The processor 410 invokes the program in the memory 420 to execute a video processing method provided in the first embodiment, and transmits the result obtained by the processor 410 to other devices through the communication interface 430 in a wireless or wired manner.

Optionally, the video processing apparatus 400 may further include a camera 440 . The camera 440 acquires the sensor frame and sends it to the processing 410, and the processor 410 calls the program in the memory 420, executes the video processing method provided in the first embodiment above, and processes the sensor frame , and transmit the result to other devices through the communication interface 430 in a wireless or wired manner.

For details that are not detailed in the third embodiment, please refer to the same or corresponding parts in the above-mentioned first embodiment, which will not be repeated here.

Those skilled in the art should realize that, in one or more of the above examples, the functions described in the specific embodiments of the present application may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules. The software modules may be stored in a computer-readable storage medium, which may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes one or more available mediums integrated. The available media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, Digital Video Disc (DVD)), or semiconductor media (eg, Solid State Disk (SSD)) )Wait. The computer-readable storage medium includes but is not limited to random access memory (Random Access Memory, RAM), flash memory, read only memory (Read Only Memory, ROM), Erasable Programmable Read Only Memory (Erasable Programmable ROM, EPROM) ), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disks, removable hard disks, compact disks (CD-ROMs), or any other form of storage medium known in the art. An exemplary computer-readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the computer-readable storage medium. Of course, the computer-readable storage medium can also be an integral part of the processor. The processor and computer-readable storage medium may reside in an ASIC. Additionally, the ASIC may reside in access network equipment, target network equipment or core network equipment. Of course, the processor and the computer-readable storage medium may also exist as discrete components in the access network device, the target network device, or the core network device. When implemented in software, it can also be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer or a chip, which may include a processor, all or part of the processes or functions described in the specific embodiments of the present application are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer program instructions may be stored on or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from a website site, computer, server, or computer-readable storage medium. The data center transmits to another website site, computer, server or data center by wired (eg coaxial cable, optical fiber, Digital Subscriber Line, DSL) or wireless (eg infrared, wireless, microwave, etc.).

The above-described embodiments illustrate but do not limit the present invention, and those skilled in the art can design multiple alternative examples within the scope of the claims. It should be appreciated by those skilled in the art that the application is not limited to the precise structures described above and illustrated in the accompanying drawings, but may be applied without departing from the scope of the invention as defined by the appended claims. Appropriate adjustments, modifications, equivalent substitutions, improvements, etc. are made to the specific implementation scheme. Therefore, any modifications and changes made in accordance with the concept and principles of the present invention are within the scope of the present invention as defined by the appended claims.

Claims

A video processing method, characterized in that the method comprises:

acquiring a sensor frame captured by the video sensor, where the sensor frame is an image frame of the entire frame captured by the video sensor;

detecting a target frame in the sensor frame, where the target frame is a human body image frame and/or an image frame containing a human body in the sensor frame;

A field of view frame is determined according to the target frame; wherein, the field of view frame is an image frame including all the target frames;

determining all the target boxes that can determine the boundaries of the field of view, and determining whether all the target boxes that can determine the boundaries of the field of view are stationary;

When it is determined that all the target frames that can determine the boundary of the viewing frame are stationary, the viewing frame is output.
The method of claim 1, wherein the determining a field of view frame according to the target frame comprises:

Expand the height of a certain proportion to the upper and lower parts of all the target frames;

A minimum frame that can contain all the expanded target frames is determined as the field of view frame.
The method of claim 2, wherein the determining a field of view frame according to the target frame further comprises:

If the coordinates of the four vertices of the view frame exceed the maximum boundary coordinates of the view frame, the maximum boundary coordinates are used to replace the coordinates of the four vertices of the view frame; and/or,

If the height value of the view frame is smaller than the minimum height value of the view frame, adjust the height value of the view frame to the minimum height value of the view frame; and/or,

If the width value of the view frame is smaller than the minimum width value of the view frame, the width value of the view frame is adjusted to the minimum width value of the view frame.
The method of claim 3, wherein:

The adjusting the width value of the view frame to the minimum width value of the view frame includes: adding half of the difference between the minimum width value of the view frame and the width value of the view frame to The left and right borders of the field of view frame, if the left border or right border of the field of view frame after supplementation exceeds the maximum border of the field of view frame, then the coordinates that exceed the maximum border are replaced by the coordinates of the maximum border, At the same time, values beyond said maximum boundary are supplemented to the opposite boundary; and/or,

The adjusting the height value of the field of view frame to the minimum height value of the field of view frame includes: adding half of the difference between the minimum height value of the field of view frame and the height value of the field of view frame to The upper and lower boundaries of the visual field frame, if the upper boundary or the lower boundary of the visual field frame after supplementation exceeds the maximum boundary of the visual field frame, then the coordinates that exceed the maximum boundary are replaced by the maximum boundary coordinates, and the coordinates that exceed the maximum boundary will be replaced. The value of the maximum boundary is supplemented to the opposite boundary.
The method according to any one of claims 2 to 4, wherein the determining a field of view frame according to the target frame further comprises:

Adjust the width value and/or the height value of the field of view frame according to the aspect ratio of the current video resolution.
The method of claim 1, wherein the determining the target frame that can determine the boundary of the field of view comprises:

Calculate the first field of view frame according to all target frames;

delete one of said target boxes;

A second field of view frame is obtained by calculating according to the remaining target frame;

When the first view frame and the second view frame are not equal, it is determined that the deleted target frame is the target frame that can determine the boundary of the view frame.
The method according to any one of claims 1 to 4 and 6, wherein the determining that all the target frames that can determine the boundary of the field of view frame are stationary, comprising:

If the motion factor of each target frame that can determine the boundary of the field of view frame within the preset time interval is less than a preset threshold, then determine that all the target frames that can determine the boundary of the field of view frame are static state.
The method according to any one of claims 1 to 4 and 6, wherein, when it is determined that all the target frames that can determine the boundary of the view frame are stationary, outputting the view frame includes:

When it is determined that all the target frames that can determine the boundary of the field of view are stationary, crop and/or zoom the sensor frame according to the field of view, and output the cropped and/or scaled field of view frame.
The method according to any one of claims 1 to 4 and 6, wherein, when it is determined that all the target frames that can determine the boundary of the view frame are stationary, outputting the view frame includes:

When it is determined that all the target frames that can determine the boundary of the view frame are stationary, calculate the difference coordinates between the target view frame and the current view frame;

Calculate the number of moving steps from the current field of view frame to the target field of view frame according to the preset maximum movement step size of the field of view frame of each frame of image;

The field of view frame is updated frame by frame according to the number of moving steps until the target field of view frame is reached.
A video processing device, wherein the device comprises:

a video acquisition unit, configured to acquire a sensor frame captured by the video sensor, where the sensor frame is an image frame of the entire frame captured by the video sensor;

A humanoid capture unit, configured to detect a target frame in the sensor frame, where the target frame is a human body image frame and/or an image frame containing a human body in the sensor frame;

A video detection unit, configured to determine a visual field frame according to the target frame; determine all the target frames that can determine the boundary of the visual field frame, and determine whether all the target frames that can determine the boundary of the visual field frame are still ; wherein, the field of view frame is an image frame including all the target frames;

The image processing unit outputs the view frame when it is determined that all the target frames that can determine the boundary of the view frame are stationary.
The apparatus according to claim 10, wherein the video detection unit is specifically configured to expand the height of all the target frames up and down by a certain proportion when it is determined that the target frames are still; The smallest frame included in all the expanded target frames is the field of view frame.
The apparatus according to claim 11, wherein the video detection unit is further configured to replace the field of view with the maximum boundary coordinates if the coordinates of the four vertices of the field of view frame exceed the maximum boundary coordinates of the field of view frame The coordinates of the four vertices of the frame; and/or, if the height value of the view frame is less than the minimum height value of the view frame, adjust the height value of the view frame to the minimum height value of the view frame; and/ Or, if the width value of the view frame is smaller than the minimum width value of the view frame, the width value of the view frame is adjusted to the minimum width value of the view frame.
The apparatus of claim 12, wherein the video detection unit is specifically configured to:

If the height value of the view frame is smaller than the minimum height value of the view frame, add half of the difference between the minimum width value of the view frame and the width value of the view frame to the view frame If the left or right border of the field of view frame after supplementation exceeds the maximum border of the field of view frame, the coordinates that exceed the maximum border will be replaced by the coordinates of the maximum border, and the coordinates that exceed the maximum border will be replaced at the same time. The value of the maximum boundary is supplemented to the opposite boundary; and/or,

If the width value of the view frame is smaller than the minimum width value of the view frame, add half of the difference between the minimum height value of the view frame and the height value of the view frame to the view frame If the upper and lower boundaries of the visual field frame after supplementation exceed the maximum boundary of the visual field frame, the coordinates that exceed the maximum boundary will be replaced by the maximum boundary coordinates, and the maximum boundary will be exceeded at the same time. The value of is added to the opposite boundary.
The apparatus according to any one of claims 11 to 13, wherein the video detection unit is further configured to adjust the width value and/or the height value of the field of view frame according to the aspect ratio of the current video resolution.
The apparatus of claim 10, wherein the video detection unit, for determining the target frame that can determine the boundary of the field of view frame, comprises:

The video detection unit is specifically configured to calculate and obtain a first field of view frame according to all target frames; delete one of the target frames; calculate and obtain a second field of view frame according to the remaining target frames; When the second field of view frames are not equal, it is determined that the target frame to be deleted is the target frame that can determine the boundary of the field of view frame.
The device according to any one of claims 10 to 13 and 15, wherein the video detection unit is configured to determine that all the target frames that can determine the boundary of the field of view frame are still, including:

The video detection unit is specifically configured to determine, when the motion factors of each target frame that can determine the boundary of the field of view frame within a preset time interval are less than a preset threshold, determine all the target frames that can determine the field of view frame. The bounding boxes are all stationary.
The device according to any one of claims 10 to 13 and 15, wherein the image processing unit is specifically configured to, when it is determined that all the target frames that can determine the boundary of the visual field frame are stationary, according to the The field of view frame crops and/or scales the sensor frame, and outputs the field of view frame.
The apparatus according to claim 17, wherein the image processing unit is specifically configured to, when it is determined that all the target frames that can determine the boundary of the field of view frame are stationary, analyze the sensor frame according to the field of view frame Crop and/or zoom; and calculate the difference coordinate between the target field of view frame and the current field of view frame; The number of moving steps of the target field of view frame; the field of view frame is updated frame by frame according to the number of moving steps until the target field of view frame is reached.