CN113849687A - Video processing method and device - Google Patents

Video processing method and device

Info

Publication number
CN113849687A
CN113849687A
Authority
CN
China
Prior art keywords
video
target
video frame
target object
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011319647.2A
Other languages
Chinese (zh)
Other versions
CN113849687B (en)
Inventor
张志强
王莽
唐铭谦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202011319647.2A
Publication of CN113849687A
Application granted
Publication of CN113849687B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/73: Querying
    • G06F 16/732: Query formulation
    • G06F 16/7328: Query by example, e.g. a complete video frame or video sequence
    • G06F 16/738: Presentation of query results
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval using metadata automatically derived from the content
    • G06F 16/7837: Retrieval using objects detected or recognised in the video content
    • G06F 16/784: Retrieval where the detected or recognised objects are people
    • G06F 16/7844: Retrieval using original textual content or text extracted from visual content or transcript of audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present specification provide a video processing method and a video processing apparatus, wherein the video processing method includes the following steps: detecting a first object and a second object in a first video frame of a video to be processed; determining first position information of the first object in the first video frame and second position information of the second object in the first video frame; determining a target object based on the first position information and the second position information; and tracking the target object to determine a target video frame containing the target object in the video to be processed.

Description

Video processing method and device
Technical Field
Embodiments of the present specification relate to the field of computer technology, and in particular to a video processing method. One or more embodiments of the present specification further relate to a video processing apparatus, a computing device, and a computer-readable storage medium.
Background
With the development of e-commerce, traditional online stores have shown shortcomings such as over-beautified product photos, incomplete product information, and weak social interaction, so users cannot truly understand a product from the way traditional e-commerce introduces it. With the continued rise of live video, a brand-new e-commerce mode has gradually formed: live-streaming e-commerce, that is, selling products by presenting them in a live broadcast. In a live video, users can intuitively see a product displayed from multiple angles and gain a relatively clear understanding of how it looks in use.
However, a single live video may introduce many products, and while one product is being introduced, many other products are often visible in the frame at the same time, which interferes with the identification and localization of the product currently being presented, whether by a user or by a computer.
It is therefore desirable to provide a video processing method that can solve the above technical problems.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a video processing method. One or more embodiments of the present disclosure are directed to a video processing apparatus, a computing device, and a computer-readable storage medium, which solve the technical problems of the prior art.
In a first aspect of embodiments of the present specification, there is provided a video processing method, including:
detecting a first object and a second object in a first video frame of a video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining a target object based on the first location information and the second location information;
and tracking the target object to determine a target video frame containing the target object in the video to be processed.
According to a second aspect of embodiments herein, there is provided a video processing method including:
displaying a video input interface for a user based on a video processing request of the user;
receiving a video to be processed sent by the user based on the video input interface;
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining a target object based on the first location information and the second location information;
and tracking the target object to determine a target video frame containing the target object in the video to be processed and returning the target video frame to the user.
According to a third aspect of embodiments herein, there is provided a video processing method including:
receiving a video processing request sent by a user, wherein the video processing request carries a video to be processed;
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining a target object based on the first location information and the second location information;
and tracking the target object to determine a target video frame containing the target object in the video to be processed and returning the target video frame to the user.
According to a fourth aspect of embodiments herein, there is provided a video processing method including:
determining a query keyword based on query data input by a user;
determining a to-be-processed video corresponding to the query keyword according to the query keyword;
determining a target object according to first position information of a first object and second position information of a second object in a first video frame of the video to be processed;
and tracking the target object to determine a target video frame containing the target object in the video to be processed and display the target video frame to the user.
In a fifth aspect of embodiments of the present specification, there is provided a video processing apparatus including:
a first object detection module configured to detect a first object and a second object in a first video frame of a video to be processed;
a first position determining module configured to determine first position information of the first object in the first video frame and second position information of the second object in the first video frame;
a first object determining module configured to determine a target object based on the first position information and the second position information;
and a first video frame determining module configured to track the target object to determine a target video frame containing the target object in the video to be processed.
In a sixth aspect of embodiments of the present specification, there is provided a video processing apparatus including:
an interface presentation module configured to present a video input interface for a user based on a video processing request of the user;
a video receiving module configured to receive a video to be processed sent by the user based on the video input interface;
a second object detection module configured to detect a first object and a second object in a first video frame of the video to be processed;
a second position determining module configured to determine first position information of the first object in the first video frame and second position information of the second object in the first video frame;
a second object determining module configured to determine a target object based on the first position information and the second position information;
and a second video frame determining module configured to track the target object to determine a target video frame containing the target object in the video to be processed and return the target video frame to the user.
A seventh aspect of the embodiments of the present specification provides a video processing apparatus, including:
the device comprises a receiving request module, a processing request module and a processing module, wherein the receiving request module is configured to receive a video processing request sent by a user, and the video processing request carries a video to be processed;
a third object detection module configured to detect a first object and a second object in a first video frame of the video to be processed;
a third determine location module configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a third determined object module configured to determine a target object based on the first location information and the second location information;
and the third video frame determining module is configured to track the target object so as to determine a target video frame containing the target object in the video to be processed and return the target video frame to the user.
In an eighth aspect of embodiments of the present specification, there is provided a video processing apparatus including:
a keyword determination module configured to determine a query keyword based on query data input by a user;
a video determining module configured to determine a video to be processed corresponding to the query keyword according to the query keyword;
a fourth object determining module configured to determine a target object according to first position information of a first object and second position information of a second object in a first video frame of the video to be processed;
and a fourth video frame determining module configured to track the target object to determine a target video frame containing the target object in the video to be processed and display the target video frame to the user.
According to a ninth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, which when executed by the processor, implement the steps of the video processing method.
According to a tenth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the video processing method.
The present specification provides a video processing method, including: detecting a first object and a second object in a first video frame of a video to be processed; determining first position information of the first object in the first video frame and second position information of the second object in the first video frame; determining a target object based on the first position information and the second position information; and tracking the target object to determine a target video frame containing the target object in the video to be processed. In this video processing method, the target object is determined from the relative positions of the first object and the second object in the first video frame, which improves both the efficiency and the accuracy of determining the target object. On the basis of the determined target object, only the target object is tracked: recognition of the other objects in the video to be processed is skipped during tracking, which improves tracking efficiency.
Drawings
Fig. 1 is an exemplary diagram of a specific application scenario of a video processing method according to an embodiment of the present specification;
fig. 2 is a processing flow diagram of a first video processing method provided in one embodiment of the present specification;
fig. 3 is a structural diagram of a human body key point algorithm of a video processing method according to an embodiment of the present specification;
FIG. 4 is a block diagram of a tracking algorithm for a video processing method provided in one embodiment of the present description;
fig. 5 is a processing flow diagram of a video processing method applied to a commercial product live video scene according to an embodiment of the present specification;
fig. 6 is a flowchart of a second video processing method according to an embodiment of the present disclosure;
fig. 7 is a flowchart of a third video processing method provided in an embodiment of the present specification;
fig. 8 is a flowchart of a fourth video processing method according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a first video processing apparatus provided in an embodiment of the present specification;
fig. 10 is a schematic diagram of a second video processing apparatus provided in an embodiment of the present specification;
fig. 11 is a schematic diagram of a third video processing apparatus provided in an embodiment of the present specification;
fig. 12 is a schematic diagram of a fourth video processing apparatus provided in an embodiment of the present specification;
fig. 13 is a block diagram of a computing device according to an embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. However, the present specification can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from its substance; the present specification is therefore not limited by the specific embodiments disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of the present specification, "first" may also be referred to as "second", and similarly "second" may be referred to as "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the noun terms to which one or more embodiments of the present specification relate are explained.
Anchor: also called a "host-type announcer". A simple way to put the distinction is that an announcer reads what others have written, while a host speaks in his or her own words; an anchor combines the two roles. For example, a person who displays and introduces goods by live broadcast in order to sell them may be called an anchor.
Cooperative detection: performing detection on a given target over a time sequence and filtering out useless detections.
In this specification, a video processing method is provided, and one or more embodiments of the specification relate to a video processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to fig. 1, fig. 1 illustrates an exemplary view of a specific application scenario of a video processing method according to an embodiment of the present specification.
The application scene of fig. 1 includes an anchor, commodities, and a product pocket (the on-screen list of the items for sale in the live room, literally the "baby pocket"). The anchor, the commodities, and the product pocket are display objects in the live commodity video, and a plurality of commodities are displayed in the live video, such as commodity 1, commodity 2, commodity 3, and commodity 4; during the live broadcast, however, the anchor mainly introduces one commodity in any given time period, that is, the main commodity.
Specifically, in the process of processing the live commodity video, the anchor in the video is taken as the human body, human-centered detection is performed, and the skeletal key points of the human body are tracked. According to the distance between the skeletal key points and each commodity, commodity 1, the commodity closest to the skeletal key points, is determined to be the main commodity the anchor is introducing. Commodity 1 is then tracked in the live video, the video frames containing it are determined, and a video clip containing commodity 1 is generated.
After commodity 1 is determined to be the main commodity, or after the video frames containing commodity 1 are determined, commodity 1 can be compared with the main image, auxiliary images, review images, and the like of each commodity in the product pocket, where these images belong to the display interface or page presenting commodity information (including the commodity's attribute information, purchase information, and so on) in the live video. Commodity 1 is searched for in the product pocket so as to obtain its attribute information, determine its detailed commodity information, and review the generated video clip.
In addition, after the video frames containing commodity 1 have been determined, some highlight frames (such as detail displays of the front, side, and back of commodity 1) can be selected from them, according to how commodity 1 is displayed, for secondary processing, generating a highlight video clip for commodity 1.
According to the video processing method provided by the embodiment of this specification, the main commodity currently being presented is determined by detecting the distance between each commodity and the skeletal key points of the human body, and the main commodity is then tracked so as to determine the video frames containing it. Determining the main commodity from the relative position of the commodities and the skeletal key points is convenient; recognition of the other commodities in the live video is skipped and only the main commodity is tracked, which improves tracking efficiency, while generating a video clip for the main commodity increases review efficiency and improves the viewing experience.
Referring to fig. 2, fig. 2 shows a processing flow chart of a first video processing method provided in an embodiment of the present specification, which specifically includes the following steps:
step 202: a first object and a second object in a first video frame of a video to be processed are detected.
The video to be processed comprises but is not limited to live broadcast video, television video, movie video, animation video, entertainment video and the like; the first video frame may be any one of video frames in the video to be processed, or may also be a specific video frame in the video to be processed, such as a first frame of the video to be processed, or a video frame corresponding to a first occurrence of an object or a person in the video to be processed, which is not limited herein.
Correspondingly, the first object may be at least one article or commodity contained in the first video frame, such as cosmetics, electronic products, clothes, daily necessities, and the like; the second object may be a key point of a human body, or a part of a human body (for example, a hand, a head, a waist, etc.) included in the first video frame.
In practical application, when the first object is a commodity, the second object may be a hand key point, a human body, a hand, or the like, where the human body may be understood as an anchor, the hand key points as the anchor's hand key points, and the hand as the anchor's hand; the video to be processed may then be a live video containing the commodity and the anchor.
In specific implementation, once what the first object and the second object are has been determined, there are various methods for detecting (i.e., recognizing) them in the first video frame: for example, recognizing the first object or the second object with an object recognition model pre-trained for it, or detecting the first object or the second object in the first video frame through key feature matching, bounding-box detection, and the like, which is not limited herein.
Further, in an optional implementation manner provided by the embodiment of the present specification, when the first object is a commodity, the second object is a key point of a hand, and the video to be processed is a live video, the detecting the first object and the second object in the first video frame of the video to be processed is specifically implemented in the following manner:
carrying out article detection on the first video frame through a first detection algorithm to obtain the commodity;
detecting key points of the human body in the first video frame through a second detection algorithm to obtain human body key points of the human body;
determining the hand keypoints of the human keypoints.
The first detection algorithm refers to a detection algorithm for detecting the commodities in a video frame, such as an article feature detection algorithm, an article recognition algorithm, or a bounding-box detection algorithm. The second detection algorithm refers to a detection algorithm for detecting preset key points on a human body, such as a human key point detection algorithm.
The human body key points refer to key points corresponding to important limbs or bones in a human body, such as head key points, neck key points, shoulder key points, elbow key points, hand key points (e.g., wrist key points, palm key points), and the like. In the process of detecting the key points, a human body in the first video frame needs to be detected, and the key points corresponding to each part of the human body are detected by taking the human body as the center.
In the embodiment of the present specification, the commodity and the hand key points contained in the first video frame (namely, the areas where they are located) are detected through the detection algorithms above, so that the commodity and the hand key points are detected accurately and the detection efficiency is improved.
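As a concrete illustration of this detection step, the Python sketch below passes one frame through an item detector and a pose estimator. Both model objects and the key-point names are assumptions for illustration, since the patent does not name any particular model.

```python
from typing import Dict, List, Tuple

import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def detect_first_and_second_objects(
    frame: np.ndarray, item_detector, pose_estimator
) -> Tuple[List[Box], Dict[str, Tuple[float, float]]]:
    """Detect commodities (first objects) and hand key points (second object)
    in one video frame. `item_detector` and `pose_estimator` are hypothetical
    pre-trained model wrappers standing in for the first and second detection
    algorithms."""
    commodity_boxes: List[Box] = item_detector.detect(frame)
    # The pose estimator is assumed to return {keypoint_name: (x, y)}.
    keypoints: Dict[str, Tuple[float, float]] = pose_estimator.estimate(frame)
    # Keep only the hand key points among all human key points.
    hand_keypoints = {
        name: xy for name, xy in keypoints.items()
        if name in ("left_wrist", "right_wrist")
    }
    return commodity_boxes, hand_keypoints
```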
For convenience of understanding, in the embodiments of the present specification, the first object is taken as a commodity, the second object is taken as a hand key point, and the video to be processed is taken as a live video, and the video processing method is described in detail.
Step 204: first position information of the first object in the first video frame is determined, and second position information of the second object in the first video frame is determined.
The first position information may be understood as coordinate information of the first object in the first video frame, for example [X1, Y1]; the second position information may be understood as coordinate information of the second object in the first video frame, for example [X2, Y2].
In a specific implementation, the first object in the first video frame may be located through the first detection algorithm, that is, coordinate information of each first object in the first video frame is determined, and the second object in the first video frame may be located through the second detection algorithm, that is, coordinate information of the second object in the first video frame is determined.
In a live commodity scene, a conventional vision-based commodity detection scheme runs into difficulty: because there are thousands of commodity categories, directly using a detector with a fixed category set is clearly unreasonable. Usually, therefore, a detector with a few coarse categories is used to detect all commodities over the time sequence (frames may be extracted at variable intervals), and features of the corresponding detection boxes are then extracted and searched against a commodity gallery to finally obtain each commodity's category; information from other modalities, such as speech and text, can of course also be combined to obtain the category. This approach has obvious defects. First, video information is highly redundant; even using an extracted frame sequence, many targets must be detected and recognized, so recognition efficiency is very low and locating the main commodity takes a long time. Second, since the anchor sells one product in a given time period while several other commodities remain visible, the method becomes even less efficient and also produces many false recognitions. Third, the scheme places high demands on the performance and stability of the detection and recognition models.
In practical application, when the second detection algorithm is a human key point detection algorithm, a fast, well-performing human key point algorithm (as measured on public data sets) can be used. Referring to fig. 3, fig. 3 shows a structural diagram of the human key point algorithm of the video processing method provided in an embodiment of the present specification.
As shown in fig. 3, in the process of detecting human key points, the first video frame is input into a key point detection network model, which detects the human key points contained in the frame, obtains all the key points of the human body, determines the hand key points among them, and further locates the hand key points.
The key point detection network model may be a high-resolution feature fusion backbone network: the first video frame is input into the network, which maintains a high-resolution feature map throughout, gradually introduces low-resolution convolutions, and connects convolutions of different resolutions in parallel. Meanwhile, information is continuously exchanged among the multi-resolution representations (that is, features are fused), which improves the expressive power of both the high-resolution and the low-resolution representations and lets them reinforce each other, so that the human key points are recognized better, the hand key points among them are determined, and the coordinate information corresponding to the hand key points (i.e., the second position information) is obtained.
In addition, the key point detection network model may also be a regression network based on Gaussian-kernel heatmaps: the first video frame containing the human body is input into the network, a Gaussian heatmap is predicted for each human key point, and the hand key points are then located from their Gaussian heatmaps to determine the corresponding coordinate information (i.e., the second position information).
Further, in order to recognize and locate human key points more accurately, the high-resolution feature fusion backbone network and the Gaussian-kernel heatmap regression network can be combined to detect and locate the human key points jointly; data augmentation is also applied at the detection stage (i.e., the prediction and inference stage), which further strengthens the robustness of the human key point detection algorithm.
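To make the heatmap branch concrete, the sketch below decodes a key-point coordinate from a predicted Gaussian heatmap by taking its arg-max; the (num_keypoints, H, W) tensor layout is an assumption, as the patent does not fix one.

```python
import numpy as np

def decode_keypoints(heatmaps: np.ndarray, frame_w: int, frame_h: int):
    """Decode (x, y) frame coordinates from per-keypoint Gaussian heatmaps
    of shape (num_keypoints, H, W), one channel per human key point."""
    num_kp, hm_h, hm_w = heatmaps.shape
    coords = []
    for k in range(num_kp):
        # The peak of the Gaussian heatmap is taken as the key-point location.
        idx = np.argmax(heatmaps[k])
        y, x = np.unravel_index(idx, (hm_h, hm_w))
        # Map heatmap coordinates back to frame coordinates.
        coords.append((x * frame_w / hm_w, y * frame_h / hm_h))
    return coords
```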
Step 206: determining a target object based on the first location information and the second location information.
The target object refers to a specific object among the first objects; in a live video scene, the target object may be understood as the main commodity the anchor is introducing.
In practical applications, there are various ways to determine the target object based on the first position information and the second position information: for example, the first object closest to the second object may be determined as the target object, or the first object above or below the second object may be determined as the target object, which is not limited herein.
In a specific implementation, in an optional implementation manner provided by an embodiment of this specification, the determining a target object based on the first position information and the second position information is specifically implemented by:
determining a distance between the first object and the second object based on the first location information and the second location information;
determining the first object as the target object if the distance is less than a preset distance threshold.
The preset distance threshold refers to a preset distance value; a first object whose distance from the second object is within this value is determined to be the target object.
When there are a plurality of first objects, the distance between each first object and the second object needs to be calculated from that first object's first position information and the second position information, and the first object whose distance from the second object is smaller than the preset distance threshold is determined as the target object.
In practical applications, several candidate target objects may be determined this way; in that case, other multi-modal information in the video to be processed, such as voice information or text information, may be combined to select a single target object among the candidates.
According to the embodiment of the specification, the target object is determined based on the relative position of the first object and the second object, the determination process for the target object is simplified, and the accuracy and the efficiency of identifying the target object are improved.
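A minimal sketch of this selection rule follows, assuming axis-aligned commodity boxes and a single hand key point; taking the Euclidean distance from the box centre to the key point is one reasonable reading of "the distance between the first object and the second object", which the patent leaves unspecified.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def pick_target_object(
    commodity_boxes: List[Box],
    hand_xy: Tuple[float, float],
    dist_threshold: float,
) -> Optional[int]:
    """Return the index of the commodity closest to the hand key point,
    provided its distance is below the preset threshold; otherwise None."""
    best_idx, best_dist = None, float("inf")
    for i, (x1, y1, x2, y2) in enumerate(commodity_boxes):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # box centre
        d = math.hypot(cx - hand_xy[0], cy - hand_xy[1])
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx if best_dist < dist_threshold else None
```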
Step 208: and tracking the target object to determine a target video frame containing the target object in the video to be processed.
The tracking of the target object can be understood as detecting a video frame in the video to be processed according to the playing time sequence of the video to be processed, so as to track the target object in the video frame, and determine the video frame containing the target object, that is, the target video frame.
In specific implementation, in an optional implementation manner provided by this embodiment of the present specification, the first video frame is an ith video frame;
correspondingly, the target object is tracked to determine a target video frame containing the target object in the video to be processed, and the following method is specifically adopted:
determining a first image area according to the target object in the ith video frame, and determining a second image area in the (i+1)th video frame based on the first image area;
performing feature extraction on the first image area to obtain a first feature, and performing feature extraction on the second image area to obtain a second feature;
performing similarity calculation on the first feature and the second feature, and taking the ith video frame and the (i+1)th video frame as target video frames under the condition that the calculation result is greater than or equal to a similarity threshold;
and increasing i by 1, and continuing to perform the step of determining a first image area according to the target object in the ith video frame.
In a specific implementation, the video to be processed contains n video frames, where i ∈ [1, n] and i and n are positive integers. Taking i = 1 as an example, as shown in fig. 4, a tracking algorithm is adopted: according to the target position information of the target object in the first video frame (i.e., the previous frame), a first image region (i.e., the tracking region, which can be understood as what is to be tracked) is cropped from the first video frame, and a second image region (i.e., the search region, which can be understood as the region in which the target object is searched for) is cropped from the second video frame. Feature extraction is then performed on the first image region through convolutional layers to obtain a first feature, and on the second image region to obtain a second feature; the extracted first and second features are compared through a fully connected layer to determine the predicted position (predicted coordinate information) of the target object in the second image region. Finally, i is increased by 1, making i = 2, and the step of determining a first image region according to the target object in the second video frame continues (that is, the first image region is determined from the predicted position of the target object).
The first image region refers to the image region containing the target object in the ith video frame; the second image region is the image region obtained by enlarging the region corresponding to the first image region in the (i+1)th video frame according to a preset proportion (for example, 10%) or a preset margin (for example, extending it by 1 cm).
Correspondingly, the first feature refers to an image feature including the target object in the first image region, and the second feature refers to an image feature included in the second image region, and in particular, in implementation, the similarity calculation (i.e., feature comparison) is performed on the first feature and the second feature to determine whether the target object is included in the second image region.
The calculation result refers to a similarity value between the first feature and the second feature obtained by feature comparison, and specifically, the calculation result may be in a form of a numerical value or a percentage, which is not limited herein; the similarity threshold refers to a preset similarity value used for determining whether the first feature is similar to the second feature (i.e., whether the second feature includes the target object), and specifically, the similarity threshold may be set according to practical experience, which is not limited herein.
When the calculation result is greater than or equal to the similarity threshold, the second image region contains the target object; the ith and the (i+1)th video frames are both taken as target video frames containing the target object, and detection continues with the frame after the (i+1)th video frame to determine whether it contains the target object.
It should be noted that the tracking method adopted in the embodiments of the present specification is based on feature-similarity regression. Its tracking speed for the target object is high, some 30 to 40 times that of comparable detection approaches, so the whole tracking process also takes far less time than detecting the target commodity in every video frame.
In addition, feature comparison is performed on pixel blocks (image regions) of the previous and the subsequent frame, the coordinates of the target object are regressed, and single-target cooperative detection is realized, which balances speed and precision better than other methods.
In the embodiment of the present specification, the target object is tracked by comparing image features of specific regions in two video frames, making full use of the temporal continuity of visual information. Compared with detecting and recognizing all commodities in every frame, only a fixed number of commodities (namely, the target commodity) are extracted from the whole tracking sequence for recognition. This greatly reduces the number of objects to detect and recognize, greatly increases speed, alleviates the false recognitions introduced by the recognition model, weakens the dependence on detection through tracking, and improves the tracking efficiency of the target object.
Besides the case above in which the calculation result is greater than or equal to the similarity threshold, the calculation result may also be smaller than the similarity threshold. In an optional implementation manner provided by an embodiment of the present specification, the ith video frame is taken as the target video frame when the calculation result is smaller than the similarity threshold.
Specifically, when the calculation result is smaller than the similarity threshold, it indicates that the second image region does not contain the target object, that is, the (i+1)th video frame does not contain the target object, and the (i+1)th video frame is not taken as a target video frame. In this way the target object is tracked and the target video frames containing it in the video to be processed are determined.
It should be noted that when the (i+1)th video frame does not contain the target object, tracking of the target object may be terminated; alternatively, several video frames after the (i+1)th video frame may be checked again to avoid missing frames that do contain the target object, and if no such frame is found among them, tracking of the target object can be considered complete.
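Putting the pieces together, a simplified sketch of this tracking loop might look as follows; `extract_features`, `similarity`, and `locate` are assumed callables standing for the convolutional feature extractor, the feature comparison, and the position regression described above, and the 10% search-region expansion reuses the example value given earlier.

```python
def expand_box(box, frame_w, frame_h, ratio=0.10):
    """Enlarge an (x1, y1, x2, y2) box by `ratio` per side, clipped to the frame."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, x1 - dw), max(0, y1 - dh),
            min(frame_w, x2 + dw), min(frame_h, y2 + dh))

def crop(frame, box):
    x1, y1, x2, y2 = map(int, box)
    return frame[y1:y2, x1:x2]

def track_target(frames, init_box, extract_features, similarity, locate,
                 sim_threshold=0.5):
    """Track the target through `frames`; return indices of target video frames.

    `extract_features` maps an image crop to a feature vector, `similarity`
    scores two feature vectors in [0, 1], and `locate` regresses the target
    box inside the search region. All three are assumptions, not patent text.
    """
    h, w = frames[0].shape[:2]
    box = init_box
    target_frames = [0]  # the first video frame contains the target by construction
    for i in range(len(frames) - 1):
        search_box = expand_box(box, w, h)                      # second image region
        f1 = extract_features(crop(frames[i], box))             # first feature
        f2 = extract_features(crop(frames[i + 1], search_box))  # second feature
        if similarity(f1, f2) >= sim_threshold:
            target_frames.append(i + 1)
            box = locate(frames[i + 1], search_box)             # predicted position
        else:
            # Target lost; per the embodiment, tracking may terminate here
            # or a few subsequent frames may be re-checked first.
            break
    return target_frames
```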
Further, during the above tracking process, tracking of the target object may be interrupted, or the tracking error may accumulate over time. In such cases the tracked target object needs to be corrected during tracking, so as to recover the tracking, reduce the errors arising in the tracking process, and guarantee the continuity and accuracy of tracking. In an optional implementation manner provided in the embodiment of this specification, the video processing method further includes:
recording the tracking start time of the target object for the jth time, and detecting a first object and a second object in an ith video frame of a video to be processed under the condition that the time interval between the current time and the tracking start time of the jth time is greater than or equal to a preset time threshold, wherein the ith video frame is a video frame tracked by the current time;
determining first position information of a first object in the ith video frame and determining second position information of a second object in the ith video frame;
determining a target object in the ith video frame based on first position information of a first object in the ith video frame and second position information of a second object in the ith video frame;
and performing coincidence calculation according to the target position information of the target object in the ith video frame and the area position information of the first image area, taking the image area corresponding to the target position information as the first image area under the condition that the coincidence calculation result is smaller than a coincidence threshold, increasing the number j by 1, and recording the tracking start time of the target object for the jth time.
The current time can be the current time of the computing device, and the current time is incremented according to time units such as seconds or milliseconds; the preset time threshold refers to a preset time interval, such as 0.5 second, and specifically, the value may be set according to the actual scene needs and the tracking experience, which is not limited herein.
The j-th tracking start time includes a time when the target object is initially tracked and a time when the target object is re-tracked after being corrected.
In a specific implementation, the tracking process of the target object includes recording the tracking start time n times, where j ∈ [1, n] and j, n are positive integers. Taking i = 1 and j = 1 as an example: before or after the step of determining the first image region according to the target object in the first video frame, the 1st tracking start time is recorded. When the interval between the current time and the recorded 1st tracking start time is detected to be greater than or equal to the preset time threshold, that is, when a specific amount of time has passed since the system time recorded when tracking started, the target object in the ith video frame tracked at the current time is re-determined (the manner of determining the target object in the ith video frame is similar to that of determining the target object in the first video frame in steps 202 to 206 above, which may be referred to).
After the target object in the ith video frame is determined, its target position information is compared with the region position information of the first image region determined for the ith video frame during tracking (for example, by calculating the degree of coincidence between the image regions corresponding to the two pieces of position information). When the coincidence calculation result (such as a coincidence value or percentage) is greater than or equal to the coincidence threshold (a threshold preset for judging whether the regions coincide), it indicates that the tracking of the target object is accurate, and tracking continues with the image region corresponding to the target object detected in the ith video frame taken as the first image region (i.e., the process returns to determining the second image region in the (i+1)th video frame based on the first image region), while the 2nd tracking start time is recorded. In this way, the target position information of the target object (namely, the image region where it is located) is corrected at specific intervals, improving the accuracy of tracking.
In addition, during tracking, at preset time intervals, the first image region currently considered to contain the target object is compared by feature similarity with the image region corresponding to the target object in the first video frame, so as to determine whether tracking has been interrupted or has drifted. If the feature similarity is smaller than the preset similarity threshold, tracking of the target object has been interrupted or has drifted, and the target object in the current video frame is re-determined; if the feature similarity is greater than or equal to the preset similarity threshold, the tracking process is accurate and no operation is required.
After the above-mentioned coincidence degree calculation is performed, there is a case where the coincidence degree calculation result is equal to or greater than the coincidence degree threshold, specifically:
and under the condition that the coincidence degree calculation result is greater than or equal to the coincidence degree threshold value, taking a target object in the ith video frame as a first target object, and tracking the first target object to determine a target video frame containing the first target object in the video to be processed.
Specifically, the first target object may be the same object as the target object in the first video frame, or a different one, which is not limited herein. In practical application, when the coincidence calculation result is greater than or equal to the coincidence threshold, features of the target object in the ith video frame may further be compared with those of the target object in the first video frame to judge whether they are the same object. If they are, tracking of the target object in the first video frame has not ended: the first target object and the target object are the same object, and the target video frames containing the first target object are also taken as target video frames containing the target object. If they are not, tracking of the target object in the first video frame has ended: the first target object is a different object, and the target video frames containing it form a video clip belonging to the first target object.
Then the target object in the ith video frame is taken as the first target object, and tracking of the first target object is performed to determine the target video frames containing it in the video to be processed; if tracking of the original target object has been lost or interrupted, tracking restarts from the target object in the ith video frame (that is, the first target object). The specific implementation of tracking the first target object is the same as that of tracking the target object in the first video frame described above and is not repeated here.
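The "coincidence calculation" above is naturally an overlap measure between two boxes; the sketch below assumes intersection-over-union (IoU), which the patent does not mandate, together with the correction rule of this embodiment.

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used here as
    one possible 'coincidence' measure between the detected target position
    and the tracked first image region."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union > 0 else 0.0

def maybe_correct(tracked_box, detected_box, coincidence_threshold=0.5):
    """Correction rule from the embodiment: if overlap falls below the
    coincidence threshold, re-seed the tracker with the detected position."""
    if iou(tracked_box, detected_box) < coincidence_threshold:
        return detected_box
    return tracked_box
```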
In addition, on the basis of determining the target video frames containing the target object, a target video clip containing the target object can be generated by segmenting the video to be processed, which saves the user from spending time viewing other information in the video and improves the user's viewing experience of and viewing efficiency for the target object. In an optional implementation manner, the video processing method further includes:
Sequencing the target video frames according to the playing sequence in the video to be processed;
and taking the target video frame arranged at the first position as an initial frame, taking the target video frame arranged at the last position as an end frame, and segmenting the video to be processed to obtain a target video segment of the target object.
Specifically, the target video frames are sequenced according to a playing sequence, that is, a playing sequence of the target video frames in the video to be processed is determined, the target video frame played first (the target video frame arranged at the first position) is used as an initial frame, the target video frame played last (the target video frame arranged at the last position) is used as an end frame, and the video to be processed is segmented to obtain a target video segment for the target object.
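A sketch of this segmentation step, under the assumption that target video frames are identified by their indices in the decoded frame sequence and that OpenCV handles video I/O; the output codec is illustrative.

```python
import cv2

def cut_target_clip(video_path: str, out_path: str, target_frame_indices):
    """Cut the sub-video spanning the first to the last target video frame."""
    ordered = sorted(target_frame_indices)   # play-order sorting
    start, end = ordered[0], ordered[-1]     # initial frame, end frame
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok or idx > end:
            break
        if idx >= start:
            writer.write(frame)  # keep frames between start and end frame
        idx += 1
    cap.release()
    writer.release()
```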
In an optional implementation manner provided by the embodiment of this specification, after obtaining the target video segment of the target object, the method further includes:
extracting a first target video frame in the target video clip according to a preset extraction rule;
comparing the characteristics of the first target video frame with preset object images in a preset object library to judge whether the target object is a preset object in the preset object library;
and determining that the target video clip is successfully audited under the condition that the target object is a preset object in the preset object library.
The preset extraction rule comprises random extraction, extraction with a specific number of video frames as frame extraction intervals, first frame extraction, last frame extraction, and/or extraction of video frames with certain characteristics.
The preset object library is a storage space for information about preset objects, or a display page for them, such as the product pocket in a live video scene. In practical applications, preset object images (such as a front view, side view, or review image of the preset object) may be stored in the preset object library. The image region of the target object is compared with the preset object images by feature comparison: if the features are similar (for example, the similarity obtained by the comparison is greater than or equal to a preset similarity), the review of the target video clip is judged successful; if the features are not similar (the similarity is less than the preset similarity), the review fails, in which case the target object may be flagged or a non-preset-object reminder may be issued, which is not limited herein.
The preset object may be understood as an object preset to be displayed in the video to be processed, for example, a preset commodity introduced during a live broadcast (that is, a live commodity provided by a merchant).
In the embodiment of the present specification, after the target video clip is obtained, frames are extracted from it for review to judge whether the target object in the clip is a preset object in the preset object library, preventing subsequent video processing of non-preset objects. In addition, handling the review result accordingly (such as by notification or warning) can effectively deter the recording of non-preset objects in subsequent video recording.
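A compact sketch of this frame-sampling review, assuming an image-embedding function `embed` and cosine similarity as the feature comparison; both are stand-ins, since the patent only requires some form of feature comparison against the preset object images.

```python
import numpy as np

def audit_clip(clip_frames, preset_images, embed, sim_threshold=0.8,
               sample_every=30):
    """Sample frames from the target clip and check the target object against
    the preset object library by feature similarity."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    sampled = clip_frames[::sample_every]        # one preset extraction rule
    gallery = [embed(img) for img in preset_images]
    for frame in sampled:
        f = embed(frame)
        if any(cosine(f, g) >= sim_threshold for g in gallery):
            return True   # review succeeds: target object is a preset object
    return False          # review fails: flag the clip or issue a reminder
```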
In addition, in a case that the target object is a preset object in the preset object library, in an optional implementation manner provided in an embodiment of this specification, the method further includes:
determining attribute information of the target object;
and auditing the target video frame based on the attribute information of the target object.
The attribute information may be understood as information such as a name, a size, a shape, a color, or a power of a preset object (i.e., a target object).
It should be noted that while live-broadcast e-commerce brings convenience to people, a great deal of false information and counterfeit commodities also appears during live broadcasts, and in huge numbers. The commodities introduced during the live broadcast and their related information are therefore audited based on the attribute information of the preset commodities, which guarantees the reliability of the live commodities and improves the shopping experience of users who shop through live videos.
In the embodiment of the description, the target video frame is audited against the attribute information of the target object, which improves the reliability of the target object and effectively suppresses improper information about the target object in the video to be processed.
Further, on the basis of obtaining the target video segment, in an optional implementation manner provided by the foregoing embodiment of this specification, after obtaining the target video segment of the target object, the method further includes:
naming the target video clip based on the attribute information of the target object to obtain a video name corresponding to the target video clip;
receiving a query request of a user for the target video clip, wherein the query request carries a video name of the target video clip;
and inquiring the target video clip based on the video name, and returning the target video clip to the user.
Specifically, when the target video clip is named based on the attribute information of the target object, representative attribute information such as the object name, shape, and efficacy can be selected from the attribute information and spliced together to form the video name of the target video clip. Naming improves the recognizability of the target object: a user can learn about the target video clip, and thus about the target object, from its name, which makes it easier to decide whether to watch the video, saves the user's selection time, and improves selection efficiency.
In practical applications, after the target video clip is named, a user query based on the video name can be received, so that the corresponding target video clip is quickly located by its video name, which increases the query efficiency for target video clips.
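As an illustrative sketch only (the attribute fields and the in-memory index are assumptions; a real system would presumably use a database or search service instead), naming and name-based lookup might look like:

```python
def name_clip(attributes):
    """Splice representative attribute fields into a video name."""
    parts = [attributes.get(key) for key in ("name", "shape", "efficacy")]
    return "_".join(p for p in parts if p)

clip_index = {}   # toy in-memory index from video name to clip

def register_clip(attributes, clip):
    clip_index[name_clip(attributes)] = clip

def query_clip(video_name):
    # Returns the clip to the user if the name is known, otherwise None.
    return clip_index.get(video_name)
```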
In addition, on the basis of obtaining the target video clip, a video cover for the target video clip may further be generated, so that the target object of the target video clip can be recognized through the video cover, increasing the recognition efficiency for the target video clip. In an optional implementation provided by the embodiment of this specification, after obtaining the target video clip of the target object, the method further includes:
detecting the proportion of a target image area of the target object in the target video frame;
and screening the video frame with the largest proportion from the target video frames to serve as a video cover of the target video clip.
Specifically, the proportion may be understood as the ratio of the area of the target image region containing the target object to the area of the target video frame, for example 50% or 60%. The video frame with the largest proportion is screened from the target video frames as the cover of the target video clip, which better identifies the clip, facilitates quick visual recognition of the target video clip based on its cover, and improves the recognition efficiency for the target video clip.
In practical applications, according to the area, height, or width of the target image region containing the target object, the video frame with the largest area, the greatest height, or the greatest width may be selected from the target video frames as the video cover of the target video clip, so as to highlight the target object in the cover.
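A minimal sketch of the cover selection, assuming each target video frame comes with an (x, y, w, h) bounding box for the target object (the box format is an assumption for illustration):

```python
def pick_cover(target_frames, target_boxes, frame_area):
    """Choose the frame whose target box covers the largest share of the
    frame as the video cover; target_boxes[i] is (x, y, w, h) for frame i."""
    def ratio(box):
        x, y, w, h = box
        return (w * h) / frame_area

    best = max(range(len(target_frames)), key=lambda i: ratio(target_boxes[i]))
    return target_frames[best]
```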
In addition, on the basis of determining the target video frames containing the target object, some highlight video frames may be selected from the target video frames and combined, so as to generate a video collection for the target object. In an optional implementation manner provided by an embodiment of this specification, after tracking the target object to determine the target video frames containing the target object in the video to be processed, the method further includes:
selecting a plurality of video frames meeting preset conditions from the target video frames, and combining the video frames meeting the preset conditions to generate the video collection for the target object.
Specifically, the preset condition may be understood as a preset display condition for the target object in the target video frame, such as a detail display of the target object (for example, a material display, a size display, a price display, or a display from different angles).
The video collection is formed by selecting from the target video frames the highlight frames for the target object, that is, the frames that can best show the characteristics of the target object, and stringing them together. Through the video collection, redundant information in the target video segment is omitted, the information of the target object can be grasped quickly, and the user's viewing experience is improved.
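As a sketch under the assumption that the display condition is given as a predicate over frames (how that predicate is evaluated, for example by a detail-shot classifier, is outside this illustration):

```python
def build_collection(target_frames, meets_condition):
    """Keep only the frames that satisfy the preset display condition
    (material, size, price, or multi-angle display) and string them together."""
    return [frame for frame in target_frames if meets_condition(frame)]
```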
To sum up, in the video processing method provided in the embodiment of the present specification, the target object is determined from the relative position of the first object and the second object in the first video frame of the video to be processed, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, and the identification of other objects in the video to be processed is skipped during tracking, which improves the tracking efficiency.
The following further describes the video processing method with reference to fig. 5, taking the application of the video processing method provided in this specification to a commodity live video as an example. Fig. 5 shows a processing flowchart of a video processing method applied to a commodity live video scene according to an embodiment of the present specification, which specifically includes the following steps:
step 502: merchandise and hand key points in a first video frame of a live video are detected.
Specifically, the live video refers to a live video containing commodities and an anchor, where the commodities include clothing, cosmetics, and/or electronic products, etc. A hand key point can be understood as a hand key point of the anchor who is live-broadcasting the commodities in the first video frame. Specifically, the hand key points include left/right wrist key points, individual finger key points, left/right palm center key points, and the like.
Step 504: determining first position information of a commodity in the first video frame, and determining second position information of the hand key point in the first video frame.
In practical applications, the hand key points can be located by a high-performance human key point detection algorithm.
Step 506: determining a distance between the item and the hand keypoint based on the first location information and the second location information.
Specifically, when a plurality of commodities are provided, the distance between each commodity and the hand key point is determined according to the first position information of each commodity and the second position information of the hand key point.
Step 508: and determining the commodities with the distance smaller than a preset distance threshold value as main commodities in the live video.
Specifically, the main commodity is the commodity being introduced by the anchor in the first video frame during the live broadcast. Determining the main commodity from the relative position of the detected commodities and the hand key points makes it possible to judge more accurately which commodity the anchor is explaining and to remove the interference of other commodities.
In addition, based on human body detection and commodity detection, the main commodity can also be determined using the simpler positional relationship between the human body and the commodity together with the relative position of the commodity in the first video frame; alternatively, the positional relationship between a detected hand position and the commodity can be used in place of the human hand key points.
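For illustration only, steps 506 to 508 might be sketched as follows; the (x, y, w, h) box format, the Euclidean distance measure, and the threshold value are assumptions rather than requirements of this specification:

```python
import math

def main_commodities(commodity_boxes, hand_keypoints, distance_threshold):
    """Return the indices of commodities whose center lies within the preset
    distance of any detected hand key point; hand_keypoints is assumed to be
    a non-empty list of (x, y) coordinates."""
    def center(box):                              # box = (x, y, w, h)
        x, y, w, h = box
        return (x + w / 2.0, y + h / 2.0)

    selected = []
    for i, box in enumerate(commodity_boxes):
        cx, cy = center(box)
        distance = min(math.hypot(cx - kx, cy - ky) for kx, ky in hand_keypoints)
        if distance < distance_threshold:         # step 508: below the threshold
            selected.append(i)
    return selected
```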
Step 510: and tracking the main commodity to determine a target video frame containing the main commodity in a live video.
Step 512: and sequencing the target video frames according to the playing sequence in the live video.
Step 514: and taking the target video frame arranged at the first position as an initial frame, taking the target video frame arranged at the last position as an end frame, and segmenting the live video to obtain a target video clip of the main commodity.
Step 516: and extracting a first target video frame in the target video clip according to a preset extraction rule.
Step 518: comparing the characteristics of the first target video frame with preset commodity images in a preset commodity library to judge whether the main commodity is a preset commodity in the preset commodity library;
step 520: and determining that the target video clip is successfully audited under the condition that the main commodity is a preset commodity in the preset commodity library.
To sum up, the video processing method provided in the embodiment of the present specification determines the main commodity being live-broadcast by detecting the distance between the commodities and the hand key points in the first video frame of the live video, which improves the efficiency and accuracy of determining the main commodity. The main commodity is then tracked on that basis to determine the video frames containing it; only the main commodity is tracked while the identification of other commodities in the live video is skipped, which improves the tracking efficiency.
Referring to fig. 6, fig. 6 is a flowchart illustrating a second video processing method according to an embodiment of the present specification, which specifically includes the following steps:
step 602: a video input interface is presented for a user based on the user's video processing request.
Step 604: and receiving the video to be processed sent by the user based on the video input interface.
Step 606: detecting a first object and a second object in a first video frame of the video to be processed.
Step 608: first position information of the first object in the first video frame is determined, and second position information of the second object in the first video frame is determined.
Step 610: determining a target object based on the first location information and the second location information.
Step 612: and tracking the target object to determine a target video frame containing the target object in the video to be processed and returning the target video frame to the user.
In the embodiment of the present specification, after the video to be processed sent by the user is received, the video processing method determines the target object according to the relative position of the first object and the second object in the first video frame of the video to be processed, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, the identification of other objects in the video to be processed is skipped during tracking, and the tracking efficiency is improved.
The foregoing is a schematic scheme of the second video processing method of the present embodiment. It should be noted that the technical solution of the second video processing method belongs to the same concept as the technical solution of the first video processing method, and details of the technical solution of the second video processing method, which are not described in detail, can be referred to the description of the technical solution of the first video processing method.
Referring to fig. 7, fig. 7 is a flowchart illustrating a third video processing method according to an embodiment of the present disclosure, which specifically includes the following steps:
step 702: receiving a video processing request sent by a user, wherein the video processing request carries a video to be processed.
Step 704: detecting a first object and a second object in a first video frame of the video to be processed.
Step 706: first position information of the first object in the first video frame is determined, and second position information of the second object in the first video frame is determined.
Step 708: determining a target object based on the first location information and the second location information.
Step 710: and tracking the target object to determine a target video frame containing the target object in the video to be processed and returning the target video frame to the user.
In the embodiment of the present description, after a video processing request for a video to be processed is received, the video processing method determines the target object according to the relative position of the first object and the second object in the first video frame of the video to be processed, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, the identification of other objects in the video to be processed is skipped during tracking, and the tracking efficiency is improved.
The above is a schematic scheme of the third video processing method of the present embodiment. It should be noted that the technical solution of the third video processing method belongs to the same concept as the technical solution of the first video processing method, and details that are not described in detail in the technical solution of the third video processing method can be referred to the description of the technical solution of the first video processing method.
Referring to fig. 8, fig. 8 is a flowchart illustrating a fourth video processing method according to an embodiment of the present disclosure, which specifically includes the following steps:
step 802: based on query data input by a user, query keywords are determined.
Specifically, the query keyword refers to a keyword extracted or identified from query data, and is used for performing video query.
Step 804: and determining the video to be processed corresponding to the query keyword according to the query keyword.
In practical applications, the keywords (or tags) carried by each video can be determined in advance. After the query keywords are determined, they are compared with the keywords corresponding to each video; if a keyword matching a query keyword exists among the keywords of a video, that video is determined as the to-be-processed video corresponding to the query keyword.
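A toy sketch of this keyword matching (the tag index is an assumed in-memory structure; a production system would presumably use an inverted index or a search engine):

```python
def match_videos(query_keywords, video_tags):
    """video_tags maps a video id to its predetermined keyword/tag set; a video
    qualifies as 'to be processed' if any of its tags matches a query keyword."""
    query = set(query_keywords)
    return [video_id for video_id, tags in video_tags.items() if query & set(tags)]
```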
Step 806: and determining a target object according to the first position information of the first object and the second position information of the second object in the first video frame of the video to be processed.
Optionally, the determining a target object according to the first position information of the first object and the second position information of the second object in the first video frame of the video to be processed includes:
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining the target object based on the first location information and the second location information.
Step 808: and tracking the target object to determine that a target video frame containing the target object in the video to be processed is displayed to the user.
Optionally, the tracking the target object to determine that a target video frame containing the target object in the video to be processed is displayed to the user includes:
tracking the target object to determine a target video frame containing the target object in the video to be processed;
and extracting a display video frame from the target video frame according to a first preset extraction rule, and displaying the display video frame to the user.
Specifically, the first preset extraction rule includes random extraction, extraction of the first frame, extraction of the last frame, extraction of the video frame in which the target object occupies the largest proportion, and the like, which is not limited here. The display video frame is thus extracted from the target video frames according to the first preset extraction rule and is displayed to the user as the query result.
In the embodiment of the description, the video processing method determines the to-be-processed video corresponding to the query keywords based on the query data input by the user, and determines the target object according to the relative position of the first object and the second object in the first video frame of that video, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, the identification of other objects in the video to be processed is skipped during tracking, and the tracking efficiency is improved; the target video frame containing the target object is displayed to the user as the query result, which improves the user's query experience.
The foregoing is a schematic scheme of the fourth video processing method of the present embodiment. It should be noted that the technical solution of the fourth video processing method belongs to the same concept as the technical solution of the first video processing method, and details of the technical solution of the fourth video processing method, which are not described in detail, can be referred to the description of the technical solution of the first video processing method.
Corresponding to the above method embodiment, this specification further provides an embodiment of a video processing apparatus, and fig. 9 shows a schematic diagram of a first video processing apparatus provided in an embodiment of this specification. As shown in fig. 9, the apparatus includes:
a first object detection module 902 configured to detect a first object and a second object in a first video frame of a video to be processed;
a first determine location module 904 configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a first determine object module 906 configured to determine a target object based on the first location information and the second location information;
a first determine video frame module 908 configured to track the target object to determine a target video frame containing the target object in the to-be-processed video.
Optionally, the first determine object module 906 is further configured to:
determining a distance between the first object and the second object based on the first location information and the second location information;
determining the first object as the target object if the distance is less than a preset distance threshold.
Optionally, the first video frame is an ith video frame;
accordingly, the first determine video frame module 908 is further configured to:
determining a first image area according to the target object in the ith video frame, and determining a second image area in the (i+1)th video frame based on the first image area;
performing feature extraction on the first image area to obtain a first feature, and performing feature extraction on the second image area to obtain a second feature;
performing similarity calculation on the first feature and the second feature, and taking the ith video frame and the (i+1)th video frame as target video frames under the condition that the calculation result is greater than or equal to a similarity threshold value;
and increasing i by 1, and continuing to determine a first image area according to the target object in the ith video frame.
Optionally, the first determine video frame module 908 is further configured to:
and taking the ith video frame as a target video frame when the calculation result is smaller than the similarity threshold value.
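For illustration only, the frame-by-frame tracking performed by this module might be sketched as follows; the feature extractor, the cropping helper, and the similarity threshold are assumptions, and the derivation of the second image area is simplified here to reusing the same coordinates in the next frame:

```python
import numpy as np

def track(frames, start_index, first_region, crop, extract_features, sim_threshold):
    """Track the target from frame i to frame i+1 by comparing features of the
    target region; stop once the similarity falls below the threshold."""
    target_frames = [start_index]
    i = start_index
    while i + 1 < len(frames):
        first_feature = extract_features(crop(frames[i], first_region))
        second_feature = extract_features(crop(frames[i + 1], first_region))
        similarity = float(np.dot(first_feature, second_feature) /
                           (np.linalg.norm(first_feature) *
                            np.linalg.norm(second_feature) + 1e-8))
        if similarity < sim_threshold:      # target no longer present in frame i+1
            break
        target_frames.append(i + 1)         # both frames are target video frames
        i += 1                              # increase i by 1 and continue
    return target_frames
```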
Optionally, the apparatus further includes:
the recording time module is configured to record a jth tracking start time of the target object, and execute detection of a first object and a second object in an ith video frame of a video to be processed under the condition that a time interval between a current time and the jth tracking start time is detected to be greater than or equal to a preset time threshold, wherein the ith video frame is a video frame tracked in the video to be processed by the current time;
a fourth determine location module configured to determine first location information of a first object in the ith video frame and to determine second location information of a second object in the ith video frame;
a determine object module configured to determine a target object in the ith video frame based on first position information of a first object in the ith video frame and second position information of a second object in the ith video frame;
and the coincidence degree calculation module is configured to perform coincidence degree calculation according to the target position information of the target object in the ith video frame and the area position information of the first image area, and when the coincidence degree calculation result is smaller than a coincidence degree threshold value, the image area corresponding to the target position information is used as the first image area, j is increased by 1, and the j-th tracking start time of the target object is recorded.
Optionally, the apparatus further includes:
and the tracking module is configured to take a target object in the ith video frame as a first target object and track the first target object to determine a target video frame containing the first target object in the video to be processed under the condition that the coincidence calculation result is greater than or equal to the coincidence threshold value.
Optionally, the apparatus further includes:
the sequencing module is configured to sequence the target video frames according to a playing sequence in the video to be processed;
and the segmentation module is configured to segment the to-be-processed video to obtain a target video segment of the target object by taking the target video frame ranked at the first position as an initial frame and the target video frame ranked at the last position as an end frame.
Optionally, the apparatus further includes:
the extraction module is configured to extract a first target video frame in the target video segment according to a preset extraction rule;
the characteristic comparison module is configured to compare characteristics of the first target video frame with preset object images in a preset object library, and judge whether the target object is a preset object in the preset object library;
and the verification success determining module is configured to determine that verification is successful on the target video clip when the target object is a preset object in the preset object library.
Optionally, the apparatus further includes:
the attribute determining module is configured to determine attribute information of the target object under the condition that the target object is a preset object in the preset object library;
an auditing module configured to audit the target video frame based on attribute information of the target object.
Optionally, the apparatus further includes:
the naming module is configured to name the target video clip based on the attribute information of the target object to obtain a video name corresponding to the target video clip;
a query request receiving module configured to receive a query request of a user for the target video segment, wherein the query request carries a video name of the target video segment;
a query module configured to query the target video segment based on the video name and return the target video segment to the user.
Optionally, the apparatus further includes:
a detection proportion module configured to detect proportion of a target image area of the target object in the target video frame;
a screening module configured to screen the video frame with the largest proportion from the target video frames as a video cover of the target video clip.
Optionally, the apparatus further includes:
the selection module is configured to select a plurality of video frames meeting preset conditions from target video frames, combine the video frames meeting the preset conditions, and generate a video collection aiming at the target object.
Optionally, the first object is a commodity, the second object includes: a human body and/or a hand key point of the human body, and the video to be processed is a live video.
Optionally, in a case that the second object is the hand key point, the first object detection module 902 further includes:
a first detection module configured to perform article detection on the first video frame through a first detection algorithm to obtain the commodity;
the second detection module is configured to perform key point detection on the human body in the first video frame through a second detection algorithm to obtain human body key points of the human body;
a determine keypoints module configured to determine the hand keypoints of the human keypoints.
In summary, the video processing apparatus provided in the embodiment of the present specification determines the target object from the relative position of the first object and the second object in the first video frame of the video to be processed, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, and the identification of other objects in the video to be processed is skipped during tracking, which improves the tracking efficiency.
The above is a schematic configuration of the first video processing apparatus of the present embodiment. It should be noted that the technical solution of the video processing apparatus belongs to the same concept as that of the first video processing method, and details of the technical solution of the video processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the first video processing method.
Corresponding to the above method embodiment, this specification also provides an embodiment of a video processing apparatus, and fig. 10 shows a schematic diagram of a second video processing apparatus provided in an embodiment of this specification. As shown in fig. 10, the apparatus includes:
an interface presentation module 1002 configured to present a video input interface for a user based on a video processing request of the user;
a video receiving module 1004 configured to receive the video to be processed sent by the user based on the video input interface;
a second detected object module 1006 configured to detect a first object and a second object in a first video frame of the video to be processed;
a second determine location module 1008 configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a second determine object module 1010 configured to determine a target object based on the first location information and the second location information;
a second determine video frame module 1012 configured to track the target object to determine a target video frame containing the target object in the to-be-processed video and return the target video frame to the user.
In the embodiment of the present specification, after receiving a video to be processed sent by a user, the video processing apparatus determines a target object according to a relative position of a first object and a second object in a first video frame of the video to be processed, so as to improve the efficiency and accuracy of determining the target object, tracks the target object on the basis of determining the target object, and ignores identification of other objects in the video to be processed in a tracking process, thereby improving the tracking efficiency.
The above is a schematic configuration of the second video processing apparatus of the present embodiment. It should be noted that the technical solution of the video processing apparatus and the technical solution of the second video processing method belong to the same concept, and details that are not described in detail in the technical solution of the video processing apparatus can be referred to the description of the technical solution of the second video processing method.
Corresponding to the above method embodiment, this specification further provides an embodiment of a video processing apparatus, and fig. 11 shows a schematic diagram of a third video processing apparatus provided in an embodiment of this specification. As shown in fig. 11, the apparatus includes:
a receiving request module 1102 configured to receive a video processing request sent by a user, where the video processing request carries a video to be processed;
a third detected object module 1104 configured to detect a first object and a second object in a first video frame of the video to be processed;
a third determine location module 1106 configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a third determine object module 1108 configured to determine a target object based on the first location information and the second location information;
a third video frame determining module 1110, configured to track the target object, to determine a target video frame containing the target object in the to-be-processed video, and return the target video frame to the user.
In the embodiment of the present description, after a video processing request for a video to be processed is received, the video processing apparatus determines the target object according to the relative position of the first object and the second object in the first video frame of the video to be processed, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, the identification of other objects in the video to be processed is skipped during tracking, and the tracking efficiency is improved.
The above is a schematic arrangement of the third video processing apparatus of the present embodiment. It should be noted that the technical solution of the video processing apparatus and the technical solution of the third video processing method belong to the same concept, and details that are not described in detail in the technical solution of the video processing apparatus can be referred to the description of the technical solution of the third video processing method.
Corresponding to the above method embodiment, this specification further provides a video processing apparatus embodiment, and fig. 12 shows a schematic diagram of a fourth video processing apparatus provided in an embodiment of this specification. As shown in fig. 12, the apparatus includes:
a determine keywords module 1202 configured to determine query keywords based on query data input by a user;
a to-be-processed video determining module 1204, configured to determine, according to the query keyword, a to-be-processed video corresponding to the query keyword;
a fourth object determining module 1206, configured to determine a target object according to the first position information of the first object and the second position information of the second object in the first video frame of the video to be processed;
a fourth video frame determining module 1208, configured to track the target object, so as to determine that a target video frame containing the target object in the to-be-processed video is displayed to the user.
Optionally, the fourth determine object module 1206 is further configured to:
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining the target object based on the first location information and the second location information.
Optionally, the fourth determine video frame module 1208 is further configured to:
tracking the target object to determine a target video frame containing the target object in the video to be processed;
and extracting a display video frame from the target video frame according to a first preset extraction rule, and displaying the display video frame to the user.
In the embodiment of the description, the video processing apparatus determines the to-be-processed video corresponding to the query keywords based on the query data input by the user, and determines the target object according to the relative position of the first object and the second object in the first video frame of that video, which improves the efficiency and accuracy of determining the target object. The target object is then tracked on that basis, the identification of other objects in the video to be processed is skipped during tracking, and the tracking efficiency is improved; the target video frame containing the target object is displayed to the user as the query result, which improves the user's query experience.
The foregoing is a schematic configuration of the fourth video processing apparatus of the present embodiment. It should be noted that the technical solution of the video processing apparatus and the technical solution of the fourth video processing method belong to the same concept, and details that are not described in detail in the technical solution of the video processing apparatus can be referred to the description of the technical solution of the fourth video processing method.
FIG. 13 illustrates a block diagram of a computing device 1300 provided according to one embodiment of the present description. The components of the computing device 1300 include, but are not limited to, a memory 1310 and a processor 1320. The processor 1320 is coupled to the memory 1310 via the bus 1330, and the database 1350 is used to store data.
Computing device 1300 also includes an access device 1340 that enables computing device 1300 to communicate via one or more networks 1360. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 1340 may include one or more of any type of network interface, wired or wireless, e.g., a Network Interface Card (NIC), an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1300 and other components not shown in FIG. 13 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 13 is for purposes of example only and is not limiting as to the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1300 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1300 can also be a mobile or stationary server.
The processor 1320 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the video processing method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the video processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the video processing method.
An embodiment of the present specification further provides a computer readable storage medium storing computer instructions, which when executed by a processor, implement the steps of the video processing method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned video processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned video processing method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals, in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Further, those skilled in the art should also appreciate that the embodiments described in this specification are preferred embodiments and that acts and modules referred to are not necessarily required for an embodiment of the specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the embodiments. The specification is limited only by the claims and their full scope and equivalents.

Claims (25)

1. A video processing method, comprising:
detecting a first object and a second object in a first video frame of a video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining a target object based on the first location information and the second location information;
and tracking the target object to determine a target video frame containing the target object in the video to be processed.
2. The video processing method of claim 1, the determining a target object based on the first location information and the second location information, comprising:
determining a distance between the first object and the second object based on the first location information and the second location information;
determining the first object as the target object if the distance is less than a preset distance threshold.
3. The video processing method of claim 1, the first video frame being an ith video frame;
correspondingly, the tracking the target object to determine a target video frame containing the target object in the video to be processed includes:
determining a first image area according to the target object in the ith video frame, and determining a second image area in the (i+1)th video frame based on the first image area;
performing feature extraction on the first image area to obtain a first feature, and performing feature extraction on the second image area to obtain a second feature;
performing similarity calculation on the first feature and the second feature, and taking the ith video frame and the (i+1)th video frame as target video frames under the condition that the calculation result is greater than or equal to a similarity threshold value;
and increasing i by 1, and continuing to determine a first image area according to the target object in the ith video frame.
4. The video processing method of claim 3, after calculating the similarity of the first feature to the second feature, further comprising:
and taking the ith video frame as a target video frame when the calculation result is smaller than the similarity threshold value.
5. The video processing method of claim 3, further comprising:
recording the tracking start time of the target object for the jth time, and detecting a first object and a second object in an ith video frame of a video to be processed under the condition that the time interval between the current time and the tracking start time of the jth time is greater than or equal to a preset time threshold, wherein the ith video frame is a video frame tracked in the video to be processed by the current time;
determining first position information of a first object in the ith video frame and determining second position information of a second object in the ith video frame;
determining a target object in the ith video frame based on first position information of a first object in the ith video frame and second position information of a second object in the ith video frame;
and performing coincidence calculation according to the target position information of the target object in the ith video frame and the area position information of the first image area, taking the image area corresponding to the target position information as the first image area under the condition that the coincidence calculation result is smaller than a coincidence threshold, increasing the number j by 1, and recording the tracking start time of the target object for the jth time.
6. The video processing method according to claim 5, further comprising, after performing the overlap ratio calculation according to the target position information of the target object in the ith video frame and the region position information of the first image region:
and under the condition that the coincidence degree calculation result is greater than or equal to the coincidence degree threshold value, taking a target object in the ith video frame as a first target object, and tracking the first target object to determine a target video frame containing the first target object in the video to be processed.
7. The video processing method according to claim 1, after tracking the target object to determine a target video frame containing the target object in the video to be processed, further comprising:
sequencing the target video frames according to the playing sequence in the video to be processed;
and taking the target video frame arranged at the first position as an initial frame, taking the target video frame arranged at the last position as an end frame, and segmenting the video to be processed to obtain a target video segment of the target object.
8. The video processing method of claim 7, after obtaining the target video segment of the target object, further comprising:
extracting a first target video frame in the target video clip according to a preset extraction rule;
comparing the characteristics of the first target video frame with preset object images in a preset object library to judge whether the target object is a preset object in the preset object library;
and determining that the target video clip is successfully audited under the condition that the target object is a preset object in the preset object library.
9. The video processing method according to claim 8, wherein said determining whether the target object is a preset object in the preset object library further comprises:
determining attribute information of the target object under the condition that the target object is a preset object in the preset object library;
and auditing the target video clip based on the attribute information of the target object.
10. The video processing method of claim 7, after obtaining the target video segment of the target object, further comprising:
naming the target video clip based on the attribute information of the target object to obtain a video name corresponding to the target video clip;
receiving a query request of a user for the target video clip, wherein the query request carries a video name of the target video clip;
and inquiring the target video clip based on the video name, and returning the target video clip to the user.
11. The video processing method of claim 7, after obtaining the target video segment of the target object, further comprising:
detecting the proportion of a target image area of the target object in the target video frame;
and screening the video frame with the largest proportion from the target video frames to serve as a video cover of the target video clip.
12. The video processing method according to claim 1, after tracking the target object to determine a target video frame containing the target object in the video to be processed, further comprising:
selecting a plurality of video frames meeting preset conditions from the target video frames, and combining the video frames meeting the preset conditions to generate the video collection aiming at the target object.
13. The video processing method according to any one of claims 1 to 12, wherein the first object is a commodity, the second object includes: a human body and/or a hand key point of the human body, and the video to be processed is a live video.
14. The video processing method according to claim 13, wherein in a case where the second object is the hand key point, the detecting a first object and a second object in a first video frame of the video to be processed includes:
carrying out article detection on the first video frame through a first detection algorithm to obtain the commodity;
detecting key points of the human body in the first video frame through a second detection algorithm to obtain human body key points of the human body;
determining the hand keypoints of the human keypoints.
15. A video processing method, comprising:
displaying a video input interface for a user based on a video processing request of the user;
receiving a video to be processed sent by the user based on the video input interface;
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining a target object based on the first location information and the second location information;
and tracking the target object to determine a target video frame containing the target object in the video to be processed and returning the target video frame to the user.
16. A video processing method, comprising:
receiving a video processing request sent by a user, wherein the video processing request carries a video to be processed;
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining a target object based on the first location information and the second location information;
and tracking the target object to determine a target video frame containing the target object in the video to be processed and returning the target video frame to the user.
17. A video processing method, comprising:
determining a query keyword based on query data input by a user;
determining a to-be-processed video corresponding to the query keyword according to the query keyword;
determining a target object according to first position information of a first object and second position information of a second object in a first video frame of the video to be processed;
and tracking the target object to determine that a target video frame containing the target object in the video to be processed is displayed to the user.
18. The video processing method of claim 17, wherein the determining a target object according to the first position information of the first object and the second position information of the second object in the first video frame of the video to be processed comprises:
detecting a first object and a second object in a first video frame of the video to be processed;
determining first position information of the first object in the first video frame and determining second position information of the second object in the first video frame;
determining the target object based on the first location information and the second location information.
19. The video processing method of claim 17, wherein the tracking the target object to determine that a target video frame containing the target object in the video to be processed is presented to the user comprises:
tracking the target object to determine a target video frame containing the target object in the video to be processed;
and extracting a display video frame from the target video frame according to a first preset extraction rule, and displaying the display video frame to the user.
20. A video processing apparatus comprising:
a first object detection module configured to detect a first object and a second object in a first video frame of a video to be processed;
a first determine location module configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a first determined object module configured to determine a target object based on the first location information and the second location information;
a first video frame determining module configured to track the target object to determine a target video frame containing the target object in the video to be processed.
21. A video processing apparatus comprising:
an interface presentation module configured to present a video input interface for a user based on a video processing request of the user;
the video receiving module is configured to receive a video to be processed sent by the user based on the video input interface;
a second detection object module configured to detect a first object and a second object in a first video frame of the video to be processed;
a second determine location module configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a second determined object module configured to determine a target object based on the first location information and the second location information;
and the second video frame determining module is configured to track the target object so as to determine a target video frame containing the target object in the video to be processed and return the target video frame to the user.
22. A video processing apparatus comprising:
the device comprises a receiving request module, a processing request module and a processing module, wherein the receiving request module is configured to receive a video processing request sent by a user, and the video processing request carries a video to be processed;
a third object detection module configured to detect a first object and a second object in a first video frame of the video to be processed;
a third determine location module configured to determine first location information of the first object in the first video frame and to determine second location information of the second object in the first video frame;
a third determined object module configured to determine a target object based on the first location information and the second location information;
and the third video frame determining module is configured to track the target object so as to determine a target video frame containing the target object in the video to be processed and return the target video frame to the user.
23. A video processing apparatus comprising:
a keyword determination module configured to determine a query keyword based on query data input by a user;
the video module to be processed is configured to determine a video to be processed corresponding to the query keyword according to the query keyword;
a fourth object determining module configured to determine a target object according to the first position information of the first object and the second position information of the second object in the first video frame of the video to be processed;
and the fourth video frame determining module is configured to track the target object so as to determine that a target video frame containing the target object in the video to be processed is displayed to the user.
24. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, which when executed by the processor implement the steps of the video processing method of any of claims 1-14, 15, 16, 17-19.
25. A computer readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the video processing method of any of claims 1-14, 15, 16, 17-19.
CN202011319647.2A 2020-11-23 2020-11-23 Video processing method and device Active CN113849687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011319647.2A CN113849687B (en) 2020-11-23 2020-11-23 Video processing method and device

Publications (2)

Publication Number Publication Date
CN113849687A (en) 2021-12-28
CN113849687B (en) 2022-10-28

Family

ID=78972975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011319647.2A Active CN113849687B (en) 2020-11-23 2020-11-23 Video processing method and device

Country Status (1)

Country Link
CN (1) CN113849687B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108513139A (en) * 2018-04-02 2018-09-07 腾讯科技(深圳)有限公司 Virtual objects recognition methods, device, storage medium and equipment in net cast
CN110189364A (en) * 2019-06-04 2019-08-30 北京字节跳动网络技术有限公司 For generating the method and apparatus and method for tracking target and device of information
CN111354013A (en) * 2020-03-13 2020-06-30 北京字节跳动网络技术有限公司 Target detection method and device, equipment and storage medium
US20200228720A1 (en) * 2017-06-16 2020-07-16 Hangzhou Hikvision Digital Technology Co., Ltd. Target Object Capturing Method and Device, and Video Monitoring Device
CN111683267A (en) * 2019-03-11 2020-09-18 阿里巴巴集团控股有限公司 Method, system, device and storage medium for processing media information
CN111932582A (en) * 2020-06-04 2020-11-13 广东技术师范大学 Target tracking method and device in video image

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114401417A (en) * 2022-01-28 2022-04-26 广州方硅信息技术有限公司 Live stream object tracking method and device, equipment and medium thereof
CN114401417B (en) * 2022-01-28 2024-02-06 广州方硅信息技术有限公司 Live stream object tracking method, device, equipment and medium thereof
CN114449362A (en) * 2022-03-17 2022-05-06 腾讯科技(上海)有限公司 Video cover selecting method, device, equipment and storage medium
CN114449362B (en) * 2022-03-17 2023-08-22 腾讯科技(上海)有限公司 Video cover selection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113849687B (en) 2022-10-28

Similar Documents

Publication Publication Date Title
US10032072B1 (en) Text recognition and localization with deep learning
US9317778B2 (en) Interactive content generation
Kuanar et al. Video key frame extraction through dynamic Delaunay clustering with a structural constraint
CN106776619B (en) Method and device for determining attribute information of target object
EP3028184B1 (en) Method and system for searching images
CN107633066B (en) Information display method and device, electronic equipment and storage medium
CN109871490B (en) Media resource matching method and device, storage medium and computer equipment
CA2804439A1 (en) System and method for categorizing an image
CN112738556B (en) Video processing method and device
Jing et al. A new method of printed fabric image retrieval based on color moments and gist feature description
CN113849687B (en) Video processing method and device
CN112364204A (en) Video searching method and device, computer equipment and storage medium
CN112036981A (en) Method, device, equipment and medium for providing target comparison commodities
Yu et al. Key point detection by max pooling for tracking
Abed et al. KeyFrame extraction based on face quality measurement and convolutional neural network for efficient face recognition in videos
JPH11250106A (en) Method for automatically retrieving registered trademark through the use of content-based video information
US20220172271A1 (en) Method, device and system for recommending information, and storage medium
Jiao et al. Deep combining of local phase quantization and histogram of oriented gradients for indoor positioning based on smartphone camera
CN115049962A (en) Video clothing detection method, device and equipment
CN115186165A (en) Mobile electronic commerce image searching and shopping method
Milanova et al. Markerless 3D virtual glasses try-on system
Nguyen et al. Smart shopping assistant: A multimedia and social media augmented system with mobile devices to enhance customers’ experience and interaction
Kansal et al. CARF-Net: CNN attention and RNN fusion network for video-based person reidentification
Qi et al. Exploring reliable infrared object tracking with spatio-temporal fusion transformer
Li et al. Person re-identification based on multi-region-set ensembles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230912

Address after: Room 516, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba Damo Institute (Hangzhou) Technology Co., Ltd.

Address before: Fourth Floor, One Capital Place, P.O. Box 847, George Town, Grand Cayman, Cayman Islands (UK)

Patentee before: ALIBABA GROUP HOLDING Ltd.