CN110084835B

CN110084835B - Method and apparatus for processing video

Info

Publication number: CN110084835B
Application number: CN201910493426.8A
Authority: CN
Inventors: 陈奇
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Douyin Vision Co Ltd; Douyin Vision Beijing Co Ltd
Priority date: 2019-06-06
Filing date: 2019-06-06
Publication date: 2020-08-21
Anticipated expiration: 2039-06-06
Also published as: CN110084835A

Abstract

Embodiments of the present disclosure disclose methods and apparatus for processing video. One embodiment of the method comprises: acquiring an image area selected by a user from a target video frame of a target video as a target image area; based on the target video frame and the target image area, performing the steps of: acquiring candidate video frames from a target video; determining at least two candidate image areas with the same size as the target image area from the candidate video frames; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a result image region matched with the target image region from the at least two candidate image regions based on the matching result; in response to determining that the target video does not include a video frame that follows the candidate video frame, the target image region and the resulting image region are determined to be the pending image region of the target video. The method and the device can track various objects in the video, and improve the tracking universality.

Description

Method and apparatus for processing video

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for processing a video.

Background

Currently, for a shot video, some processing (e.g., adding a special effect) is sometimes required for a certain object in the video. When an object needing to be processed continuously appears in the video, the object can be tracked so as to realize uniform processing of the object in the video.

In the prior art, tracking of an object in a video is generally tracking of a predetermined target object, such as tracking of a human face object, tracking of a vehicle object, and the like. In practice, tracking for a target object is typically achieved based on features of the target object (e.g., facial features).

Disclosure of Invention

Embodiments of the present disclosure propose methods and apparatuses for processing video.

In a first aspect, an embodiment of the present disclosure provides a method for processing video, the method including: acquiring an image area selected by a user from a target video frame of a target video as a target image area; based on the target video frame and the target image area, performing the following tracking steps: acquiring a candidate video frame from a target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows the candidate video frame; in response to determining not to include, determining the target image area and the result image area as image areas to be processed of the target video.

In some embodiments, determining at least two image regions of the same size as the target image region from the candidate video frames as candidate image regions comprises: determining an enlarged image area for the target image area from the candidate video frame, wherein the enlarged image area comprises an image area corresponding to the target image area; at least two image regions having the same size as the target image region are determined as candidate image regions from the enlarged image region.

In some embodiments, matching the at least two candidate image regions and the target image region respectively, and obtaining the matching result includes: respectively extracting image characteristics of the target image area and the determined at least two candidate image areas; for the candidate image area in the determined at least two candidate image areas, carrying out correlation operation on the image characteristics of the candidate image area and the image characteristics of the target image area to obtain correlation degree as a matching result, wherein the correlation degree is used for representing the correlation degree of the candidate image area and the target image area; and determining, as the result image region, a candidate image region that matches the target image region from among the at least two candidate image regions based on the matching result includes: based on the correlation, a candidate image region related to the target image region is determined as a result image region from among the at least two candidate image regions.

In some embodiments, extracting image features of the target image region and the determined at least two candidate image regions, respectively, comprises: and respectively extracting directional gradient histogram characteristics of the target image area and the determined at least two candidate image areas.

In some embodiments, the method further comprises: in response to determining that the target video includes a video frame subsequent to the candidate video frame, using the last acquired candidate video frame as the target video frame and using the last determined result image area as the target image area, continuing to perform the tracking step.

In some embodiments, the method further comprises: acquiring an image to be added; and fusing the image to be added and the image area to be processed of the target video to add the image to be added into the image area to be processed of the target video and obtain the processed video.

In some embodiments, the method further comprises: and sending the obtained processed video to a user terminal in communication connection, and controlling the user terminal to play the processed video.

In a second aspect, an embodiment of the present disclosure provides an apparatus for processing video, the apparatus including: a first acquisition unit configured to acquire an image area selected by a user from a target video frame of a target video as a target image area; a first tracking unit configured to perform the following tracking steps based on the target video frame and the target image area: acquiring a candidate video frame from a target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows the candidate video frame; in response to determining not to include, determining the target image area and the result image area as image areas to be processed of the target video.

In some embodiments, the first tracking unit is further configured to: determining an enlarged image area for the target image area from the candidate video frame, wherein the enlarged image area comprises an image area corresponding to the target image area; at least two image regions having the same size as the target image region are determined as candidate image regions from the enlarged image region.

In some embodiments, the first tracking unit is further configured to: respectively extracting image characteristics of the target image area and the determined at least two candidate image areas; for the candidate image area in the determined at least two candidate image areas, carrying out correlation operation on the image characteristics of the candidate image area and the image characteristics of the target image area to obtain correlation degree as a matching result, wherein the correlation degree is used for representing the correlation degree of the candidate image area and the target image area; based on the correlation, a candidate image region related to the target image region is determined as a result image region from among the at least two candidate image regions.

In some embodiments, the first tracking unit is further configured to: and respectively extracting directional gradient histogram characteristics of the target image area and the determined at least two candidate image areas.

In some embodiments, the apparatus further comprises: a second tracking unit configured to continue performing the tracking step using the candidate video frame acquired last time as the target video frame and using the result image area determined last time as the target image area in response to determining that the target video includes a video frame located after the candidate video frame.

In some embodiments, the apparatus further comprises: a second acquisition unit configured to acquire an image to be added; and the image fusion unit is configured to fuse the image to be added and the image area to be processed of the target video so as to add the image to be added to the image area to be processed of the target video and obtain a processed video.

In some embodiments, the apparatus further comprises: and the video sending unit is configured to send the obtained processed video to a user terminal of the communication connection and control the user terminal to play the processed video.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for processing video described above.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the method of any of the above-described methods for processing video.

According to the method and the device for processing the video, an image area selected by a user from a target video frame of a target video is obtained and used as a target image area; based on the target video frame and the target image area, performing the following tracking steps: acquiring a candidate video frame from a target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows the candidate video frame; and in response to the determination that the target image area and the result image area are not included, determining the target image area and the result image area as to-be-processed image areas of the target video, so that the object in the target image area selected by the user can be tracked based on an image matching mode, thereby realizing the tracking of various objects in the video, improving the universality of tracking and being beneficial to realizing the uniform processing of any object in the video.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram for one embodiment of a method for processing video, according to the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for processing video in accordance with an embodiment of the present disclosure;

FIG. 4 is a flow diagram of yet another embodiment of a method for processing video according to the present disclosure;

FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for processing video according to the present disclosure;

FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the disclosed method for processing video or apparatus for processing video may be applied.

As shown in fig. 1, the system architecture 100 may include

terminal devices

101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the

terminal devices

101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user may use the

terminal devices

101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a video processing application, a web browser application, a search application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the

terminal devices

101, 102, 103.

The

terminal apparatuses

101, 102, and 103 may be hardware or software. When the

terminal devices

101, 102, and 103 are hardware, they may be various electronic devices with cameras, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like. When the

terminal apparatuses

101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.

The server 105 may be a server that provides various services, such as a video processing server that processes target video obtained by shooting by the

terminal apparatuses

101, 102, 103. The video processing server may perform processing such as analysis on the received data of the target video and obtain a processing result (e.g., a to-be-processed image area of the target video).

It should be noted that the method for processing video provided by the embodiment of the present disclosure may be executed by the

terminal devices

101, 102, and 103, or may be executed by the server 105, and accordingly, the apparatus for processing video may be disposed in the

terminal devices

101, 102, and 103, or may be disposed in the server 105.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where data used in obtaining the to-be-processed image area of the target video does not need to be acquired from a remote place, the above system architecture may include no network but only a terminal device or a server.

With continued reference to fig. 2, a flow 200 of one embodiment of a method for processing video in accordance with the present disclosure is shown. The method for processing video comprises the following steps:

step 201, an image area selected by a user from a target video frame of a target video is obtained as a target image area.

In this embodiment, an execution subject (for example, a terminal device shown in fig. 1) of the method for processing a video may remotely or locally acquire an image area selected by a user from a target video frame of a target video as a target image area by a wired connection manner or a wireless connection manner. The target video may be a video to be processed. In practice, the target video may be a video obtained by photographing at least one object with a camera.

In this embodiment, the target video frame may be any one of video frames except the video frame ordered at the last bit in the sequence of video frames corresponding to the target video. Specifically, as an example, the target video frame may be a video frame identified from a video frame sequence corresponding to the target video and including a target object (e.g., a human face object) in the first video frame; alternatively, the target video frame may be a video frame arranged at a predetermined position (e.g., the first bit) in the sequence of video frames. It can be understood that the video frame sequences corresponding to the target video are arranged according to the chronological order.

Specifically, the executing body may first obtain the target video, then determine the target video frame from the target video, and obtain an image area selected by the user from the target video frame by using the user terminal as the target image area. Here, if the execution body is a user terminal, the execution body may directly output and display the target video frame so that the user may select an image area from the target video frame; if the execution main body is not a user terminal, the execution main body may output the target video frame to a user terminal of a communication connection, so that a user may select an image area from the target video frame using the user terminal.

In this embodiment, the target image area (the image area selected by the user) may include the target object. The target object is a user-selected object. Specifically, as an example, the user may select an image region including a face object in the target video frame as a target image region, and then the target object is the face object; or, the user may select an image area including a cloud object in the target video frame as the target image area, and the target object is the cloud object. It should be noted that, here, the user may select any image area including the object in the target video frame as the target image area, which is not limited herein.

Step 202, based on the target video frame and the target image area, a tracking step is performed.

In the present embodiment, based on the target video frame and the target image area in step 201, the executing agent may execute the following tracking steps (step 2031 to step 2036):

step 2021, obtain candidate video frames from the target video.

In the present embodiment, the candidate video frame is a video frame adjacent to and subsequent to the target video frame. Here, the video frame positioned after the target video frame is a video frame arranged after the target video frame in the video frame sequence corresponding to the target video.

At step 2022, at least two image regions having the same size as the target image region are determined from the candidate video frames as candidate image regions.

In the present embodiment, the execution subject described above may determine at least two image regions having the same size as the target image region from the candidate video frames as candidate image regions using various methods. As an example, the execution subject may determine at least two image regions having the same size as the target image region from the candidate video frames as candidate image regions in a random manner; alternatively, the execution subject may divide the candidate video frame into a target number of image regions, such that the size of each image region in the target number of image regions is the same as the size of the target image region, and the target number of image regions includes all pixel points on the candidate video frame.

In practice, after a target image area corresponding to a target video frame is obtained, a target object in the target image area can be tracked to determine an image area including the target object in a video frame after the target video frame, so that the target object in the target video can be uniformly processed, and the processing efficiency is improved.

Step 2023, matching the at least two candidate image areas and the target image area, respectively, to obtain a matching result.

In this embodiment, based on the at least two candidate image regions obtained in step 2022, the executing entity may match the at least two candidate image regions and the target image region respectively to obtain at least two matching results. The matching result is in one-to-one correspondence with the candidate image region, and may be used to indicate a matching condition between the corresponding candidate image region and the target image region, and may include, but is not limited to, at least one of the following: characters, numbers, symbols, images. For example, the matching result may include a number "1" and a number "0", where the number "1" may be used to indicate that the candidate image region matches the target image region, and the number "0" may be used to indicate that the candidate image region does not match the target image region; alternatively, the matching result may include integers between 0-10, each of which may characterize a degree of matching, with a larger integer characterizing a higher degree of matching of the candidate image region with the target image region.

Specifically, the executing entity may match the candidate image region and the target image region by using various methods to obtain a matching result. For example, the execution subject described above may calculate the similarity of the candidate image region and the target image region as a matching result. The similarity is a numerical value used for representing the degree of similarity between the candidate image region and the target image region, and may be obtained by calculation using various methods (e.g., cosine similarity algorithm, euclidean distance algorithm, etc.). Specifically, the greater the similarity, the higher the similarity (i.e., the higher the matching degree) between the candidate image region and the target image region can be characterized.

Step 2024, determining a candidate image region matching the target image region from the at least two candidate image regions as a result image region based on the matching result.

In this embodiment, based on the matching result obtained in step 2023, the execution subject may determine, as the result image region, a candidate image region that matches the target image region from among the at least two candidate image regions.

Specifically, the execution body may determine a result image region that matches the target image region from among the at least two candidate image regions using various methods based on the matching result. For example, if the matching result is used to indicate whether the candidate image region matches the target image region, the executing entity may directly select the candidate image region whose corresponding matching result indicates that the candidate image region matches the target image region as the result image region matching the target image region; alternatively, if the matching result is used to indicate the matching degree between the candidate image region and the target image region, the executing entity may select, from the at least two candidate image regions, a candidate image region with the highest matching degree (for example, the highest similarity) with the target image region as the result image region matching with the target image region.

It should be noted that, here, the image area matches with the target image area, and may represent that the image area includes the target object in the target image area.

In some optional implementation manners of this embodiment, the executing body may match the at least two candidate image regions and the target image region respectively by the following steps to obtain matching results:

first, image features of the target image region and the determined at least two candidate image regions may be extracted, respectively. The image features may be various features of the image, for example, the features may be the color, shape, and the like of the image.

Specifically, the executing entity may adopt various methods to respectively extract the image features of the target image region and the determined at least two candidate image regions. For example, a pre-trained feature extraction model (e.g., a convolutional neural network) may be employed to extract image features for the target image region and the determined at least two candidate image regions.

In some optional implementations of the embodiment, the executing body may extract Histogram of Oriented Gradients (HOG) features of the target image region and the determined at least two candidate image regions respectively. The histogram of directional gradients feature is a descriptor for characterizing the local gradient direction and gradient intensity distribution of an image. The main idea is as follows: in the case where the specific position of the edge of the target object is unknown, the distribution of the edge directions may also represent the outline of the target object.

In practice, the step of extracting the histogram of oriented gradients generally comprises: carrying out spatial normalization processing on image colors; calculating image gradient; constructing a gradient direction histogram; normalization processing of the overlapped block histogram; and acquiring directional gradient histogram characteristics.

Optionally, the executing body may further extract Lab features of the target image region and the determined at least two candidate image regions. Here, the Lab feature is a color feature of the image. Specifically, Lab is composed of three elements, namely illumination (L) and a and b related to colors. Wherein, L represents illumination intensity (luminance), which is equivalent to brightness, the value range of L is 0-100, and when L is 50, it is equivalent to 50% black; a represents a range from red to green, and b represents a range from blue to yellow. The value ranges of a and b are both +120 to-120, wherein, the 'a-120' can represent red, and the 'a-120' can represent green; similarly, "b-120" may characterize yellow and "b-120" may characterize blue.

Secondly, for the candidate image area in the at least two determined candidate image areas, carrying out correlation operation on the image characteristics of the candidate image area and the image characteristics of the target image area to obtain the degree of correlation as a matching result.

Wherein, the degree of correlation is used for representing the degree of correlation between the candidate image area and the target image area. Specifically, the degree of correlation may be a numerical value used for characterizing the degree of correlation between the candidate image region and the target image region, and the larger the numerical value is, the more correlated the candidate image region and the target image region is.

Here, the candidate image region and the target image region may be correlated using a correlation function (e.g., a cross-correlation function).

Finally, a candidate image region related to the target image region may be determined from the at least two candidate image regions as a result image region based on the degree of correlation.

Specifically, the executing entity may determine, as the result image region, a candidate image region related to the target image region from the at least two candidate image regions by using various methods based on the correlation degree, for example, the executing entity may select, as the result image region, a candidate image region with the highest correlation degree (with the highest corresponding matching degree) from the at least two candidate image regions; alternatively, the execution subject may select, as the result image region, a candidate image region having a correlation degree greater than or equal to a preset correlation degree threshold (where a corresponding matching degree meets a preset requirement) from the at least two candidate image regions. It is to be understood that the candidate image region related to the target image region is a candidate image region matching the target image region.

Step 2025, determine if the target video includes a video frame that follows the candidate video frame.

Step 2026, in response to determining not to include, determines the target image area and the resulting image area as image areas to be processed of the target video.

In this embodiment, the image area to be processed is an image area to be processed in the target video.

In practice, after the image area to be processed is obtained, the execution main body may perform various processing on the image area to be processed, for example, adjust colors of pixel points in the image area to be processed.

In some optional implementations of this embodiment, the executing body may further continue to execute the tracking step (step 2031-2036) in response to determining that the target video includes a video frame located after the candidate video frame, using the candidate video frame acquired last time as the target video frame, and using the result image area determined last time as the target image area.

In some optional implementation manners of this embodiment, after obtaining the image area to be processed, the executing main body may execute the following steps:

first, the execution body may acquire an image to be added.

Specifically, the execution main body may obtain the image to be added from a remote location or a local location through a wired connection manner or a wireless connection manner. The image to be added is the image to be added to the image area to be processed.

Then, the execution subject may fuse the image to be added and the image area to be processed of the target video, so as to add the image to be added to the image area to be processed of the target video, and obtain a processed video.

Here, the position at which the image to be added is added to the image area to be processed may be a predetermined position (for example, the center position of the image area), or may be a position determined by recognizing the image area to be processed (for example, the position of the target object may be recognized from the image area to be processed, and the position of the target object may be determined as the position for adding the image to be added).

In some optional implementation manners of this embodiment, after the processed video is obtained, the execution main body may further send the obtained processed video to a user terminal in communication connection, and control the user terminal to play the processed video. The user terminal is a terminal used by a user. Specifically, the execution main body may send a control signal to the user terminal, so as to control the user terminal to play the processed video.

Here, since the image area to be processed in the video frame of the processed video is the image area determined based on the target image area selected by the user, processing for different objects in the video can be realized based on different selection modes of the user, and flexibility and universality of video processing are improved.

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for processing video according to the present embodiment. In the application scenario of fig. 3, the mobile phone 31 may first acquire the target video 32 obtained by shooting. The handset 31 may then output the target video frames 321 in the target video 32 to the user 33. Next, the mobile phone 31 can acquire the target image area 3211 selected by the user 33 from the target video frame 321. Then, the mobile phone 31 can perform the following tracking steps based on the target video frame 321 and the target image area 3211: acquiring a candidate video frame 322 from the target video 32, wherein the candidate video frame 322 is a video frame adjacent to the target video frame 321 and located after the target video frame 321; three (at least two) image regions having the same size as the target image region 3211 are determined from the candidate video frame 322 as candidate image regions 3221, 3222, and 3223, respectively; matching the candidate image regions 3221, 3222, 3223 and the target image region 3211 to obtain a matching result 341 (e.g., "5"), a matching result 342 (e.g., "7"), and a matching result 343 (e.g., "9"), respectively; determining a candidate image region matching the target image region 3211 (e.g., the candidate image region corresponding to the matching result with the largest value (i.e., the candidate image region 3223 corresponding to the matching result 343)) from the candidate image regions 3221, 3222, and 3223 as the result image region 35 based on the matching results 341, 342, and 343 (e.g., based on the size of the value in the matching result); determining whether the target video 32 includes a video frame that follows the candidate video frame 322; in response to determining not to include, the target image region 3211 and the resultant image region 35 are determined as the to-be-processed image region 36 of the target video 32.

The method provided by the embodiment of the disclosure can track the object in the target image area selected by the user based on the image matching mode, so that tracking of various objects in the video can be realized, the tracking universality is improved, and uniform processing for any object in the video is facilitated.

With further reference to fig. 4, a flow 400 of yet another embodiment of a method for processing video is shown. The flow 400 of the method for processing video comprises the steps of:

step 401, acquiring an image area selected by a user from a target video frame of a target video as a target image area.

In this embodiment, an execution subject (for example, a terminal device shown in fig. 1) of the method for processing a video may remotely or locally acquire an image area selected by a user from a target video frame of a target video as a target image area by a wired connection manner or a wireless connection manner. The target video may be a video to be processed. The target video frame may be any one of the video frames except the video frame ordered at the last bit in the video frame sequence corresponding to the target video. The target image area (the user-selected image area) may include a target object. The target object is a user-selected object.

Step 402, based on the target video frame and the target image area, a tracking step is performed.

In this embodiment, based on the target video frame and the target image area in step 401, the executing body may execute the following tracking steps (step 4031-step 4037):

step 4021, acquiring candidate video frames from the target video.

In the present embodiment, the candidate video frame is a video frame adjacent to and subsequent to the target video frame.

Step 4022, determining an enlarged image region for the target image region from the candidate video frames.

Here, the enlarged image area includes an image area corresponding to the target image area, and the size of the enlarged image area is larger than the size of the target image area. The image areas corresponding to the target image areas are the image areas which have the same shape and size as the target image areas respectively, and the positions of the image areas in the candidate video frames are the same as the positions of the target image areas in the target video frames.

Specifically, the executing body may first determine an image region corresponding to the target image region from the candidate video frame, and then enlarge the determined image region to obtain an enlarged image region. The magnification of the magnified image area relative to the target image area may be predetermined or may be randomly generated.

Step 4023, determining at least two image areas with the same size as the target image area from the enlarged image areas as candidate image areas.

In the present embodiment, the execution subject described above may determine at least two image regions having the same size as the target image region as candidate image regions from the enlarged image region using various methods. As an example, the execution subject may determine at least two image regions having the same size as the target image region from the enlarged image region as candidate image regions in a random manner; alternatively, the execution subject may divide the enlarged image area into a target number of image areas, so that the size of each of the target number of image areas is the same as the size of the target image area, and the target number of image areas includes all the pixel points on the enlarged image area.

Step 4024, matching the at least two candidate image areas and the target image area to obtain a matching result.

In this embodiment, based on the at least two candidate image regions obtained in step 4023, the executing entity may match the at least two candidate image regions and the target image region respectively to obtain at least two matching results. The matching result is in one-to-one correspondence with the candidate image region, and may be used to indicate a matching condition between the corresponding candidate image region and the target image region, and may include, but is not limited to, at least one of the following: characters, numbers, symbols, images.

Step 4025, determining a candidate image region matching the target image region from the at least two candidate image regions as a result image region based on the matching result.

In this embodiment, based on the matching result obtained in step 4025, the execution subject may determine a candidate image region matching the target image region from among the at least two candidate image regions as a result image region.

Step 4026, determine whether the target video includes a video frame that follows the candidate video frame.

Step 4027, in response to determining not to include, determining the target image area and the result image area as the to-be-processed image area of the target video.

Step 401, step 4021, step 4024, step 4025, step 4026, and step 4027 may be performed in a manner similar to that of step 201, step 2021, step 2023, step 2024, step 2025, and step 2026 in the foregoing embodiments, respectively, and the above description for step 201, step 2021, step 2023, step 2024, step 2025, and step 2026 also applies to step 401, step 4021, step 4024, step 4025, step 4026, and step 4027, and will not be repeated herein.

As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for processing video in the present embodiment highlights the steps of determining an enlarged image area for the target image area from the candidate video frames, and then determining at least two image areas having the same size as the target image area from the enlarged image area as candidate image areas. It can be understood that the position of the target object in two adjacent video frames does not change much, so the embodiment may determine the enlarged image region for the target image region from the candidate video frames, and further determine the result image region matching the target image region from the enlarged image region, and thus the solution described in the embodiment may reduce the matching range of the image region, improve the matching efficiency, and may reduce the consumption of processing resources for determining the result image region, thereby helping to reduce the resource consumption in the process of processing the video, and improve the efficiency of video processing.

With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for processing video, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.

As shown in fig. 5, the apparatus 500 for processing video of the present embodiment includes: a first acquisition unit 501 and a first tracking unit 502. Wherein, the first obtaining unit 501 is configured to obtain an image area selected by a user from a target video frame of a target video as a target image area; the first tracking unit 502 is configured to perform the following tracking steps based on the target video frame and the target image area: acquiring a candidate video frame from a target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows the candidate video frame; in response to determining not to include, determining the target image area and the result image area as image areas to be processed of the target video.

In this embodiment, the first obtaining unit 501 of the apparatus 500 for processing a video may obtain an image area selected by a user from a target video frame of a target video as a target image area from a remote or local place by a wired connection or a wireless connection. The target video may be a video to be processed. The target video frame may be any one of the video frames except the video frame ordered at the last bit in the video frame sequence corresponding to the target video. The target image area (the user-selected image area) may include a target object. The target object is a user-selected object.

In this embodiment, based on the target video frame and the target image area obtained by the first obtaining unit 501, the first tracking unit 502 may perform the following tracking steps: acquiring candidate video frames from a target video; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows the candidate video frame; in response to determining not to include, determining the target image area and the result image area as image areas to be processed of the target video.

In the present embodiment, the candidate video frame is a video frame adjacent to and subsequent to the target video frame. The matching result is in one-to-one correspondence with the candidate image region, and may be used to indicate a matching condition of the corresponding candidate image region and the target image region, and may include, but is not limited to, at least one of the following: characters, numbers, symbols, images. The image area to be processed is an image area to be processed in the target video.

In some optional implementations of the present embodiment, the first tracking unit 502 may be further configured to: determining an enlarged image area for the target image area from the candidate video frame, wherein the enlarged image area comprises an image area corresponding to the target image area; at least two image regions having the same size as the target image region are determined as candidate image regions from the enlarged image region.

In some optional implementations of the present embodiment, the first tracking unit 502 may be further configured to: respectively extracting image characteristics of the target image area and the determined at least two candidate image areas; for the candidate image area in the determined at least two candidate image areas, carrying out correlation operation on the image characteristics of the candidate image area and the image characteristics of the target image area to obtain correlation degree as a matching result, wherein the correlation degree is used for representing the correlation degree of the candidate image area and the target image area; based on the correlation, a candidate image region related to the target image region is determined as a result image region from among the at least two candidate image regions.

In some optional implementations of the present embodiment, the first tracking unit 502 may be further configured to: and respectively extracting directional gradient histogram characteristics of the target image area and the determined at least two candidate image areas.

In some optional implementations of this embodiment, the apparatus 500 may further include: a second tracking unit (not shown in the figure) configured to continue performing the tracking step using the candidate video frame acquired last time as the target video frame and using the result image area determined last time as the target image area in response to determining that the target video includes a video frame positioned after the candidate video frame.

In some optional implementations of this embodiment, the apparatus 500 may further include: a second acquisition unit (not shown in the figure) configured to acquire an image to be added; and an image fusion unit (not shown in the figure) configured to fuse the image to be added and the image area to be processed of the target video, so as to add the image to be added to the image area to be processed of the target video, and obtain a processed video.

In some optional implementations of this embodiment, the apparatus 500 may further include: and a video transmitting unit (not shown in the figure) configured to transmit the obtained processed video to a user terminal of the communication connection, and to control the user terminal to play the processed video.

It will be understood that the elements described in the apparatus 500 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.

The apparatus 500 provided by the above embodiment of the present disclosure may track the object in the target image area selected by the user based on the image matching manner, thereby implementing tracking of various objects in the video, improving the versatility of tracking, and facilitating implementation of unified processing for any object in the video.

Referring now to fig. 6, a schematic diagram of an electronic device (e.g., a terminal device or a server in fig. 1) 600 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring an image area selected by a user from a target video frame of a target video as a target image area; based on the target video frame and the target image area, performing the following tracking steps: acquiring a candidate video frame from a target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows the candidate video frame; in response to determining not to include, determining the target image area and the result image area as image areas to be processed of the target video.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + +, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a cell does not in some cases constitute a limitation on the cell itself, for example, the first acquisition cell may also be described as a "cell acquiring a target image area".

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims

1. A method for processing video, comprising:

acquiring an image area selected by a user from a target video frame of a target video as a target image area;

based on the target video frame and the target image area, performing the following tracking steps: acquiring a candidate video frame from the target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows a candidate video frame; in response to determining that the image area is not included, determining a target image area and a result image area as image areas to be processed of the target video, wherein determining at least two image areas with the same size as the target image area from candidate video frames as candidate image areas specifically comprises:

the candidate video frame is divided into a target number of image areas, so that the size of each image area in the target number of image areas is the same as that of the target image area, and the target number of image areas comprises all pixel points on the candidate video frame.

2. The method of claim 1, wherein the determining at least two image regions of the same size as the target image region from the candidate video frames as candidate image regions further comprises:

determining an enlarged image area for the target image area from the candidate video frame, wherein the enlarged image area comprises an image area corresponding to the target image area;

dividing the amplified image area into a target number of image areas, so that the size of each image area in the target number of image areas is the same as that of the target image area, and the target number of image areas comprises all pixel points on the amplified image area.

3. The method of claim 1, wherein the matching the at least two candidate image regions and the target image region respectively, and obtaining the matching result comprises:

respectively extracting image characteristics of the target image area and the determined at least two candidate image areas;

for the candidate image area in the determined at least two candidate image areas, carrying out correlation operation on the image characteristics of the candidate image area and the image characteristics of the target image area to obtain correlation degree as a matching result, wherein the correlation degree is used for representing the correlation degree of the candidate image area and the target image area; and

the determining, as the result image region, a candidate image region that matches the target image region from among the at least two candidate image regions based on the matching result includes:

based on the correlation, a candidate image region related to the target image region is determined as a result image region from among the at least two candidate image regions.

4. The method of claim 3, wherein the separately extracting image features of the target image region and the determined at least two candidate image regions comprises:

and respectively extracting directional gradient histogram characteristics of the target image area and the determined at least two candidate image areas.

5. The method of claim 1, wherein the method further comprises:

in response to determining that the target video includes a video frame subsequent to the candidate video frame, using the last acquired candidate video frame as the target video frame and using the last determined result image area as the target image area, continuing to perform the tracking step.

6. The method according to one of claims 1-5, wherein the method further comprises:

acquiring an image to be added;

and fusing the image to be added and the image area to be processed of the target video to add the image to be added to the image area to be processed of the target video and obtain a processed video.

7. The method of claim 6, wherein the method further comprises:

and sending the obtained processed video to a user terminal in communication connection, and controlling the user terminal to play the processed video.

8. An apparatus for processing video, comprising:

a first acquisition unit configured to acquire an image area selected by a user from a target video frame of a target video as a target image area;

a first tracking unit configured to perform, based on the target video frame and the target image area, the following tracking steps: acquiring a candidate video frame from the target video, wherein the candidate video frame is a video frame which is adjacent to the target video frame and is positioned behind the target video frame; determining at least two image areas with the same size as the target image area from the candidate video frames as candidate image areas; respectively matching at least two candidate image areas and a target image area to obtain matching results; determining a candidate image region matching the target image region from among the at least two candidate image regions as a result image region based on the matching result; determining whether the target video includes a video frame that follows a candidate video frame; in response to determining that the image area is not included, determining a target image area and a result image area as image areas to be processed of the target video, wherein determining at least two image areas with the same size as the target image area from candidate video frames as candidate image areas specifically comprises:

9. The apparatus of claim 8, wherein the first tracking unit is further configured to:

10. The apparatus of claim 8, wherein the first tracking unit is further configured to:

for the candidate image area in the determined at least two candidate image areas, carrying out correlation operation on the image characteristics of the candidate image area and the image characteristics of the target image area to obtain correlation degree as a matching result, wherein the correlation degree is used for representing the correlation degree of the candidate image area and the target image area;

11. The apparatus of claim 10, wherein the first tracking unit is further configured to:

12. The apparatus of claim 8, wherein the apparatus further comprises:

a second tracking unit configured to continue performing the tracking step using a candidate video frame acquired last time as a target video frame and a result image area determined last time as a target image area in response to determining that the target video includes a video frame located after the candidate video frame.

13. The apparatus according to one of claims 8-12, wherein the apparatus further comprises:

a second acquisition unit configured to acquire an image to be added;

and the image fusion unit is configured to fuse the image to be added and the image area to be processed of the target video, so as to add the image to be added to the image area to be processed of the target video, and obtain a processed video.

14. The apparatus of claim 13, wherein the apparatus further comprises:

a video transmitting unit configured to transmit the obtained processed video to a user terminal of a communication connection, and to control the user terminal to play the processed video.

15. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

16. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.