CN114202562A

CN114202562A - Video processing method and device, electronic equipment and storage medium

Info

Publication number: CN114202562A
Application number: CN202111483515.8A
Authority: CN
Inventors: 许通达; 高宸健; 王岩; 袁涛; 秦红伟
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2022-03-18
Also published as: WO2023103294A1

Abstract

The present disclosure relates to a video processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a first video frame and a first motion vector between the first video frame and a second video frame; acquiring first position information of a contour key point of a target object in the first video frame and a first mask image of the first video frame; obtaining a second motion vector according to the first motion vector, the first position information and the first mask image; and obtaining second position information of the contour key point of the target object in the second video frame according to the second motion vector and the first position information. According to the video processing method of the embodiments of the disclosure, an accurate contour of the target object can be obtained by performing recognition processing of the target object on the first video frame, target detection in subsequent video frames can be performed using the motion vector, and the temporal redundancy of the video frames can be exploited to improve the target detection speed.

Description

Video processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.

Background

Fast target detection for video has important applications in video processing and transmission. In video coding, the bit rate can be allocated according to the target detection result, which reduces storage cost. In video communication, content can be transmitted selectively according to the target detection result, which saves bandwidth and reduces delay.

Current fast video target detection methods fall into two categories. The first is single-frame acceleration. Methods in this category perform feature extraction frame by frame and do not exploit the temporal redundancy of video frames, so considerable room for acceleration remains.

The second is feature-domain transformation acceleration. Methods in this category detect objects in the feature information of a video frame based on motion vector information in the compressed bitstream, thereby using the temporal redundancy of video frames to accelerate detection and/or segmentation tasks.

Disclosure of Invention

The disclosure provides a video processing method and device, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a video processing method including: acquiring a first video frame in a video stream to be processed and a first motion vector between the first video frame and a second video frame, wherein the second video frame is any video frame after the first video frame; detecting a target object in the first video frame, and acquiring first position information of a contour key point of the target object in the first video frame and a first mask image of the first video frame, wherein the first mask image is an image representing the position and contour of the target object in the first video frame, and the contour key point is located on the contour; obtaining a second motion vector according to the first motion vector, the first position information and the first mask image, wherein the second motion vector is a corrected motion vector; and obtaining second position information of the contour key point of the target object in the second video frame according to the second motion vector and the first position information.

According to the video processing method of the embodiments of the disclosure, an accurate contour of the target object can be obtained by performing recognition processing of the target object on the first video frame, and target detection in subsequent video frames can be performed using the motion vector. This exploits the temporal redundancy of the video frames and increases the target detection speed: target detection need not be performed frame by frame, because the detection result of the target object in other video frames can be obtained from the sparse motion vector information between video frames, which improves detection efficiency. In addition, performing target detection with the corrected motion vector reduces the accumulated error of the motion vector and improves the accuracy and robustness of target detection.

In a possible implementation, obtaining a second motion vector according to the first motion vector, the first position information, and the first mask image includes: obtaining a component feature map according to the first motion vector, wherein the component feature map is determined by components of the first motion vector; inputting the component feature map, the first position information and the first mask image into a correction neural network to obtain a motion vector correction amount; and obtaining the second motion vector according to the motion vector correction amount and the first motion vector.

In one possible implementation, obtaining a component feature map according to the first motion vector includes: decomposing the first motion vector to obtain a first dimension component and a second dimension component; and respectively obtaining component feature maps according to the first dimension component and the second dimension component.

In this way, the corrected second motion vector can be obtained. The correction process can reduce the accumulated error, correct the positions of the contour key points and preserve the shape of the contour. Transforming the positions of the contour key points in the first video frame through the second motion vector can therefore improve the accuracy of the position information.

In one possible implementation, the method further includes: detecting a first sample video frame of a sample video stream to obtain first sample position information of a contour key point of a target object; acquiring a first sample mask image of the first sample video frame and a sample motion vector between the first sample video frame and a second sample video frame, wherein the first sample mask image is an image representing the position and contour of the target object in the first sample video frame, the contour key point is located on the contour, and the second sample video frame is any video frame after the first sample video frame; obtaining a corrected motion vector according to the sample motion vector, the first sample mask image, the first sample position information and the correction neural network; obtaining a reference motion vector according to the first sample video frame and the second sample video frame; obtaining the network loss of the correction neural network according to the corrected motion vector and the reference motion vector; and training the correction neural network according to the network loss.

In one possible implementation, obtaining a corrected motion vector according to the sample motion vector, the first sample mask image, the first sample position information, and the correction neural network includes: obtaining a sample component feature map according to the sample motion vector and a preset noise signal; inputting the sample component feature map, the first sample mask image and the first sample position information into the correction neural network to obtain a sample correction amount; and obtaining the corrected motion vector according to the sample correction amount and the sample motion vector.

In this way, adding random noise during training improves the error-correcting capability of the correction neural network, and thus its accuracy and robustness.
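The noise step above can be illustrated with a minimal sketch. The disclosure only states that a preset noise signal is used; the additive zero-mean Gaussian noise, the function name `noisy_component_maps` and the parameter `noise_std` below are assumptions made for illustration and are not taken from the disclosure.

```python
import numpy as np

def noisy_component_maps(sample_mv, noise_std=0.5, rng=None):
    """Perturb a sample motion vector field and split it into two channels.

    sample_mv: array of shape (H, W, 2) holding (dx, dy) per pixel.
    noise_std: standard deviation of the assumed Gaussian noise, in pixels.
    Returns an array of shape (2, H, W): the noisy dx map and dy map.
    """
    rng = np.random.default_rng() if rng is None else rng
    mv = np.asarray(sample_mv, dtype=np.float32)
    # Assumption: the "preset noise signal" is added to the motion vectors.
    noise = rng.normal(0.0, noise_std, size=mv.shape).astype(np.float32)
    noisy = mv + noise
    # One channel per vector component.
    return np.stack([noisy[..., 0], noisy[..., 1]], axis=0)
```

Training on perturbed inputs of this kind is one plausible way to expose the network to motion vector errors similar to those it is expected to correct.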

In one possible implementation, the method further includes: obtaining a second mask image of the second video frame according to the second position information of the contour key point of the target object in the second video frame, wherein the second mask image is an image representing the position and contour of the target object in the second video frame.

In a possible implementation, obtaining a second mask image of the second video frame according to the second position information of the contour key point of the target object in the second video frame includes: connecting the contour key points in the second video frame according to the relative relation between the contour key points in the first video frame to obtain the contour of the target object in the second video frame; and obtaining the second mask image according to the contour of the target object in the second video frame.

According to an aspect of the present disclosure, there is provided a video processing apparatus including: an acquisition module, configured to acquire a first video frame in a video stream to be processed and a first motion vector between the first video frame and a second video frame, wherein the second video frame is any video frame after the first video frame; a detection module, configured to detect a target object in the first video frame, and acquire first position information of a contour key point of the target object in the first video frame and a first mask image of the first video frame, wherein the first mask image is an image representing the position and contour of the target object in the first video frame, and the contour key point is located on the contour; a correction module, configured to obtain a second motion vector according to the first motion vector, the first position information, and the first mask image, wherein the second motion vector is a corrected motion vector; and a position obtaining module, configured to obtain second position information of the contour key point of the target object in the second video frame according to the second motion vector and the first position information.

In one possible implementation, the correction module is further configured to: obtain a component feature map according to the first motion vector, wherein the component feature map is determined by components of the first motion vector; input the component feature map, the first position information and the first mask image into a correction neural network to obtain a motion vector correction amount; and obtain the second motion vector according to the motion vector correction amount and the first motion vector.

In one possible implementation, the correction module is further configured to: decompose the first motion vector to obtain a first dimension component and a second dimension component; and obtain component feature maps according to the first dimension component and the second dimension component, respectively.

In one possible implementation, the apparatus further includes a training module, configured to: detect a first sample video frame of a sample video stream to obtain first sample position information of a contour key point of a target object; acquire a first sample mask image of the first sample video frame and a sample motion vector between the first sample video frame and a second sample video frame, wherein the first sample mask image is an image representing the position and contour of the target object in the first sample video frame, the contour key point is located on the contour, and the second sample video frame is any video frame after the first sample video frame; obtain a corrected motion vector according to the sample motion vector, the first sample mask image, the first sample position information and the correction neural network; obtain a reference motion vector according to the first sample video frame and the second sample video frame; obtain the network loss of the correction neural network according to the corrected motion vector and the reference motion vector; and train the correction neural network according to the network loss.

In one possible implementation, the training module is further configured to: obtain a sample component feature map according to the sample motion vector and a preset noise signal; input the sample component feature map, the first sample mask image and the first sample position information into the correction neural network to obtain a sample correction amount; and obtain the corrected motion vector according to the sample correction amount and the sample motion vector.

In one possible implementation, the apparatus further includes a mask obtaining module, configured to obtain a second mask image of the second video frame according to the second position information of the contour key point of the target object in the second video frame, wherein the second mask image is an image representing the position and contour of the target object in the second video frame.

In one possible implementation, the mask obtaining module is further configured to: connect the contour key points in the second video frame according to the relative relation between the contour key points in the first video frame to obtain the contour of the target object in the second video frame; and obtain the second mask image according to the contour of the target object in the second video frame.

According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a flow diagram of a video processing method according to an embodiment of the present disclosure;

Fig. 2 shows an application schematic diagram of a video processing method according to an embodiment of the present disclosure;

Fig. 3 shows a block diagram of a video processing apparatus according to an embodiment of the present disclosure;

Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present disclosure;

Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein exclusively to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B and C" may mean including any one or more elements selected from the group consisting of A, B and C.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 shows a flow diagram of a video processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the method comprises:

In step S11, acquiring a first video frame in the video stream to be processed and a first motion vector between the first video frame and a second video frame, where the second video frame is any video frame after the first video frame;

In step S12, performing recognition processing on a target object in the first video frame, and acquiring first position information of a contour key point of the target object in the first video frame and a first mask image of the first video frame, where the first mask image is an image representing the position and contour of the target object in the first video frame, and the contour key point is located on the contour;

In step S13, obtaining a second motion vector according to the first motion vector, the first position information, and the first mask image, where the second motion vector is a corrected motion vector;

In step S14, obtaining second position information of the contour key point of the target object in the second video frame according to the second motion vector and the first position information.

Target detection is widely applied in video processing. For example, the position and/or features of a detected object may be stored instead of every video frame, which reduces storage pressure; likewise, the position and/or features of the detected object may be transmitted instead of every video frame, which reduces transmission pressure. Generally, target detection can be performed on every video frame or on a sampled subset of video frames, but detection usually relies on methods such as deep learning, so detecting a large number of video frames is inefficient. In video bitstreams such as H263, H264, H265, H266, VP8, VP9, AV1 and AVS, motion vectors exist between video frames, and the position of a target object in each frame can be determined using these motion vectors. However, motion vectors are sparse, mainly represent the motion characteristics of pixel points, and carry little image content, so it is difficult to determine the contour of the target object and obtain accurate position information by performing feature-domain transformation using the motion characteristics alone.

To address the above issues, the present disclosure performs target detection on a key frame itself (e.g., a designated video frame, or a video frame containing a complete target object) to obtain an accurate contour of the target object and the position information of key points on the contour. The accurate contour and the key points are then transformed through the motion vectors, so that an accurate contour of the target object can be obtained while exploiting the temporal redundancy of the video frames and improving target detection efficiency. In addition, the motion vector can be corrected, which reduces the error of the motion vector and improves the accuracy and robustness of target detection.

In one possible implementation, in step S11, the video frames in the video stream to be processed and the motion vectors between the video frames can be obtained. In an example, motion vectors exist between video frames in video bitstreams such as H263, H264, H265, H266, VP8, VP9, AV1 and AVS, and the video stream to be processed may be any of these bitstreams. The video stream to be processed may be decoded (e.g., using an FFmpeg decoder) to obtain the video frames in the video stream to be processed and the motion vectors between them.

In an example, a first motion vector between a first video frame and a second video frame may be obtained. The second video frame is any video frame after the first video frame. For example, if the second video frame is the video frame immediately following the first video frame, the first motion vector can be obtained directly by decoding. For another example, if the second video frame is not adjacent to the first video frame, n video frames (n is a positive integer) lie between them. The decoding processing yields the motion vector between any two adjacent video frames, and the motion vectors between all adjacent video frames from the first video frame to the second video frame can then be added as vectors to obtain the motion vector between the first video frame and the second video frame. For example, the first video frame is T0, the second video frame is Tn+1, and the video frames T1, T2, …, Tn lie between them; the motion vector between T0 and T1 is Mv0, the motion vector between T1 and T2 is Mv1, …, and the motion vector between Tn and Tn+1 is Mvn, so the first motion vector is Mv0 + Mv1 + … + Mvn. The present disclosure does not limit the number of video frames between the first video frame and the second video frame.

In one possible implementation, in step S12, target detection may be performed on the key frame to determine an accurate contour of the target object. That is, a target object is detected in a first video frame of the video stream to be processed, and first position information of the contour key points of the target object in the first video frame is acquired. In an example, the first video frame may be a designated key frame, or any video frame containing a complete target object. In an example, the first video frame can be detected by a deep learning method to obtain a detection result of the target object, from which a contour line of the target object and the contour key points on the contour line are obtained. The contour key points can include points that represent the main features of the contour line, such as the points at the widest position of the contour line or the point at its highest position.
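The accumulation Mv0 + Mv1 + … + Mvn can be sketched as follows. This is a minimal illustration that assumes each decoded motion vector is available as a dense array of shape (H, W, 2); the function name `accumulate_motion_vectors` and that array layout are assumptions, not part of the disclosure.

```python
import numpy as np

def accumulate_motion_vectors(per_frame_mvs):
    """Add the motion vector fields of consecutive frame pairs.

    per_frame_mvs: sequence of arrays of shape (H, W, 2); entry k holds the
    (dx, dy) values between frame T_k and frame T_{k+1}.
    Returns the field between T_0 and T_{n+1} as Mv_0 + Mv_1 + ... + Mv_n.
    """
    fields = [np.asarray(mv, dtype=np.float32) for mv in per_frame_mvs]
    if not fields:
        raise ValueError("at least one motion vector field is required")
    # Plain element-wise vector addition, as described in the text.
    return np.sum(fields, axis=0)
```

Because every term may carry its own error, the sum can accumulate error as n grows, which is the motivation for correcting the first motion vector.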

Further, after the contour line of the target object is obtained, a first mask image of the first video frame may also be obtained, for example, the first mask image may be obtained by setting the pixel values of all the pixel points inside and on the contour line to 1 and setting the pixel values outside the contour line to 0. The first mask image may represent a position and a contour of the target object, and may also be used for constraining and correcting a shape of the contour of the target object in a subsequent process (for example, in a position transformation by a motion vector), and the like.
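As an illustration of the mask construction just described, the sketch below rasterizes a closed contour, given as ordered (x, y) points, into a 0/1 mask. It assumes the contour can be treated as a polygon, uses the even-odd rule for interior pixels, and treats pixels within half a pixel of an edge as lying on the contour; the function name `contour_to_mask` and these choices are illustrative assumptions.

```python
import numpy as np

def contour_to_mask(points, height, width):
    """Return a (height, width) uint8 mask: 1 inside or on the contour, else 0.

    points: (N, 2) sequence of (x, y) contour points in drawing order.
    """
    pts = np.asarray(points, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    xs = xs.astype(float)
    ys = ys.astype(float)
    inside = np.zeros((height, width), dtype=bool)
    on_edge = np.zeros((height, width), dtype=bool)
    # Walk each edge (p_k, p_{k+1}), closing the polygon with np.roll.
    for (x0, y0), (x1, y1) in zip(pts, np.roll(pts, -1, axis=0)):
        if y0 != y1:
            # Even-odd rule: toggle pixels whose rightward ray crosses the edge.
            crosses = (y0 > ys) != (y1 > ys)
            x_at_y = x0 + (ys - y0) * (x1 - x0) / (y1 - y0)
            inside ^= crosses & (xs < x_at_y)
        # Distance from every pixel to the edge segment.
        dx, dy = x1 - x0, y1 - y0
        seg_len2 = dx * dx + dy * dy
        if seg_len2 > 0:
            t = np.clip(((xs - x0) * dx + (ys - y0) * dy) / seg_len2, 0.0, 1.0)
        else:
            t = 0.0
        dist2 = (xs - (x0 + t * dx)) ** 2 + (ys - (y0 + t * dy)) ** 2
        on_edge |= dist2 <= 0.25
    return (inside | on_edge).astype(np.uint8)
```

For example, `contour_to_mask([(1, 1), (4, 1), (4, 4), (1, 4)], 6, 6)` marks the 4×4 block of pixels spanned by the square and leaves the border of the image at 0.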

In a possible implementation, after the first motion vector and the first position information are obtained, the positions of the contour key points in the first video frame may be transformed through the first motion vector to obtain the positions of the contour key points in the second video frame. However, the first motion vector may be obtained by adding motion vectors between a plurality of adjacent video frames; if those motion vectors contain errors, the accumulated error in the resulting first motion vector may be large, which affects the position accuracy of the contour key points in the second video frame. Therefore, to improve the position accuracy and the robustness of the position transformation, the first motion vector may be corrected.

In one possible implementation, as described above, the first mask image may be used to constrain and correct the shape of the contour of the target object. For example, if a contour key point drifts off the contour line during the position transformation, the first mask image may be used to correct that key point so that it stays on the contour line and the contour shape is preserved.

In one possible implementation, in step S13, the first motion vector may be corrected using the first position information (i.e., the position information of the contour key points in the first video frame) and the first mask image to obtain the second motion vector. As described above, the first mask image may constrain and correct the shape of the contour. The first position information may also serve to preserve the shape of the contour line and correct the positions of the contour key points. For example, the relative positional relationship between the contour key points may be determined from the first position information and maintained during the position transformation to preserve the shape of the contour line. If an error in the motion vector displaces one or more key points and thereby changes the relative positional relationship, the relative positional relationship of the contour key points in the second video frame can be corrected based on the relationship determined from the first position information, so that the relationship between the contour key points remains stable. Further, the first motion vector can be corrected based on the relative positional relationship, which reduces the error of the first motion vector and, in turn, the error in the relative positional relationship between the contour key points. The present disclosure does not limit the role of the first position information.

In one possible implementation, the first motion vector may be corrected by a correction neural network to obtain the second motion vector. Step S13 may include: obtaining a component feature map according to the first motion vector, wherein the component feature map is determined by components of the first motion vector; inputting the component feature map, the first position information and the first mask image into the correction neural network to obtain a motion vector correction amount; and obtaining the second motion vector according to the motion vector correction amount and the first motion vector.

In one possible implementation, the correction neural network may be a deep learning neural network, such as a convolutional neural network; the present disclosure does not limit the type of the correction neural network. Its input may take the form of an image or a feature map, so a component feature map can be obtained from the first motion vector; that is, the components of the motion vector are represented as feature maps and used as input to the correction neural network.
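The notion of a relative positional relationship between contour key points can be made concrete with a small diagnostic sketch. In the disclosure this constraint is handled by the correction step; the pairwise-offset representation and the helper names `relative_offsets` and `shape_drift` below are assumptions introduced purely for illustration.

```python
import numpy as np

def relative_offsets(keypoints):
    """Pairwise offsets between key points.

    keypoints: (N, 2) array of (x, y). Entry [i, j] of the (N, N, 2)
    result is keypoints[j] - keypoints[i].
    """
    kp = np.asarray(keypoints, dtype=np.float32)
    return kp[None, :, :] - kp[:, None, :]

def shape_drift(first_kp, second_kp):
    """Largest change in any pairwise offset between the two frames.

    A pure translation of the whole contour gives 0; a key point that
    moves differently from the others gives a positive value.
    """
    diff = relative_offsets(second_kp) - relative_offsets(first_kp)
    return float(np.abs(diff).max())
```

A value of 0 means the key points kept their relative layout, while a large value flags a key point whose motion vector is likely erroneous. Note that a genuinely deforming object also changes these offsets, so this measure is only a rough indicator.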

In a possible implementation manner, the first motion vector may represent a motion vector of each pixel in the first video frame, for example, coordinates of a certain pixel (which may be a contour key point, or any other pixel) in the first video frame are (x, y), and if the vector corresponding to the pixel is (Δ x, Δ y) as known from the first motion vector, the position of the pixel in the second video frame is (x + Δx, y + Δy). That is, the vector of each pixel point in the first motion vector also includes two dimensions, and thus, the first motion vector can be decomposed into two dimensions, i.e., a first dimension component and a second dimension component. For example, the first dimension component may represent a component of each pixel point in a first dimension direction, e.g., a component in an x direction, i.e., a value of Δ x in a vector corresponding to each pixel point, and similarly, the second dimension component may represent a component of each pixel point in a second dimension direction, e.g., a component in a y direction, i.e., a value of Δ y in a vector corresponding to each pixel point.
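The mapping from (x, y) to (x + Δx, y + Δy) can be sketched for a set of contour key points as follows. The sketch assumes a dense motion field stored as an array of shape (H, W, 2) and integer key point coordinates; the function name `propagate_keypoints` and that layout are illustrative assumptions.

```python
import numpy as np

def propagate_keypoints(keypoints, motion_field):
    """Move key points from the first frame to the second frame.

    keypoints: (N, 2) array of integer (x, y) pixel coordinates.
    motion_field: (H, W, 2) array holding (dx, dy) for every pixel.
    Returns an (N, 2) float array of (x + dx, y + dy).
    """
    kp = np.asarray(keypoints)
    xs = kp[:, 0].astype(int)
    ys = kp[:, 1].astype(int)
    # Rows index y and columns index x, so look up motion_field[y, x].
    offsets = np.asarray(motion_field, dtype=np.float32)[ys, xs]
    return kp.astype(np.float32) + offsets
```

The same function applies unchanged whether the field passed in is the first motion vector or the corrected second motion vector.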

In one possible implementation, the component feature map of the first dimension may be obtained according to the first dimension component of the vector corresponding to each pixel point. For example, a value of a first-dimension component of a vector corresponding to each pixel point in the first video frame is a pixel value of a corresponding pixel point of the first-dimension component feature map, for example, a value Δ x in the vector corresponding to each pixel point is a pixel value of a corresponding pixel point of the first-dimension component feature map. Similarly, the value of the second-dimensional component of the vector corresponding to each pixel point in the first video frame is the pixel value of the corresponding pixel point of the second-dimensional component feature map, for example, the value of Δ y in the vector corresponding to each pixel point is the pixel value of the corresponding pixel point of the second-dimensional component feature map. The component feature maps of the first dimension and the component feature maps of the second dimension are component feature maps of two channels, for example, feature maps of two input channels of an input modified neural network.
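The decomposition into two component feature maps amounts to splitting the last axis of the motion field. A minimal sketch, assuming the same (H, W, 2) layout and a channels-first output (both assumptions, as is the name `component_feature_maps`):

```python
import numpy as np

def component_feature_maps(motion_field):
    """Split a motion field into a two-channel component feature map.

    motion_field: (H, W, 2) array of (dx, dy) per pixel.
    Returns a (2, H, W) array: channel 0 is the dx map (first dimension
    component), channel 1 is the dy map (second dimension component).
    """
    mv = np.asarray(motion_field, dtype=np.float32)
    dx_map = mv[..., 0]  # pixel value = dx of the corresponding pixel
    dy_map = mv[..., 1]  # pixel value = dy of the corresponding pixel
    return np.stack([dx_map, dy_map], axis=0)
```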

In a possible implementation manner, the component feature maps of the two channels, the first position information and the first mask image are input into a modified neural network, and the processing of maintaining the contour shape, maintaining the relative position relationship of the contour key points, correcting the positions of the contour key points, and modifying the first motion vector can be performed through the modified neural network, so as to obtain a motion vector modification amount, that is, a modification parameter for modifying the first motion vector.

In a possible implementation manner, the first motion vector may be modified by a modification parameter to obtain a second motion vector. In an example, the correction parameter may be in the form of a vector, for example, when the correction amount of a certain pixel is (xt, yt), and the vector (Δ x, Δ y) of the corresponding pixel in the first motion vector is corrected, vector addition may be performed to obtain a corrected vector (Δ x + xt, Δ y + yt) of the pixel. After the vector of each pixel point is corrected, a second motion vector can be obtained. The motion vector correction amount can also be in a matrix form, and when the correction is carried out, the motion vector correction amount can be multiplied by the vector of the corresponding pixel point of the first motion vector to obtain a second motion vector. The present disclosure does not limit the specific form and modification manner of the motion vector correction amount.
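Both forms of correction can be sketched briefly. The additive form follows the text directly; for the matrix form, the sketch assumes one 2×2 matrix per pixel applied to that pixel's vector, which is only one possible reading since the disclosure does not fix the shape of the matrix. The function names are hypothetical.

```python
import numpy as np

def apply_additive_correction(first_mv, correction):
    """Vector form: (dx, dy) + (xt, yt) for every pixel.

    first_mv, correction: arrays of shape (H, W, 2).
    """
    return np.asarray(first_mv, dtype=np.float32) + np.asarray(
        correction, dtype=np.float32
    )

def apply_matrix_correction(first_mv, matrices):
    """Matrix form (assumed layout): one 2x2 matrix per pixel.

    first_mv: (H, W, 2); matrices: (H, W, 2, 2).
    Returns the per-pixel product M @ v with shape (H, W, 2).
    """
    mv = np.asarray(first_mv, dtype=np.float32)
    mats = np.asarray(matrices, dtype=np.float32)
    return np.einsum("hwij,hwj->hwi", mats, mv)
```

With identity matrices the matrix form leaves the first motion vector unchanged, just as a zero correction does in the additive form.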

In a possible implementation manner, the modified neural network may be trained before it is used for correction. For example, multiple video segments may be selected from video samples as sample video streams, information such as key frames and motion vectors may be obtained from a sample video stream, and the modified neural network may be trained using this information. The method further includes: detecting a first sample video frame of a sample video stream to obtain first sample position information of a contour key point of a target object, wherein the first sample video frame is any video frame of the sample video stream; acquiring a first sample mask image of the first sample video frame and a sample motion vector between the first sample video frame and a second sample video frame, wherein the first sample mask image is an image representing the position and contour of the target object in the first sample video frame, the contour key point is located on the contour, and the second sample video frame is any video frame after the first sample video frame; obtaining a modified motion vector according to the sample motion vector, the first sample mask image, the first sample position information and the modified neural network; obtaining a reference motion vector according to the first sample video frame and the second sample video frame; obtaining the network loss of the modified neural network according to the modified motion vector and the reference motion vector; and training the modified neural network according to the network loss.

In a possible implementation manner, the sample video stream may be decoded to obtain video frames in the sample video stream, where the first sample video frame is any video frame in the sample video stream that includes the complete target object, for example, the first sample video frame is a key frame in the sample video stream. The first sample video frame can be detected through a deep learning method, the contour of the target object is obtained, and the first sample position information of the key point of the contour is determined. The pixel values of all the pixel points in and on the contour line can also be set to 1, and the pixel values outside the contour line can be set to 0, so that the first sample mask image is obtained. Further, in the above-described decoding process, a sample motion vector between the first sample video frame and the second sample video frame (any video frame different from the first sample video frame in the sample video stream) may also be obtained.

In one possible implementation, in order to improve the ability of the modified neural network to correct errors in the motion vector, a random error may be artificially added to the sample motion vector, and the modified neural network may be used to correct the noisy sample motion vector, thereby improving its correction ability. The random noise may be uniformly distributed noise, for example, noise uniformly distributed within [-16, 16]; the present disclosure does not limit the type and range of the random noise. Further, the noise may be added directly to the sample motion vector, or may be added to the sample component feature map of the sample motion vector.

In an example, a noise signal may be added to the sample motion vector. For example, the sample motion vector includes the two-dimensional components of a plurality of pixel points; noise may be randomly added to the vectors of some or all of the pixel points by adding a random value to each of the two components. A sample component feature map with noise is then obtained from the noisy sample motion vector in the same manner as the component feature map described above, which is not repeated here.

In an example, the sample component feature map of the sample motion vector may also be obtained first, and then the noise signal may be added to the sample component feature map, for example, random noise may be added to the vector components of some or all of the pixel points of the sample component feature map of the first dimension, and random noise may be added to the vector components of some or all of the pixel points of the sample component feature map of the second dimension, so as to obtain the sample component feature map with noise. The present disclosure does not limit the order of addition of the noise.
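Adding noise to a sample component feature map might look like the sketch below. It assumes uniform noise in [-16, 16], as in the example above; the function name `add_uniform_noise` and the `prob` parameter (the fraction of pixel points that receive noise) are hypothetical.

```python
import random
from typing import List, Optional

ComponentMap = List[List[float]]


def add_uniform_noise(component_map: ComponentMap,
                      low: float = -16.0,
                      high: float = 16.0,
                      prob: float = 1.0,
                      rng: Optional[random.Random] = None) -> ComponentMap:
    """Return a copy of the map with uniform noise added to some or all pixel points."""
    if rng is None:
        rng = random.Random()
    noisy = []
    for row in component_map:
        new_row = []
        for value in row:
            # Each pixel point receives noise with probability `prob`.
            if rng.random() < prob:
                value = value + rng.uniform(low, high)
            new_row.append(value)
        noisy.append(new_row)
    return noisy


# Hypothetical 2 x 2 sample component feature map (x direction).
sample_dx_map = [[0.0, 1.0], [2.0, 3.0]]
noisy_dx_map = add_uniform_noise(sample_dx_map, rng=random.Random(0))
```

The same function can be applied to the component feature map of the second dimension. Passing a seeded `random.Random` makes the augmentation reproducible.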

In one possible implementation manner, the sample component feature map with noise, the first sample mask image and the first sample position information may be input into a modified neural network, the modified neural network performs processing of maintaining the shape of the contour, maintaining the relative position relationship of the contour key points, correcting the positions of the contour key points, and modifying the sample motion vector, so as to obtain a sample modification amount, and the sample motion vector is modified based on the sample modification amount, so as to obtain a modified motion vector. The correction method is the same as the method for correcting the first motion vector by the motion vector correction amount, and is not described herein again.

In one possible implementation, because the modified neural network is subject to error, the sample correction amount contains an error, so that the modified motion vector also contains an error. The error may be determined from the modified motion vector and the true motion vector between the first sample video frame and the second sample video frame. For example, a reference motion vector may be determined based on the first sample video frame and the second sample video frame. The reference motion vector is the motion vector by which a pixel point moves from its position in the first sample video frame to its position in the second sample video frame; it is an error-free motion vector and may be used as a reference for determining the error of the modified motion vector.

In one possible implementation, the network loss of the modified neural network may be determined based on the error between the modified motion vector and the reference motion vector. For example, the error of the output of the modified neural network, that is, the error of the sample correction amount, may be determined based on the error between the two vectors, and the network loss of the modified neural network may be determined based on the error of the sample correction amount.
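One way to turn the error between the two motion vectors into a scalar loss is sketched below. The disclosure does not name a specific loss function; the mean squared error used here, and the function name `motion_vector_loss`, are assumptions made for illustration.

```python
from typing import List, Tuple

Vector = Tuple[float, float]
VectorField = List[List[Vector]]


def motion_vector_loss(modified_mv: VectorField, reference_mv: VectorField) -> float:
    """Mean squared error between two motion vector fields (an assumed loss)."""
    total = 0.0
    count = 0
    for row_m, row_r in zip(modified_mv, reference_mv):
        for (mx, my), (rx, ry) in zip(row_m, row_r):
            # Squared length of the per-pixel error vector.
            total += (mx - rx) ** 2 + (my - ry) ** 2
            count += 1
    return total / count


# Hypothetical 1 x 2 fields, for illustration only.
modified = [[(1.0, 1.0), (0.0, 0.0)]]
reference = [[(0.0, 1.0), (0.0, 2.0)]]
loss = motion_vector_loss(modified, reference)
```

For these values the per-pixel squared errors are 1 and 4, so the loss is 2.5.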

In one possible implementation, the modified neural network may be trained based on the network loss, that is, the network parameters of the modified neural network are adjusted in a direction that reduces the network loss. The training process may be executed iteratively until the number of training iterations reaches a preset number, or until the network loss converges within a preset interval or falls below a preset threshold, at which point training is complete and the trained modified neural network is obtained.
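The iterative procedure and its stopping conditions can be sketched with a deliberately tiny stand-in model. The "network" below is only a single learnable offset (bx, by) added to every noisy vector, trained by gradient descent on a mean squared error; it is not the architecture of the disclosure, and `train_offset`, the learning rate and the thresholds are all hypothetical.

```python
from typing import List, Tuple

Vector = Tuple[float, float]


def train_offset(samples: List[Tuple[Vector, Vector]],
                 lr: float = 0.1,
                 max_iters: int = 1000,
                 loss_threshold: float = 1e-6) -> Tuple[Vector, float, int]:
    """Fit a constant correction (bx, by) so that noisy + b approximates the reference."""
    bx, by = 0.0, 0.0
    loss = float("inf")
    steps = 0
    n = len(samples)
    for steps in range(1, max_iters + 1):
        # Forward pass: error of the corrected vector against the reference.
        ex = [noisy[0] + bx - ref[0] for noisy, ref in samples]
        ey = [noisy[1] + by - ref[1] for noisy, ref in samples]
        loss = sum(a * a + b * b for a, b in zip(ex, ey)) / n
        # Stop once the loss falls below the preset threshold.
        if loss < loss_threshold:
            break
        # Adjust the parameters in the direction that reduces the loss.
        bx -= lr * 2.0 * sum(ex) / n
        by -= lr * 2.0 * sum(ey) / n
    return (bx, by), loss, steps


# Hypothetical samples: each noisy vector is the reference shifted by (3, -2).
references = [(1.0, 1.0), (2.0, 0.0), (0.0, 5.0)]
samples = [((rx + 3.0, ry - 2.0), (rx, ry)) for rx, ry in references]
offset, final_loss, steps_used = train_offset(samples)
```

With this data the learned offset approaches (-3, 2), which cancels the injected shift. The loop ends either when the loss drops below `loss_threshold` or when `max_iters` is reached, mirroring the two stopping conditions described above.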

In a possible implementation manner, the trained modified neural network can be used to correct the first motion vector, so as to obtain a second motion vector with higher precision. In step S14, the first position information of the contour key points in the first video frame may be transformed based on the second motion vector to obtain the second position information of the contour key points in the second video frame. The transformation in step S14 may be implemented by a related method, for example, a vector operation.
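A vector operation of the kind mentioned for step S14 can be sketched as adding, to each contour key point, the corrected vector stored at that key point's pixel position. The layout `mv_field[y][x] = (dx, dy)`, the integer key point coordinates, and the function name `propagate_keypoints` are assumptions for illustration.

```python
from typing import List, Tuple

Vector = Tuple[float, float]
Point = Tuple[int, int]


def propagate_keypoints(keypoints: List[Point],
                        mv_field: List[List[Vector]]) -> List[Tuple[float, float]]:
    """Move each (x, y) key point by the motion vector stored at its pixel position."""
    moved = []
    for x, y in keypoints:
        dx, dy = mv_field[y][x]  # rows are indexed by y, columns by x
        moved.append((x + dx, y + dy))
    return moved


# Hypothetical 2 x 3 second motion vector and two contour key points.
second_mv = [[(1.0, 2.0), (0.0, 0.0), (0.0, 0.0)],
             [(0.0, 0.0), (0.0, 0.0), (-1.0, 0.5)]]
first_positions = [(0, 0), (2, 1)]
second_positions = propagate_keypoints(first_positions, second_mv)
```

The order of the returned list matches the order of the input key points, so any sequential relationship between key points is preserved.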

In a possible implementation manner, a second mask image of the second video frame may further be obtained based on the second position information. The method further includes: obtaining a second mask image of the second video frame according to the second position information of the contour key point of the target object in the second video frame, wherein the second mask image is an image representing the position and contour of the target object in the second video frame.

In one possible implementation, as described above, the shape of the contour of the target object may be maintained during the position transformation by the second motion vector, and thus, the shape and position of the contour of the target object may still be represented by the contour in the second video frame. In an example, the second mask image may be obtained by setting the pixel values of the pixels inside and on the contour line to 1 and the pixel values of the pixels outside the contour line to 0.

In one possible implementation, to maintain the shape of the contour, the key points of the contour in the first video frame may have a certain relative relationship, for example, a sequential relationship. The contour in the first video frame may be obtained by connecting the key points in their order. Similarly, after the second position information of the contour key points in the second video frame is obtained, the contour in the second video frame can still be obtained based on the relative relationship, so as to obtain the second mask image. Obtaining the second mask image of the second video frame according to the second position information of the contour key point of the target object in the second video frame includes: connecting the contour key points in the second video frame according to the relative relationship between the contour key points in the first video frame to obtain the contour of the target object in the second video frame; and obtaining the second mask image according to the contour of the target object in the second video frame.

In one possible implementation, as described above, the contour key points in the first video frame have a certain relative relationship, for example, a sequential relationship. In the second video frame, this relative relationship between the key points may be maintained, and the key points may be connected according to the sequential relationship, so that the contour of the target object in the second video frame is obtained while the shape of the target object is maintained. The relative relationship may include not only a sequential relationship but also a relative positional relationship, a connection relationship, and the like. Taking the connection relationship as an example, if, in the first video frame, contour key point A is connected to contour key point B and is not connected to contour key point C, this connection relationship may be maintained in the second video frame and the key points may be connected accordingly, so that the contour of the target object in the second video frame is obtained while the shape of the target object is maintained.
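Reusing the sequential relationship from the first video frame can be sketched as follows. The key point identifiers, the dictionary layout and the function name `connect_in_order` are hypothetical; the sketch assumes a closed contour in which the last key point connects back to the first.

```python
from typing import Dict, List, Tuple

PointF = Tuple[float, float]
Segment = Tuple[PointF, PointF]


def connect_in_order(order: List[str], positions: Dict[str, PointF]) -> List[Segment]:
    """Connect key points in the given order, closing the contour at the end."""
    n = len(order)
    return [(positions[order[i]], positions[order[(i + 1) % n]]) for i in range(n)]


# Order established in the first video frame (hypothetical identifiers).
keypoint_order = ["A", "B", "C"]
# Positions of the same key points in the second video frame (hypothetical values).
second_frame_positions = {"A": (1.0, 1.0), "B": (4.0, 1.5), "C": (2.0, 5.0)}
second_contour = connect_in_order(keypoint_order, second_frame_positions)
```

Only the positions change between frames; the order list is reused unchanged, which is what keeps the contour topology the same.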

In a possible implementation manner, after obtaining the contour of the target object in the second video frame, the pixels inside and outside the contour may be processed differently, for example, the second mask image may be obtained by setting the pixel value of the pixel inside the contour to 1 and the pixel value of the pixel outside the contour to 0.
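Filling a mask from a closed contour can be sketched with a plain even-odd (ray casting) test evaluated at each integer pixel coordinate. This sketch treats pixel points lying on the contour line as part of the object, which is one possible convention; the function names and the polygon representation are assumptions, not the method of the disclosure.

```python
from typing import List, Tuple

PointF = Tuple[float, float]


def on_segment(px: float, py: float, a: PointF, b: PointF, eps: float = 1e-9) -> bool:
    """True if (px, py) lies on the segment from a to b."""
    cross = (b[0] - a[0]) * (py - a[1]) - (b[1] - a[1]) * (px - a[0])
    if abs(cross) > eps:
        return False
    return (min(a[0], b[0]) - eps <= px <= max(a[0], b[0]) + eps and
            min(a[1], b[1]) - eps <= py <= max(a[1], b[1]) + eps)


def inside_or_on(px: float, py: float, polygon: List[PointF]) -> bool:
    """Even-odd rule with an explicit check for points on the contour line."""
    inside = False
    n = len(polygon)
    for i in range(n):
        a, b = polygon[i], polygon[(i + 1) % n]
        if on_segment(px, py, a, b):
            return True
        # Does the edge straddle the horizontal line through the point?
        if (a[1] > py) != (b[1] > py):
            x_cross = a[0] + (py - a[1]) * (b[0] - a[0]) / (b[1] - a[1])
            if px < x_cross:
                inside = not inside
    return inside


def contour_to_mask(polygon: List[PointF], height: int, width: int) -> List[List[int]]:
    """Set pixel points inside or on the contour to 1 and all others to 0."""
    return [[1 if inside_or_on(x, y, polygon) else 0 for x in range(width)]
            for y in range(height)]


# Hypothetical square contour, key points listed in contour order.
square = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
mask = contour_to_mask(square, height=5, width=5)
```

For the square above, the 3 x 3 block of pixel points with coordinates from 1 to 3 is set to 1 and the one-pixel border is 0.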

According to the video processing method of the embodiments of the present disclosure, an accurate contour of the target object can be obtained by performing recognition processing of the target object on the first video frame, and target detection in subsequent video frames can be performed by using motion vectors. This exploits the temporal redundancy of the video frames to increase the target detection speed: target detection does not need to be performed frame by frame; instead, the detection result of the target object in other video frames can be obtained from the sparse motion vector information between video frames, which improves detection efficiency. In addition, the motion vector can be corrected by the modified neural network, which reduces accumulated error, corrects the positions of the contour key points, and maintains the shape of the contour. In the process of training the modified neural network, random noise can be added to improve the error correction capability, accuracy and robustness of the modified neural network.

Fig. 2 is a schematic diagram illustrating an application of a video processing method according to an embodiment of the present disclosure, and as shown in fig. 2, a sample video stream may be decoded to obtain a sample motion vector between a key frame and a T-th non-key frame, and a key frame in the sample video stream may be subjected to target detection to obtain first sample position information of a contour key point. The first sample mask image may also be obtained based on the contour of the target object in the keyframe.

In one possible implementation manner, the sample motion vector may be decomposed to obtain a sample component feature map in the x direction and a sample component feature map in the y direction. Uniformly distributed noise in the range of [-16, 16] may be added to the two sample component feature maps to obtain sample component feature maps with noise.

In one possible implementation, the sample component feature maps with noise, the first sample position information and the first sample mask image may be input to the modified neural network for training to obtain a sample correction amount. The sample motion vector may be corrected with the sample correction amount to obtain a modified motion vector. The network loss of the modified neural network may be determined based on the error between the modified motion vector and the true motion vector between the key frame and the T-th non-key frame, and the modified neural network may be trained in a direction that reduces the network loss.

In one possible implementation, the trained modified neural network may be used to determine the contour of the target object in any video frame in the video stream. First, the video stream is decoded to obtain a motion vector between a key frame and any video frame, the position information of the contour key points of the target object in the key frame is obtained, and a mask image of the key frame is obtained. Further, the motion vector can be decomposed into component feature maps of the x and y channels, and the component feature maps, the position information of the contour key points and the mask image are input into the modified neural network to obtain a motion vector correction amount. The motion vector is corrected with the motion vector correction amount to obtain a corrected motion vector, and the position information of the contour key points in the key frame is transformed based on the corrected motion vector to obtain the positions of the contour key points of the target object in that video frame.
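The inference flow described above can be sketched end to end as below. The trained network is represented by a caller-supplied function; `zero_correction` is a placeholder that returns no correction and stands in for a real model, and all names, data layouts and values are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = Tuple[float, float]
Point = Tuple[int, int]
Map = List[List[float]]
CorrectionNet = Callable[[Map, Map, List[Point], List[List[int]]], Tuple[Map, Map]]


def zero_correction(dx_map: Map, dy_map: Map,
                    keypoints: List[Point], mask: List[List[int]]) -> Tuple[Map, Map]:
    """Placeholder for a trained network: returns a zero correction amount."""
    h, w = len(dx_map), len(dx_map[0])
    return [[0.0] * w for _ in range(h)], [[0.0] * w for _ in range(h)]


def track_keypoints(mv_field: List[List[Vector]],
                    keypoints: List[Point],
                    mask: List[List[int]],
                    correction_net: CorrectionNet = zero_correction) -> List[Tuple[float, float]]:
    """Decompose, correct, then move the key frame's contour key points."""
    # 1. Decompose the motion vector into x and y component feature maps.
    dx_map = [[v[0] for v in row] for row in mv_field]
    dy_map = [[v[1] for v in row] for row in mv_field]
    # 2. Obtain the motion vector correction amount from the network.
    corr_dx, corr_dy = correction_net(dx_map, dy_map, keypoints, mask)
    # 3. Apply the additive correction and transform each key point.
    tracked = []
    for x, y in keypoints:
        dx = dx_map[y][x] + corr_dx[y][x]
        dy = dy_map[y][x] + corr_dy[y][x]
        tracked.append((x + dx, y + dy))
    return tracked


# Hypothetical 2 x 2 example: every pixel point moves one pixel to the right.
mv_field = [[(1.0, 0.0), (1.0, 0.0)], [(1.0, 0.0), (1.0, 0.0)]]
key_mask = [[1, 0], [0, 1]]
key_points = [(0, 0), (1, 1)]
tracked_points = track_keypoints(mv_field, key_points, key_mask)
```

Passing the network as a function keeps the sketch independent of any particular deep learning framework; a real model would replace `zero_correction`.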

In a possible implementation manner, the video processing method can be used for rapidly detecting the target in the video, and only the contour of the target object in the key frame needs to be detected, so that the position of the target object in any video frame can be rapidly obtained through the modified motion vector, and the accuracy and efficiency of target detection are improved. The method can be used for detecting the target in the fields of monitoring, live broadcasting and the like, and can also be used for detecting and tracking the target in any video. The present disclosure does not limit the application field of the video processing method.

Fig. 3 shows a block diagram of a video processing apparatus according to an embodiment of the present disclosure, which, as shown in fig. 3, includes: an obtaining module 11, configured to obtain a first video frame in a video stream to be processed and a first motion vector between the first video frame and a second video frame, where the second video frame is any video frame after the first video frame; a detection module 12, configured to perform detection processing on a target object in the first video frame, and acquire first position information of a contour key point of the target object in the first video frame and a first mask image of the first video frame, where the first mask image is an image representing a position and a contour of the target object in the first video frame, and the contour key point is located on the contour; a correction module 13, configured to obtain a second motion vector according to the first motion vector, the first position information, and the first mask image, where the second motion vector is a corrected motion vector; and a position obtaining module 14, configured to obtain second position information of the contour key point of the target object in a second video frame according to the second motion vector and the first position information.

It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic; for brevity, the details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.

In addition, the present disclosure also provides a video processing apparatus, an electronic device, a computer-readable storage medium, and a program, which can be used to implement any video processing method provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

The embodiments of the present disclosure also provide a computer program product including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the video processing method provided in any of the above embodiments.

The embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed cause a computer to perform the operations of the video processing method provided in any of the above embodiments.

The electronic device may be provided as a terminal, server, or other form of device.

Fig. 4 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or a similar terminal.

Referring to fig. 4, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense an edge of a touch or slide action, but also detect a duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 may also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.

Fig. 5 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 5, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by memory 1932, for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.

The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may be embodied in hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A video processing method, comprising:

acquiring a first video frame in a video stream to be processed and a first motion vector between the first video frame and a second video frame, wherein the second video frame is any video frame after the first video frame;

detecting a target object in the first video frame, and acquiring first position information of a contour key point of the target object in the first video frame and a first mask image of the first video frame, wherein the first mask image is an image representing the position and contour of the target object in the first video frame, and the contour key point is located on the contour;

obtaining a second motion vector according to the first motion vector, the first position information and the first mask image, wherein the second motion vector is a corrected motion vector;

and obtaining second position information of the contour key point of the target object in a second video frame according to the second motion vector and the first position information.
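
As an illustration only, the final step of claim 1 can be sketched as below. The sketch assumes that the corrected (second) motion vector is available as a dense per-pixel field of `(dx, dy)` displacements pointing from the first frame to the second frame, and that each contour key point is moved by the vector stored at its own pixel. The names `propagate_keypoints`, `Point` and `Vector` are hypothetical and do not appear in the disclosure.

```python
from typing import List, Tuple

Point = Tuple[int, int]        # (x, y) pixel coordinates in the first frame
Vector = Tuple[float, float]   # (dx, dy) displacement toward the second frame


def propagate_keypoints(
    keypoints: List[Point],
    motion_field: List[List[Vector]],
) -> List[Tuple[float, float]]:
    """Move each contour key point by the motion vector stored at its pixel."""
    moved = []
    for x, y in keypoints:
        dx, dy = motion_field[y][x]   # the field is indexed [row][column]
        moved.append((x + dx, y + dy))
    return moved


# Toy 3x3 field in which every pixel moves 1 px right and 2 px down.
field = [[(1.0, 2.0) for _ in range(3)] for _ in range(3)]
first_positions = [(0, 0), (2, 1)]
second_positions = propagate_keypoints(first_positions, field)
```

Only the key points are moved here; no detector is run on the second frame, which is where the speed-up described in the abstract would come from.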

2. The method of claim 1, wherein obtaining a second motion vector based on the first motion vector, the first position information, and the first mask image comprises:

obtaining a component feature map according to the first motion vector, wherein the component feature map is determined by components of the first motion vector;

inputting the component feature map, the first position information and the first mask image into a correction neural network to obtain a motion vector correction amount;

and obtaining the second motion vector according to the motion vector correction amount and the first motion vector.
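
A minimal sketch of the last step of claim 2 follows, under the assumption that the correction amount is a per-pixel offset that is added element-wise to the first motion vector. The claim only states that the second motion vector is obtained "according to" both quantities, so the additive form and the name `apply_correction` are assumptions made for illustration.

```python
from typing import List, Tuple

Vector = Tuple[float, float]
Field = List[List[Vector]]


def apply_correction(first_mv: Field, correction: Field) -> Field:
    """Add a per-pixel correction to the first motion vector field (assumed additive)."""
    return [
        [(vx + cx, vy + cy) for (vx, vy), (cx, cy) in zip(row_v, row_c)]
        for row_v, row_c in zip(first_mv, correction)
    ]


first_mv = [[(1.0, 2.0), (0.0, 0.0)]]
correction = [[(0.5, -1.0), (0.25, 0.0)]]   # stand-in for the network output
second_mv = apply_correction(first_mv, correction)
```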

3. The method of claim 2, wherein obtaining a component feature map from the first motion vector comprises:

decomposing the first motion vector to obtain a first dimension component and a second dimension component;

and obtaining a respective component feature map from each of the first dimension component and the second dimension component.
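
The decomposition in claim 3 can be pictured with the following sketch, which assumes that the first motion vector is a dense field of `(dx, dy)` pairs and that the two dimension components are the horizontal and vertical displacements. The function name `split_components` is hypothetical.

```python
from typing import List, Tuple

Vector = Tuple[float, float]


def split_components(
    motion_field: List[List[Vector]],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split a (dx, dy) field into one single-channel map per dimension."""
    x_map = [[v[0] for v in row] for row in motion_field]
    y_map = [[v[1] for v in row] for row in motion_field]
    return x_map, y_map


field = [[(1.0, -2.0), (0.5, 0.0)],
         [(0.0, 3.0), (-1.0, 1.0)]]
x_map, y_map = split_components(field)
```

Each resulting map has the same height and width as the motion vector field, so it can be stacked with a mask image of matching size as a network input.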

4. The method of claim 2, further comprising:

detecting a first sample video frame of a sample video stream to obtain first sample position information of a contour key point of a target object;

acquiring a first sample mask image of the first sample video frame and a sample motion vector between the first sample video frame and a second sample video frame, wherein the first sample mask image is an image representing the position and contour of the target object in the first sample video frame, the contour key point is located on the contour, and the second sample video frame is any video frame after the first sample video frame;

obtaining a corrected motion vector according to the sample motion vector, the first sample mask image, the first sample position information and the correction neural network;

obtaining a reference motion vector according to the first sample video frame and the second sample video frame;

obtaining a network loss of the correction neural network according to the corrected motion vector and the reference motion vector;

and training the correction neural network according to the network loss.
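
One possible form of the network loss in claim 4 is sketched below. The claim does not name a loss function; the mean squared error between the corrected motion vector and the reference motion vector is used here purely as an assumption, and `mse_loss` is a hypothetical name.

```python
from typing import List, Tuple

Vector = Tuple[float, float]
Field = List[List[Vector]]


def mse_loss(corrected: Field, reference: Field) -> float:
    """Mean over pixels of the squared distance between two motion vector fields."""
    total = 0.0
    count = 0
    for row_c, row_r in zip(corrected, reference):
        for (cx, cy), (rx, ry) in zip(row_c, row_r):
            total += (cx - rx) ** 2 + (cy - ry) ** 2
            count += 1
    return total / count


corrected = [[(1.0, 0.0), (0.0, 2.0)]]
reference = [[(1.0, 1.0), (0.0, 0.0)]]
loss = mse_loss(corrected, reference)   # (1 + 4) / 2 pixels
```

In an actual training loop this value would be back-propagated through the network by a deep learning framework; the pure-Python version only shows what is being compared.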

5. The method of claim 4, wherein obtaining a corrected motion vector based on the sample motion vector, the first sample mask image, the first sample position information, and the correction neural network comprises:

obtaining a sample component feature map according to the sample motion vector and a preset noise signal;

inputting the sample component feature map, the first sample mask image and the first sample position information into the correction neural network to obtain a sample correction amount;

and obtaining a corrected motion vector according to the sample correction amount and the sample motion vector.
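
The first step of claim 5 might look like the sketch below. It assumes that the preset noise signal is zero-mean Gaussian noise added to each component of the sample motion vector; the claim does not specify the noise distribution or how it is combined, so both choices, along with the names `noisy_component_maps`, `sigma` and `seed`, are illustrative assumptions.

```python
import random
from typing import List, Tuple

Vector = Tuple[float, float]


def noisy_component_maps(
    motion_field: List[List[Vector]],
    sigma: float,
    seed: int = 0,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Build x and y component maps with additive Gaussian noise (assumed form)."""
    rng = random.Random(seed)   # seeded so the "preset" noise is reproducible
    x_map = [[dx + rng.gauss(0.0, sigma) for dx, _ in row] for row in motion_field]
    y_map = [[dy + rng.gauss(0.0, sigma) for _, dy in row] for row in motion_field]
    return x_map, y_map


sample_field = [[(1.0, -2.0), (0.5, 0.0)]]
noisy_x, noisy_y = noisy_component_maps(sample_field, sigma=0.1, seed=0)
```

Perturbing the training input in this way would expose the network to motion vectors that deviate from the reference, which is the situation it has to correct at inference time.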

6. The method according to any one of claims 1 to 5, further comprising:

and obtaining a second mask image of the second video frame according to second position information of the contour key point of the target object in the second video frame, wherein the second mask image is an image representing the position and contour of the target object in the second video frame.

7. The method according to claim 6, wherein obtaining a second mask image of a second video frame according to second position information of contour key points of the target object in the second video frame comprises:

connecting the contour key points in the second video frame according to the relative relation between the contour key points in the first video frame to obtain the contour of the target object in the second video frame;

and obtaining the second mask image according to the contour of the target object in the second video frame.
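
The two steps of claim 7 can be sketched as follows, assuming that the "relative relation" between key points is simply their order along the contour in the first video frame, so that connecting the moved key points in the same order yields a closed polygon. The polygon is then filled with a standard ray-casting (even-odd) test on pixel centres. The names `point_in_polygon` and `contour_to_mask` are hypothetical.

```python
from typing import List, Tuple

Polygon = List[Tuple[float, float]]


def point_in_polygon(px: float, py: float, polygon: Polygon) -> bool:
    """Even-odd ray casting: count edge crossings of a ray going right from (px, py)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]   # wrap around to close the contour
        if (y1 > py) != (y2 > py):      # edge straddles the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def contour_to_mask(polygon: Polygon, width: int, height: int) -> List[List[int]]:
    """Rasterise the contour into a binary mask, sampling each pixel at its centre."""
    return [
        [1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0 for x in range(width)]
        for y in range(height)
    ]


# Key points listed in the same order as in the first frame.
square = [(1, 1), (4, 1), (4, 4), (1, 4)]
mask = contour_to_mask(square, width=5, height=5)
```

For this square contour the mask contains a 3x3 block of ones surrounded by a one-pixel border of zeros.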

8. A video processing apparatus, comprising:

an acquisition module, configured to acquire a first video frame in a video stream to be processed and a first motion vector between the first video frame and a second video frame, where the second video frame is any video frame after the first video frame;

a detection module, configured to detect a target object in the first video frame, and acquire first position information of a contour key point of the target object in the first video frame and a first mask image of the first video frame, where the first mask image is an image representing the position and contour of the target object in the first video frame, and the contour key point is located on the contour;

a correction module, configured to obtain a second motion vector according to the first motion vector, the first position information, and the first mask image, where the second motion vector is a corrected motion vector;

and a position obtaining module, configured to obtain second position information of the contour key point of the target object in the second video frame according to the second motion vector and the first position information.

9. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any one of claims 1 to 7.

10. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 7.