CN114565879A - Feature fusion method and device and video jitter elimination method and device - Google Patents

Feature fusion method and device and video jitter elimination method and device

Info

Publication number
CN114565879A
Authority
CN
China
Prior art keywords
image frame
current image
features
video
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210199531.2A
Other languages
Chinese (zh)
Inventor
黄钊金
戴宇荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202210199531.2A
Publication of CN114565879A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Abstract

The disclosure relates to a feature fusion method and device and a video jitter elimination method and device. The method comprises the following steps: acquiring at least one piece of first similarity information based on the features of a current image frame and the features of at least one reference image frame in a video to be processed, wherein the at least one reference image frame is a predetermined number of image frames that precede the current image frame in the video to be processed, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame, respectively; obtaining the features of the at least one reference image frame aligned with the current image frame based on the features of the at least one reference image frame and the at least one piece of first similarity information; and obtaining a fused feature of the current image frame based on the aligned features and the features of the current image frame.

Description

Feature fusion method and device and video jitter elimination method and device
Technical Field
The present disclosure relates to the field of video processing, and in particular, to a feature fusion method and apparatus, and a video jitter elimination method and apparatus.
Background
Video is information that people are exposed to in large quantities in daily life. With the development of deep learning, users hope to understand the content in a video through deep learning, for example, information such as how many people appear in the video and their movement trajectories. However, in some complex scenes, two similar image frames in a video may yield very different results: two image frames may look almost identical to the naked eye, yet a person's limbs are detected in the former frame and suddenly not detected in the latter. This is called the "jitter" phenomenon, and it can seriously affect the user's viewing experience and other quantitative indicators.
In order to solve the above problems, academia has proposed "temporal feature fusion methods" built on image algorithms, such as RNN or non-local methods. These temporal feature fusion methods can alleviate the jitter problem well, but they are only effective on academic datasets and run into many problems in real scenes, mainly because much of the data in real scenes does not appear in the academic datasets, so the algorithms perform poorly and the models need to be retrained with data from real scenes. However, this training process requires labeling video data from many real scenes, which is extremely difficult, especially for a temporal feature fusion module that needs a large amount of video data for training.
Disclosure of Invention
The present disclosure provides a feature fusion method and apparatus, and a video jitter elimination method and apparatus, so as to at least solve the problem in the related art that feature fusion methods require a large amount of video data for training, which leads to relatively high labor and other costs.
According to a first aspect of the embodiments of the present disclosure, there is provided a feature fusion method, including: acquiring at least one piece of first similarity information based on the features of a current image frame and the features of at least one reference image frame in a video to be processed, wherein the at least one reference image frame is a predetermined number of image frames that precede the current image frame in the video to be processed, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame, respectively; obtaining the features of the at least one reference image frame aligned with the current image frame based on the features of the at least one reference image frame and the at least one piece of first similarity information; and obtaining the fused feature of the current image frame based on the aligned features and the features of the current image frame.
Optionally, obtaining the fused feature of the current image frame based on the aligned features and the features of the current image frame includes: performing point multiplication of the aligned features and of the features of the current image frame with the features of the current image frame, respectively, to obtain at least two pieces of second similarity information; multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of second similarity information, respectively, to obtain at least two products; and summing the at least two products to obtain the fused feature of the current image frame.
Optionally, multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of second similarity information, respectively, to obtain at least two products includes: normalizing the at least two pieces of second similarity information to obtain at least two pieces of normalized second similarity information; and multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of normalized second similarity information, respectively, to obtain the at least two products.
Optionally, obtaining a fused feature of the current image frame based on the aligned feature and the feature of the current image frame includes: and obtaining the fused feature of the current image frame based on the aligned feature and the average value of the features of the current image frame.
Optionally, obtaining the features of the at least one reference image frame aligned with the current image frame based on the features of the at least one reference image frame and the at least one piece of first similarity information includes: performing scale change and normalization processing on the at least one piece of first similarity information; and multiplying the features of the at least one reference image frame by the at least one piece of first similarity information after the normalization processing, respectively, to obtain the features of the at least one reference image frame aligned with the current image frame.
Optionally, the features of the current image frame and the features of the at least one reference image frame are features output from any one layer except an output layer in a video processing model for image processing of the video to be processed, and the fused features are input to a layer next to the any one layer in the video processing model.
According to a second aspect of the embodiments of the present disclosure, there is provided a video judder removal method, including: inputting a video to be processed into a video processing model to obtain the characteristics of a current image frame and the characteristics of at least one reference image frame in the video to be processed output by any layer except an output layer in the video processing model, wherein the at least one reference image frame is a preset number of image frames in front of the current image frame in the video to be processed; fusing the characteristics of the current image frame and the characteristics of at least one reference image frame by the characteristic fusion method to obtain fused characteristics of the current image frame; and inputting the fused features into the next layer of any layer in the video processing model to obtain a processed video.
According to a third aspect of the embodiments of the present disclosure, there is provided a feature fusion apparatus including: the similarity obtaining unit is configured to obtain at least one piece of first similarity information based on the characteristics of a current image frame in the video to be processed and the characteristics of at least one reference image frame, wherein the at least one reference image frame is a preset number of image frames in front of the current image frame in the video to be processed, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame respectively; the alignment unit is configured to obtain features of at least one reference image frame aligned with a current image frame respectively based on the features of the at least one reference image frame and the at least one piece of first similarity information; a fused feature obtaining unit configured to obtain a fused feature of the current image frame based on the aligned feature and the feature of the current image frame.
Optionally, the fusion feature obtaining unit is further configured to perform point multiplication on the aligned features and the features of the current image frame respectively with the features of the current image frame to obtain at least two pieces of second similarity information; multiplying the aligned features and the features of the current image frame with corresponding second similarity information in at least two pieces of second similarity information respectively to obtain at least two products; and summing at least two products to obtain the fused characteristics of the current image frame.
Optionally, the fused feature obtaining unit is further configured to perform normalization processing on the at least two pieces of second similarity information to obtain at least two pieces of normalized second similarity information, and to multiply the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of normalized second similarity information, respectively, to obtain the at least two products.
Optionally, the fused feature obtaining unit is further configured to obtain a fused feature of the current image frame based on the aligned feature and an average value of the features of the current image frame.
Optionally, the alignment unit is further configured to perform scale change and normalization processing on the at least one piece of first similarity information, and to multiply the features of the at least one reference image frame by the at least one piece of first similarity information after the normalization processing, respectively, to obtain the features of the at least one reference image frame aligned with the current image frame.
Optionally, the features of the current image frame and the features of the at least one reference image frame are features output from any one layer except an output layer in a video processing model for image processing of the video to be processed, and the fused features are input to a layer next to the any one layer in the video processing model.
According to a fourth aspect of an embodiment of the present disclosure, there is provided a video judder removal apparatus including: the characteristic acquisition unit is configured to input a video to be processed into a video processing model, and obtain the characteristics of a current image frame in the video to be processed output by any layer except an output layer in the video processing model and the characteristics of at least one reference image frame, wherein the at least one reference image frame is a preset number of image frames in front of the current image frame in the video to be processed; the feature fusion unit is configured to fuse the features of the current image frame and the features of at least one reference image frame by the feature fusion method to obtain fused features of the current image frame; and the processing unit is configured to input the fused features into a layer next to any one layer in the video processing model to obtain a processed video.
According to a fifth aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement a feature fusion method and/or a video judder removal method according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by at least one processor, cause the at least one processor to perform a feature fusion method and/or a video judder removal method as described above according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a feature fusion method and/or a video jitter elimination method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the feature fusion method and device and the video jitter elimination method and device of the present disclosure, a feature fusion method is provided. The fusion method obtains first similarity information between the features of a current image frame in a video to be processed and the features of at least one reference image frame, where the at least one reference image frame is a predetermined number of image frames that precede the current image frame in the video to be processed. Based on the features of the at least one reference image frame and the at least one piece of first similarity information, the features of the at least one reference image frame aligned with the current image frame are obtained, and then, based on the aligned features and the features of the current image frame, the fused feature of the current image frame is obtained. It can be seen that the feature fusion method of the present disclosure requires neither labeled video data nor training: it can be nested directly on a trained model to realize feature fusion, which reduces labor cost. Moreover, the feature fusion method of the present disclosure obtains the fused feature of the current image frame based on the similarity information between the features of the current image frame and the features of the at least one reference image frame, so the fused feature takes the features of the image frames preceding the current image frame into account; that is, all the image frames of the video are smoothed, and the jitter phenomenon of the related art does not occur. Therefore, after the feature fusion method is introduced into the video processing process, the jitter problem in video processing can be alleviated well, and the problem in the related art that feature fusion methods require a large amount of video data for training, leading to high labor and other costs, is solved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a feature fusion method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow diagram illustrating a feature fusion method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating alignment of a feature according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a fused feature method insertion model in accordance with an exemplary embodiment;
FIG. 5 is a flow chart illustrating a method of video judder removal according to an exemplary embodiment;
FIG. 6 is a diagram illustrating a verification result in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating a feature fusion apparatus in accordance with an exemplary embodiment;
FIG. 8 is a block diagram illustrating a video judder removal device in accordance with an exemplary embodiment;
fig. 9 is a block diagram of an electronic device 900 according to an embodiment of the disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Herein, the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
The present disclosure provides a feature fusion method and a video jitter elimination method which require neither labeled video data nor training: they are nested directly on a trained model to realize feature fusion, and they can also alleviate the jitter problem occurring in video processing well. A scene of face segmentation in a video is taken as an example for description below.
Fig. 1 is a schematic diagram illustrating an implementation scenario of a feature fusion method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the implementation scenario includes a server 100, a user terminal 110, and a user terminal 120. The number of user terminals is not limited to two, and they include devices such as mobile phones and personal computers; a user terminal may be equipped with a camera for capturing face images. The server may be a single server, a server cluster formed by several servers, a cloud computing platform, or a virtualization center.
The server 100 receives a video on which face segmentation is to be performed from the user terminal 110 or 120. After receiving the video, the server 100 inputs it into a trained video segmentation model, and the features output by any layer except the output layer of the video segmentation model can be processed by the feature fusion method of the present disclosure. Taking the features output by the Nth layer of the video segmentation model as an example, for each image frame output by the Nth layer: at least one piece of first similarity information is acquired based on the features of the current image frame and the features of at least one reference image frame, where the at least one reference image frame is a predetermined number of image frames that precede the current image frame in the video on which face segmentation is to be performed, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame, respectively; the features of the at least one reference image frame aligned with the current image frame are obtained based on the features of the at least one reference image frame and the at least one piece of first similarity information; and the fused feature of the current image frame is obtained based on the aligned features and the features of the current image frame. After the fused feature is obtained, it is input into the (N+1)th layer, which completes the fusion of the features output by the Nth layer. The fused features are input into the layer following that layer in the video segmentation model to obtain the segmented video. Because the fused feature of the current image frame is obtained based on the similarity information between the features of the current image frame and the features of the at least one reference image frame, and the fused feature takes the features of the image frames preceding the current image frame into account, the segmented video avoids the jitter problem in video processing well.
Hereinafter, a feature fusion method and apparatus, a video jitter removal method and apparatus according to exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 is a flow diagram illustrating a feature fusion method according to an exemplary embodiment, as shown in fig. 2, the feature fusion method including the steps of:
in step S201, at least one piece of first similarity information is obtained based on the features of the current image frame in the video to be processed and the features of at least one reference image frame, where the at least one reference image frame is a predetermined number of image frames that precede the current image frame in the video to be processed, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame, respectively. In general, the features exist in matrix form, and the first similarity information may also be in matrix form. For example, the at least one piece of first similarity information may be obtained by multiplying the features of the current image frame in the video to be processed with the features of the at least one reference image frame, respectively, so as to obtain at least one first similarity matrix, where each first similarity matrix represents the similarity between positions of the feature map of the current image frame and positions of the feature map of the corresponding reference image frame.
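As an illustrative sketch of step S201 (written in NumPy; the function name and the assumption that the features are stored as (H, W, D) arrays following the notation used later in this description are not part of the original disclosure), one first similarity matrix may be computed as follows:

```python
import numpy as np

def first_similarity(feat_cur, feat_ref):
    """First similarity matrix between the current frame and one reference
    frame: the dot product between every spatial position of the current
    frame's feature map and every position of the reference frame's map.

    feat_cur, feat_ref: feature maps of shape (H, W, D).
    Returns a matrix of shape (H*W, H*W).
    """
    h, w, d = feat_cur.shape
    f = feat_cur.reshape(h * w, d)       # one row per position of the current frame
    f_ref = feat_ref.reshape(h * w, d)   # one row per position of the reference frame
    return f @ f_ref.T                   # pairwise position-to-position similarity
```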
Returning to fig. 2, in step S202, the features of the at least one reference image frame aligned with the current image frame are obtained based on the features of the at least one reference image frame and the at least one piece of first similarity information. The features may also be aligned using other alignment methods, such as optical flow alignment, which is not limited by the present disclosure.
According to an exemplary embodiment of the present disclosure, obtaining the features of the at least one reference image frame aligned with the current image frame based on the features of the at least one reference image frame and the at least one piece of first similarity information includes: performing scale change and normalization processing on the at least one piece of first similarity information; and multiplying the features of the at least one reference image frame by the at least one piece of first similarity information after the normalization processing, respectively, to obtain the features of the at least one reference image frame aligned with the current image frame. According to this embodiment, the subsequent fusion ratio of the reference image frames and the current image frame is controlled through the scale change and normalization processing, which ensures that the fused feature better meets the requirements and the fusion effect is better.
For example, fig. 3 is a schematic diagram illustrating feature alignment according to an exemplary embodiment. As shown in fig. 3, the F matrix contains the features of the current image frame, and the F1 matrix contains the features of an image frame before the current image frame (corresponding to the above reference image frame). The similarity between the two feature maps is obtained by multiplying the F matrix and the F1 matrix, that is, a first similarity matrix is obtained. The obtained first similarity matrix then undergoes scale transformation (scale) and normalization processing (e.g., softmax), and the F1 features are matrix-multiplied by the first similarity matrix after softmax; the result of this calculation is the aligned feature warp_F1. The same processing is performed on the other features F2 … FN (the other reference image frames) to obtain warp_F2 … warp_FN. The scale factor of the above scale change may be set as needed.
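A minimal sketch of the alignment in fig. 3, continuing the notation of the previous sketch (the function name, the default scale factor, and the softmax implementation details are assumptions), may look as follows:

```python
import numpy as np

def align_reference(feat_ref, sim, scale=1.0):
    """Align one reference frame's features to the current frame: scale the
    first similarity matrix, apply softmax over the reference positions, then
    matrix-multiply it with the reference features.

    feat_ref: (H, W, D) reference features (F1 in the figure).
    sim:      (H*W, H*W) first similarity matrix for this reference frame.
    scale:    scale factor of the "scale" step, set as needed.
    Returns the aligned feature warp_F1 with shape (H, W, D).
    """
    h, w, d = feat_ref.shape
    s = sim * scale                                # scale change
    s = np.exp(s - s.max(axis=1, keepdims=True))   # numerically stable softmax
    s = s / s.sum(axis=1, keepdims=True)
    warped = s @ feat_ref.reshape(h * w, d)        # matrix multiplication with F1
    return warped.reshape(h, w, d)
```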
Returning to fig. 2, in step S203, the fused feature of the current image frame is obtained based on the aligned features and the features of the current image frame. It should be noted that there are various ways to obtain the fused feature of the current image frame based on the aligned features and the features of the current image frame; it is only necessary that the aligned features and the features of the current image frame be fused. For example, the average value of the above features may be taken as the fused feature, or the similarity between the aligned features and the current image frame may be obtained and used as the fusion ratio, with fusion performed according to that ratio. Of course, other ways may also be used, which is not limited by the present disclosure.
The following two modes are highlighted: 1) obtaining the similarity between the aligned features and the current image frame as the fusion ratio and fusing according to that ratio; 2) taking the average value of the features as the fused feature.
According to an exemplary embodiment of the present disclosure, obtaining the fused feature of the current image frame based on the aligned features and the features of the current image frame includes: performing point multiplication of the aligned features and of the features of the current image frame with the features of the current image frame, respectively, to obtain at least two pieces of second similarity information; multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of second similarity information, respectively, to obtain at least two products; and summing the at least two products to obtain the fused feature of the current image frame. According to this embodiment, the similarity between the aligned features and the current image frame is obtained as the fusion ratio and fusion is performed according to that ratio, that is, the fused feature is obtained by weighted summation, which can bring a better fusion effect.
For example, first, the similarity matrix of warp_F1 and F, that is, the second similarity information, is computed: SIM_F1 = dot(F, warp_F1), where the shape of SIM_F1 is (H, W, 1) and the shapes of F and warp_F1 are (H, W, D); the specific calculation is a dot product of the two features (warp_F1 and F). Similarly, for the aligned features of the other image frames (warp_F2 … warp_FN), the same calculation is used to obtain SIM_F2 … SIM_FN, and SIM_F is defined as dot(F, F). It should be noted that H and W are the dimensions of the matrix, and D is the dimension of each value in the matrix; for example, a value in the matrix may be 1, or {1, 1}, etc. Then, taking the similarity matrices as the weights of the aligned features and the features of the current image frame, weighted summation is performed to obtain the fused feature, which may be: new_F = SIM_F * F + SIM_F1 * warp_F1 + … + SIM_FN * warp_FN.
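A minimal NumPy sketch of this weighted-sum fusion (the function name is an assumption) may be:

```python
import numpy as np

def fuse_weighted(feat_cur, warped_refs):
    """Weighted-sum fusion: each frame's contribution is weighted by its
    per-position dot-product similarity to the current frame.

    feat_cur:    (H, W, D) features F of the current frame.
    warped_refs: list of aligned features warp_F1 ... warp_FN, each (H, W, D).
    Returns new_F with shape (H, W, D).
    """
    feats = [feat_cur] + list(warped_refs)
    # SIM_Fi = dot(F, warp_Fi): per-position dot product, shape (H, W, 1)
    sims = [np.sum(feat_cur * f, axis=-1, keepdims=True) for f in feats]
    # new_F = SIM_F * F + SIM_F1 * warp_F1 + ... + SIM_FN * warp_FN
    return sum(s * f for s, f in zip(sims, feats))
```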
According to an exemplary embodiment of the present disclosure, multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of second similarity information, respectively, to obtain at least two products includes: normalizing the at least two pieces of second similarity information to obtain at least two pieces of normalized second similarity information; and multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of normalized second similarity information, respectively, to obtain the at least two products. According to this embodiment, the fusion ratio of the reference image frames and the current image frame is controlled through normalization processing, which ensures that the fused feature better meets the requirements and the fusion effect is better.
For example, before the weighted summation, SIM_F, SIM_F1, and SIM_F2 … SIM_FN may be normalized, and the normalization method may be an L1 norm, L2 norm, or softmax method, which is not limited by the present disclosure. Taking the L1 norm method as an example, the normalized second similarity information is SIM_F1' = SIM_F1 / (SIM_F + SIM_F1 + SIM_F2 + … + SIM_FN), and SIM_F2' … SIM_FN' can be obtained in the same way. Then, weighted summation is performed using the normalized similarities, and the fused feature is: new_F = SIM_F' * F + SIM_F1' * warp_F1 + … + SIM_FN' * warp_FN.
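The L1-normalized variant may be sketched as follows (the small epsilon added to the denominator is an assumption for numerical stability, not part of the original formula):

```python
import numpy as np

def fuse_weighted_normalized(feat_cur, warped_refs, eps=1e-8):
    """Weighted-sum fusion with L1-normalized weights: the second similarity
    maps are divided by their sum so the per-position weights of all frames
    add up to 1."""
    feats = [feat_cur] + list(warped_refs)
    sims = [np.sum(feat_cur * f, axis=-1, keepdims=True) for f in feats]
    total = sum(sims)                         # SIM_F + SIM_F1 + ... + SIM_FN
    sims = [s / (total + eps) for s in sims]  # SIM_Fi' = SIM_Fi / total
    # new_F = SIM_F' * F + SIM_F1' * warp_F1 + ... + SIM_FN' * warp_FN
    return sum(s * f for s, f in zip(sims, feats))
```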
According to an exemplary embodiment of the present disclosure, obtaining the fused feature of the current image frame based on the aligned features and the features of the current image frame includes: obtaining the fused feature of the current image frame based on the average value of the aligned features and the features of the current image frame. According to this embodiment, the fused feature is obtained by taking the average value, so the fused feature can be obtained at the fastest speed. For example, an L1 norm, L2 norm, softmax, or similar method may also be used here.
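A sketch of this mean fusion (function name assumed) may be:

```python
import numpy as np

def fuse_mean(feat_cur, warped_refs):
    """Mean fusion: the element-wise average of the current frame's features
    and all aligned reference features."""
    return np.mean(np.stack([feat_cur] + list(warped_refs)), axis=0)
```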
According to an exemplary embodiment of the present disclosure, the features of the current image frame and the features of the at least one reference image frame are features output by any layer other than the output layer in a video processing model used for image processing of the video to be processed, and the fused feature is input to the layer following that layer in the video processing model. According to this embodiment, the feature fusion method of the present disclosure can be used in a plug-and-play manner, that is, it can be nested directly on a trained model.
For example, fig. 4 is a schematic diagram illustrating insertion of the feature fusion method into a model according to an exemplary embodiment. As shown in fig. 4, the network framework is exemplified by U2NET (the left side is the U2NET network model, which can be replaced by other network models), and the multi-frame fusion module on the right side is the feature fusion method of the present disclosure. That is, the features F of the current image frame can be extracted from the output features of any intermediate layer of the model and fused with the features F1, F2 … FN corresponding to the several image frames preceding the current image frame, and the fused feature new_F replaces the original feature F as the input to the subsequent network of the model.
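A possible plug-in wrapper, reusing the sketches above (the class, its interface, and the default N=5 are illustrative assumptions, not the U2NET API), may look as follows:

```python
from collections import deque

class MultiFrameFusion:
    """Plug-in multi-frame fusion module. It keeps the features that one
    intermediate layer produced for the last N image frames and fuses them
    into the current frame's features before they go to the next layer.
    Relies on first_similarity, align_reference, and fuse_weighted_normalized
    defined in the sketches above."""

    def __init__(self, num_reference_frames=5, scale=1.0):
        self.buffer = deque(maxlen=num_reference_frames)  # features F1 ... FN
        self.scale = scale

    def __call__(self, feat_cur):
        warped = [
            align_reference(ref, first_similarity(feat_cur, ref), self.scale)
            for ref in self.buffer
        ]
        fused = fuse_weighted_normalized(feat_cur, warped) if warped else feat_cur
        self.buffer.append(feat_cur)   # current features become a reference later
        return fused
```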
It should be noted that the network architecture described above may be replaced by any other network, which is not limited by the present disclosure. The feature fusion method can be used in one layer of the model, or in multiple layers simultaneously, that is, feature fusion processing is performed once in each such layer; how many layers are selected for feature fusion processing can be determined according to the trade-off between processing speed and accuracy.
In the overall framework of the present disclosure, the number of reference image frames selected to assist feature fusion can be controlled according to the required trade-off between speed and accuracy: if the requirement for stability is high, a large N (e.g., 20) can be selected, and if the requirement for stability is not high, an ordinary N (e.g., 5) can be selected. In addition, the feature F may be output by any layer of the network model, or may be obtained from multiple layers, that is, feature fusion processing is performed on all of those layers. If high edge stability is desired, the feature F may be the output feature of a layer close to the network output, for example the output feature of the second-to-last layer, which makes the edges more stable; if the stability of both the edges and the whole limbs is desired, the 2nd-, 4th-, and 8th-to-last layers can be selected at the same time for feature fusion processing, but this affects the speed of the network model. Therefore, which layers' output features to use can be balanced according to accuracy and speed.
Fig. 5 is a flow chart illustrating a video judder removal method according to an exemplary embodiment. Referring to fig. 5, the video judder removing method includes the steps of:
in step S501, a video to be processed is input into a video processing model, and features of a current image frame in the video to be processed output by any layer except an output layer in the video processing model and features of at least one reference image frame are obtained, where the at least one reference image frame is a predetermined number of image frames in front of the current image frame in the video to be processed.
In step S502, the features of the current image frame and the features of at least one reference image frame are fused by the above-mentioned feature fusion method, so as to obtain the fused features of the current image frame.
For example, based on the characteristics of a current image frame in the video to be processed and the characteristics of at least one reference image frame, acquiring at least one piece of first similarity information, wherein the at least one reference image frame is a preset number of image frames in front of the current image frame in the video to be processed, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame respectively; obtaining the characteristics of at least one reference image frame aligned with the current image frame respectively based on the characteristics of at least one reference image frame and at least one piece of first similarity information; and obtaining the fused feature of the current image frame based on the aligned feature and the feature of the current image frame.
In step S503, the fused features are input into the layer next to any layer in the video processing model, so as to obtain a processed video.
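An end-to-end sketch of steps S501 to S503 (the callables layers_before and layers_after stand for the two halves of a trained video processing model split at the chosen layer; they are assumptions, not a concrete model API, and fusion can be the MultiFrameFusion module sketched above) may be:

```python
def remove_judder(frames, layers_before, layers_after, fusion):
    """Run each frame through the layers up to layer N, fuse the layer-N
    features with those of the previous frames, then feed the fused features
    to the remaining layers."""
    processed = []
    for frame in frames:                        # image frames of the video to be processed
        feat = layers_before(frame)             # features output by layer N (step S501)
        fused = fusion(feat)                    # multi-frame feature fusion (step S502)
        processed.append(layers_after(fused))   # remaining layers (step S503)
    return processed
```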
In order to verify the feasibility of the present disclosure, an image segmentation task is taken as an example. Fig. 6 is a schematic diagram illustrating a verification result according to an exemplary embodiment. As shown in fig. 6, the first column is the original image, the second column is the result without the method of the present disclosure, and the third column is the result with the method of the present disclosure. It can be seen that there is a false detection in the middle column but none after the method of the present disclosure is introduced, because the method of the present disclosure performs temporal smoothing, which suppresses these occasional false detections.
In summary, the feature fusion method of the present disclosure does not require labeling a large amount of real-scene video data; a good effect can be achieved by nesting the feature fusion method of the present disclosure onto a model trained on real scenes. On the basis of alleviating the jitter problem, it introduces no parameters that need to be trained, is plug-and-play, and can be nested on any existing trained network model, and the trade-off between speed and accuracy can be controlled by adjusting the number of frames N and the number of layers whose features F undergo feature fusion processing.
FIG. 7 is a block diagram illustrating a feature fusion apparatus according to an exemplary embodiment. Referring to fig. 7, the apparatus includes:
a similarity obtaining unit 70 configured to obtain at least one piece of first similarity information based on features of a current image frame in the video to be processed and features of at least one reference image frame, wherein the at least one reference image frame is a predetermined number of image frames in front of the current image frame in the video to be processed, and the at least one piece of first similarity information is similarity information between the current image frame and the at least one reference image frame respectively; an alignment unit 72 configured to obtain features of the at least one reference image frame aligned with the current image frame, respectively, based on the features of the at least one reference image frame and the at least one first similarity information; a fused feature obtaining unit 74 configured to obtain a fused feature of the current image frame based on the aligned feature and the feature of the current image frame.
According to an exemplary embodiment of the present disclosure, the fused feature obtaining unit 74 is further configured to perform a dot multiplication on the aligned features and the features of the current image frame with the features of the current image frame respectively to obtain at least two pieces of second similarity information; multiplying the aligned features and the features of the current image frame with corresponding second similarity information in at least two pieces of second similarity information respectively to obtain at least two products; and summing at least two products to obtain the fused characteristics of the current image frame.
According to an exemplary embodiment of the present disclosure, the fused feature obtaining unit 74 is further configured to perform normalization processing on the at least two pieces of second similarity information to obtain at least two pieces of normalized second similarity information, and to multiply the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of normalized second similarity information, respectively, to obtain the at least two products.
According to an exemplary embodiment of the present disclosure, the fused feature obtaining unit 74 is further configured to obtain a fused feature of the current image frame based on the aligned features and an average value of the features of the current image frame.
According to an exemplary embodiment of the present disclosure, the alignment unit 72 is further configured to perform scale change and normalization processing on the at least one piece of first similarity information, and to multiply the features of the at least one reference image frame by the at least one piece of first similarity information after the normalization processing, respectively, to obtain the features of the at least one reference image frame aligned with the current image frame.
According to an exemplary embodiment of the present disclosure, the features of the current image frame and the features of the at least one reference image frame are features output from any one layer except an output layer in a video processing model for image processing of a video to be processed, and the fused features are input to a layer next to the any one layer in the video processing model.
Fig. 8 is a block diagram illustrating a video judder removal device according to an example embodiment. Referring to fig. 8, the apparatus includes:
a feature obtaining unit 80 configured to input a video to be processed into a video processing model, and obtain features of a current image frame in the video to be processed output by any layer except an output layer in the video processing model and features of at least one reference image frame, wherein the at least one reference image frame is a predetermined number of image frames in front of the current image frame in the video to be processed; a feature fusion unit 82 configured to fuse the features of the current image frame and the features of the at least one reference image frame by the feature fusion method as described above to obtain fused features of the current image frame; and a processing unit 84 configured to input the fused features into a layer next to any layer in the video processing model to obtain a processed video.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 9 is a block diagram of an electronic device 900 according to an embodiment of the present disclosure. The electronic device 900 includes at least one memory 901, which stores a set of computer-executable instructions, and at least one processor 902; when the set of instructions is executed by the at least one processor, the feature fusion method and/or the video judder removal method according to the embodiments of the present disclosure is performed.
By way of example, the electronic device 900 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The electronic device 900 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or in combination. The electronic device 900 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 900, the processor 902 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor 902 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 902 may execute instructions or code stored in memory, where the memory 901 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 901 may be integrated with the processor 902, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 901 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 901 and the processor 902 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 902 is able to read files stored in the memory 901.
In addition, the electronic device 900 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the feature fusion method and/or the video judder removal method of the embodiments of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can run in an environment deployed in a computer device, such as a client, a host, a proxy device, or a server. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, a computer program product is provided, which includes computer instructions that, when executed by a processor, implement the feature fusion method and/or the video jitter elimination method of the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of feature fusion, comprising:
acquiring at least one piece of first similarity information based on the characteristics of a current image frame and the characteristics of at least one reference image frame in a video to be processed, wherein the at least one reference image frame is a preset number of image frames in the video to be processed before the current image frame, and the at least one piece of first similarity information is the similarity information between the current image frame and the at least one reference image frame respectively;
obtaining the characteristics of the at least one reference image frame after being respectively aligned with the current image frame based on the characteristics of the at least one reference image frame and the at least one first similarity information;
and obtaining the fused feature of the current image frame based on the aligned feature and the feature of the current image frame.
2. The feature fusion method of claim 1, wherein said deriving fused features of the current image frame based on the aligned features and the features of the current image frame comprises:
performing point multiplication on the aligned features and the features of the current image frame respectively to obtain at least two pieces of second similarity information;
multiplying the aligned features and the features of the current image frame with corresponding second similarity information in the at least two pieces of second similarity information respectively to obtain at least two products;
and summing the at least two products to obtain the fused feature of the current image frame.
3. The feature fusion method according to claim 2, wherein the multiplying the aligned features and the features of the current image frame by corresponding second similarity information of the at least two pieces of second similarity information to obtain at least two products comprises:
normalizing the at least two pieces of second similarity information to obtain at least two pieces of normalized second similarity information;
and multiplying the aligned features and the features of the current image frame by the corresponding second similarity information among the at least two pieces of normalized second similarity information, respectively, to obtain the at least two products.
4. The feature fusion method of claim 1, wherein said deriving fused features of the current image frame based on the aligned features and the features of the current image frame comprises:
and obtaining the fused feature of the current image frame based on the aligned feature and the average value of the features of the current image frame.
5. A video judder removal method, comprising:
inputting a video to be processed into a video processing model to obtain the characteristics of a current image frame and the characteristics of at least one reference image frame in the video to be processed output by any layer except an output layer in the video processing model, wherein the at least one reference image frame is a preset number of image frames in front of the current image frame in the video to be processed;
fusing the features of the current image frame and the features of the at least one reference image frame by the feature fusion method according to any one of claims 1 to 4 to obtain fused features of the current image frame;
and inputting the fused features into the next layer of any one layer in the video processing model to obtain a processed video.
6. A feature fusion apparatus, comprising:
a similarity obtaining unit configured to obtain at least one piece of first similarity information based on features of a current image frame and features of at least one reference image frame in a video to be processed, wherein the at least one reference image frame is a predetermined number of image frames in front of the current image frame in the video to be processed, and the at least one piece of first similarity information is similarity information between the current image frame and the at least one reference image frame respectively;
an alignment unit configured to obtain features of the at least one reference image frame aligned with the current image frame, respectively, based on the features of the at least one reference image frame and the at least one first similarity information;
a fused feature obtaining unit configured to obtain a fused feature of the current image frame based on the aligned feature and the feature of the current image frame.
7. A video judder removal apparatus, comprising:
the image processing device comprises a characteristic acquisition unit, a processing unit and a processing unit, wherein the characteristic acquisition unit is configured to input a video to be processed into a video processing model, and obtain the characteristics of a current image frame in the video to be processed output by any layer except an output layer in the video processing model and the characteristics of at least one reference image frame, wherein the at least one reference image frame is a preset number of image frames in front of the current image frame in the video to be processed;
a feature fusion unit configured to fuse the features of the current image frame and the features of the at least one reference image frame by the feature fusion method according to any one of claims 1 to 4 to obtain fused features of the current image frame;
a processing unit configured to input the fused feature into a layer next to the any layer in the video processing model to obtain a processed video.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the feature fusion method of any one of claims 1 to 4 and/or the video judder removal method of claim 5.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the feature fusion method of any one of claims 1 to 4 and/or the video judder removal method of claim 5.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the feature fusion method according to any one of claims 1 to 4 and/or the video jitter elimination method according to claim 5.
CN202210199531.2A 2022-03-02 2022-03-02 Feature fusion method and device and video jitter elimination method and device Pending CN114565879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210199531.2A CN114565879A (en) 2022-03-02 2022-03-02 Feature fusion method and device and video jitter elimination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210199531.2A CN114565879A (en) 2022-03-02 2022-03-02 Feature fusion method and device and video jitter elimination method and device

Publications (1)

Publication Number Publication Date
CN114565879A (en) 2022-05-31

Family

ID=81714931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210199531.2A Pending CN114565879A (en) 2022-03-02 2022-03-02 Feature fusion method and device and video jitter elimination method and device

Country Status (1)

Country Link
CN (1) CN114565879A (en)

Similar Documents

Publication Publication Date Title
US10140709B2 (en) Automatic detection and semantic description of lesions using a convolutional neural network
CN110377740B (en) Emotion polarity analysis method and device, electronic equipment and storage medium
CN110955659B (en) Method and system for processing data table
CN113139628B (en) Sample image identification method, device and equipment and readable storage medium
CN113033537A (en) Method, apparatus, device, medium and program product for training a model
CN111523413B (en) Method and device for generating face image
US11308077B2 (en) Identifying source datasets that fit a transfer learning process for a target domain
US11941496B2 (en) Providing predictions based on a prediction accuracy model using machine learning
CN111079944A (en) Method and device for realizing interpretation of transfer learning model, electronic equipment and storage medium
CN111435367A (en) Knowledge graph construction method, system, equipment and storage medium
CN114565768A (en) Image segmentation method and device
CN111144215A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114266937A (en) Model training method, image processing method, device, equipment and storage medium
CN113657411A (en) Neural network model training method, image feature extraction method and related device
CN116542673B (en) Fraud identification method and system applied to machine learning
CN110956131A (en) Single-target tracking method, device and system
CN113470124B (en) Training method and device for special effect model, and special effect generation method and device
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN114565767A (en) Image segmentation method and device
US11645049B2 (en) Automated software application generation
CN114565879A (en) Feature fusion method and device and video jitter elimination method and device
CN115080856A (en) Recommendation method and device and training method and device of recommendation model
CN113223017A (en) Training method of target segmentation model, target segmentation method and device
CN113139463A (en) Method, apparatus, device, medium and program product for training a model
CN111915339A (en) Data processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination