CN111950653A - Video processing method and device, storage medium and electronic equipment - Google Patents

Video processing method and device, storage medium and electronic equipment

Info

Publication number
CN111950653A
Authority
CN
China
Prior art keywords
video frame
scene
video
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010858714.1A
Other languages
Chinese (zh)
Other versions
CN111950653B (en)
Inventor
王晟玮
汪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010858714.1A priority Critical patent/CN111950653B/en
Publication of CN111950653A publication Critical patent/CN111950653A/en
Application granted granted Critical
Publication of CN111950653B publication Critical patent/CN111950653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video processing method and device, a storage medium and electronic equipment. The method comprises the following steps: acquiring a target video to be processed; sequentially extracting features of each video frame in the target video to obtain an image feature set corresponding to each video frame; dividing all video frames in the target video according to the image feature sets to obtain a first scene video frame list; sequentially acquiring the feature similarity between each key video frame and a reference video frame located before the key video frame; under the condition that the feature similarity reaches a merging condition, merging the scene video frame sequence of the first scene in which the key video frame is located into the scene video frame sequence of the second scene in which the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list; and performing segmentation processing on the target video according to the second scene video frame list. The invention solves the technical problem of low accuracy of video segmentation processing in the related art.

Description

Video processing method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of computers, in particular to a video processing method and device, a storage medium and electronic equipment.
Background
After a video playing platform obtains an original video file from the copyright holder, the file usually needs to be converted into a standard code stream meeting the platform's requirements before it can be distributed to user clients for playback. During this conversion, the transcoding console typically divides the original video file into a plurality of video segments so that video quality enhancement and encoding can be performed on each segment, and finally merges the encoded segments to obtain a complete video stream file. To keep the playback quality of consecutive video frames in the same scene consistent, the video needs to be divided according to scenes.
However, the video segmentation methods provided by the related art usually perform segmentation based on a single feature of the video; for example, scene analysis is performed based on the semantics of subtitles or speech, and the video is then segmented according to the analysis result. Scene analysis based on a single feature is incomplete, however, and the applicable range of such segmentation is limited, so the video segmentation accuracy is low.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a video processing method and device, a storage medium and electronic equipment, which are used for at least solving the technical problem of low accuracy of video segmentation processing in the related technology.
According to an aspect of an embodiment of the present invention, there is provided a video processing method, including: acquiring a target video to be processed; sequentially extracting features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frame; dividing all video frames in the target video according to the image feature sets to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes contained in the target video are recorded in the first scene video frame list, and the first video frame in each scene video frame sequence is a key video frame of the scene; sequentially acquiring the feature similarity between each key video frame and a reference video frame located before the key video frame; under the condition that the feature similarity reaches a merging condition, merging the scene video frame sequence of a first scene in which the key video frame is located into the scene video frame sequence of a second scene in which the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list; and carrying out segmentation processing on the target video according to the second scene video frame list.
According to another aspect of the embodiments of the present invention, there is also provided a video processing apparatus, including: the first acquisition unit is used for acquiring a target video to be processed; the first extraction unit is used for sequentially extracting the features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frames; a dividing unit, configured to divide all video frames in the target video according to the image feature set to obtain a first scene video frame list, where scene video frame sequences respectively corresponding to multiple scenes included in the target video are recorded in the first scene video frame list, and a first video frame in each scene video frame sequence is a key video frame of the scene; a second obtaining unit, configured to sequentially obtain feature similarity between each of the key video frames and a reference video frame located before the key video frame; a merging updating unit, configured to merge a scene video frame sequence in a first scene in which the key video frame is located into a scene video frame sequence in a second scene in which the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list, when the feature similarity meets a merging condition; and the division processing unit is used for carrying out division processing on the target video according to the second scene video frame list.
According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to execute the above-mentioned video processing method when running.
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, the memory having a computer program stored therein, the processor being configured to execute the video processing method described above through the computer program.
In the embodiment of the invention, after the image feature set corresponding to each video frame in the target video is obtained, all the video frames in the target video are primarily divided by using the image feature set, so as to obtain the first scene video frame list. And then acquiring the feature similarity between the key video frame and the reference video frame of each scene, and further determining whether to merge a scene video frame sequence in a first scene where the key video frame is located and a scene video frame sequence in a second scene where the reference video frame is located according to a judgment result of whether the feature similarity reaches a merging condition, so as to further update the first scene video frame list, obtain a second scene video frame list, and segment the target video according to the second scene video frame list. That is to say, after a plurality of image features are fused to divide a target video to obtain a first scene video frame list, scene relevance among all video frames in the target video is analyzed by combining feature similarity of the video frames, so that comprehensive fine analysis of scene characteristics of the video frames is realized, the video segmentation processing accuracy is improved without being limited to an analysis result of a single feature, and the problem of low accuracy of video segmentation processing in the related technology is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment for an alternative video processing method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of an alternative video processing method according to an embodiment of the invention;
FIG. 3 is a flow diagram of another alternative video processing method according to an embodiment of the invention;
fig. 4 is a schematic network structure diagram of a neural network in an alternative video processing method according to an embodiment of the present invention;
FIG. 5 is a flow diagram of yet another alternative video processing method according to an embodiment of the present invention;
FIG. 6 is a flow diagram of yet another alternative video processing method according to an embodiment of the present invention;
FIG. 7 is a flow diagram of yet another alternative video processing method according to an embodiment of the present invention;
FIG. 8 is a block diagram of an alternative video processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, the video processing method provided by the embodiment of the present application relates to the following technical terms:
HSV: Hue, Saturation, Value. A color coding method commonly used in the image field.
YUV: a color coding method commonly used for video, where Y represents luminance and U and V represent chrominance.
CNN: Convolutional Neural Network.
Sift: Scale-Invariant Feature Transform, a scale-invariant feature extraction algorithm in image processing.
MFCC: Mel-Frequency Cepstral Coefficients, a feature of audio data.
SSIM: Structural Similarity Index Measure, an image quality evaluation index that can also be used to evaluate the similarity of two images.
According to an aspect of the embodiments of the present invention, there is provided a video processing method, optionally as an optional implementation manner, the video processing method may be applied to, but is not limited to, a video processing system in an environment as shown in fig. 1, where the video processing system may include, but is not limited to, a terminal device 102, a network 104, and a server 106. Here, the terminal device 102 includes a human-machine interaction screen 1022, a processor 1024, and a memory 1026. The human-computer interaction screen 1022 is used to present a target video. The processor 1024 is configured to transmit the target video to the server 106, and the memory 1026 is configured to store the video resource of the target video.
In addition, the server 106 includes a database 1062 and a processing engine 1064, where the database 1062 is used to store a scene video frame list corresponding to the target video and a plurality of divided video clips. The processing engine 1064 is configured to perform segmentation processing on the target video by using the method provided in this embodiment.
The specific process comprises the following steps: in steps S104 to S112, after the server 106 acquires the target video, feature extraction is sequentially performed on each video frame in the target video to obtain an image feature set corresponding to each video frame. All video frames in the target video are then divided according to the image feature sets to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes contained in the target video are recorded in the first scene video frame list, and the first video frame in each scene video frame sequence is the key video frame of that scene. The feature similarity between each key video frame and the reference video frame located before the key video frame is obtained in turn; under the condition that the feature similarity meets the merging condition, the scene video frame sequence of the first scene in which the key video frame is located is merged into the scene video frame sequence of the second scene in which the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list, thereby achieving segmentation processing of the target video according to the second scene video frame list.
It should be noted that, in this embodiment, after the image feature set corresponding to each video frame in the target video is obtained, all video frames in the target video are primarily divided by using the image feature set, so as to obtain the first scene video frame list. And then acquiring the feature similarity between the key video frame and the reference video frame of each scene, and further determining whether to merge a scene video frame sequence in a first scene where the key video frame is located and a scene video frame sequence in a second scene where the reference video frame is located according to a judgment result of whether the feature similarity reaches a merging condition, so as to further update the first scene video frame list, obtain a second scene video frame list, and segment the target video according to the second scene video frame list. That is to say, after a plurality of image features are fused to divide a target video to obtain a first scene video frame list, scene relevance among all video frames in the target video is analyzed by combining feature similarity of the video frames, so that comprehensive fine analysis of scene characteristics of the video frames is realized, the video segmentation processing accuracy is improved without being limited to an analysis result of a single feature, and the problem of low accuracy of video segmentation processing in the related technology is solved.
Optionally, in this embodiment, the terminal device may include, but is not limited to, at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), notebook computers, tablet computers, palm computers, MID (Mobile Internet Devices), PAD, desktop computers, smart televisions, etc. Such networks may include, but are not limited to: a wired network, a wireless network, wherein the wired network comprises: a local area network, a metropolitan area network, and a wide area network, the wireless network comprising: bluetooth, WIFI, and other networks that enable wireless communication. The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is merely an example, and this is not limited in this embodiment.
It should be noted that the above-mentioned video processing method may be executed by a terminal or a server independently, or executed by a terminal device and a server cooperatively, and the following description in the embodiment of the present application takes an example in which the method is executed by a server.
Optionally, as an optional implementation manner, as shown in fig. 2, the video processing method includes:
s202, acquiring a target video to be processed;
s204, sequentially extracting the features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frames;
s206, dividing all video frames in the target video according to the image feature set to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes in the target video are recorded in the first scene video frame list, and the first video frame in each scene video frame sequence is a key video frame of the scene;
s208, sequentially acquiring the feature similarity between each key video frame and a reference video frame located before the key video frame;
s210, under the condition that the feature similarity reaches a merging condition, merging the scene video frame sequence of the first scene in which the key video frame is located into the scene video frame sequence of the second scene in which the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list;
and S212, segmenting the target video according to the second scene video frame list.
Optionally, in this embodiment, the video processing method may be applied, but is not limited, to a server corresponding to each target client that carries video content, where the target client is, for example, a video sharing client or a video playing client. In order to ensure the viewing experience at the target client, after the original video file is obtained from the copyright holder, transcoding is usually performed once so that the original video file is converted into a standard code stream meeting requirements, and the standard code stream is then distributed to each target client for playing and display. During transcoding, the transcoding console needs to segment the original video file according to scenes to form a plurality of video segments, so that video quality enhancement and encoding can be performed on each segment in a distributed manner; finally, the code streams of all the video segments are merged to form a complete video stream file to be played, which is pushed to each target client for playing and display. That is to say, the video processing method provided in this embodiment performs comprehensive, fine segmentation of the original video file according to scenes, thereby ensuring that the picture quality of consecutive video frames of the same scene is continuous in the merged video stream file, and avoiding the situation in the prior art where inaccurate division causes consecutive video frames of the same scene to fall into different video segments that undergo different image coding processes, resulting in uneven playback quality. In other words, the video processing method provided in the embodiment of the present application uses the analysis result of multi-feature fusion to ensure the accuracy of the scene video frame list obtained by video division, so that different image enhancement and picture quality improvement processing can be performed on each accurately divided video segment according to its scene, thereby improving the playing smoothness and picture quality continuity of the video stream file obtained by integrating the video segments and improving the viewing experience of the user.
Optionally, in this embodiment, the image feature set may include, but is not limited to: the mean value, over all pixel points in the video frame, of each corresponding image color component parameter, each mean value serving as one image feature of the video frame. The image color component parameters may include, but are not limited to, the parameters of the target color coding space: hue, saturation and value (Hue, Saturation, Value; HSV for short). For example, the mean hue of all pixel points of a video frame is obtained as the hue feature of the video frame; the mean saturation of all pixel points is obtained as the saturation feature; and the mean value (brightness) of all pixel points is obtained as the value feature. That is to say, assuming that the original format of the video frame is YUV and it is mapped to the HSV color coding space, the mean values of all pixel points on the three image color component parameters of hue, saturation and value can be obtained to serve as the image feature set corresponding to the video frame. The parameters involved in the image feature set are not limited in this embodiment, and may also be the RGB three-color-component parameters or other parameters indicating image characteristics.
Optionally, in this embodiment, the dividing all video frames in the target video according to the image feature set to obtain the first scene video frame list may include, but is not limited to: and calculating the average value of each image feature in the image feature set to obtain the target image feature matched with the video frame. Further, all video frames in the target video are divided by using the difference indicated by the comparison result of the target image characteristics corresponding to the two adjacent video frames, so as to obtain a first scene video frame list.
For example, taking the HSV features of a video frame as an example, after the image feature set formed by the three image features H_avg, S_avg and V_avg of the i-th video frame is obtained, the mean value HSV_avg of the three image features is taken as the target image feature of the i-th video frame. The target image feature of the adjacent (i+1)-th video frame is then calculated by the same procedure. By comparing the difference between the two (for example, calculating the distance between them), it is determined whether the i-th video frame and the (i+1)-th video frame belong to the same scene: if the difference is smaller than a certain threshold, the two video frames are merged as video frames of the same scene; if the difference is larger than the threshold, they are treated as video frames of different scenes. That is, based on the primary feature HSV_avg of each video frame, the target video is initially divided according to scenes to generate video frame sequences recorded per scene, which serve as the first scene video frame list.
Optionally, in this embodiment, the feature similarity between the key video frame and the reference video frame before the key video frame may include, but is not limited to: the cosine distance between the key feature vector of the key video frame and the reference feature vector of the reference video frame, the first proportion of the matching feature points between the key video frame and the reference video frame in the key video frame, and the second proportion of the matching feature points between the key video frame and the reference video frame in the reference video frame. The key characteristic vector and the reference characteristic vector are obtained by processing based on a convolutional neural network; the matching feature points are obtained by respectively extracting and comparing feature points of the key video frames and the reference video frames by using a Sift feature operator. In addition, the reference video frame may be, but is not limited to, a video frame before and adjacent to the key video frame, i.e., a video frame before the key video frame. That is, the key video frames in the first scene video frame list and the high-level features in the reference video frames adjacent before the key video frames are analyzed using the convolutional neural network and the Sift feature operator. Further using a feature fusion algorithm to judge the similarity of the two frames, and if the two frames are similar, merging the scenes; otherwise, the original scene division is reserved, so that the first scene video frame list is updated to the second scene video frame list, and the purpose of performing refined division updating on the scene in the target video is achieved.
The convolutional neural network may be, but is not limited to, MobileNet V2; networks such as Inception Net and PasNet may also be used and can achieve the same function. In addition, Sift here refers to the scale-invariant feature transform, a computer vision algorithm with translation, rotation and scale invariance. It comprises the following steps: 1) constructing a scale space, detecting extreme points and obtaining scale invariance; 2) filtering and accurately locating the feature points, removing unstable feature points; 3) extracting feature descriptors at the feature points and assigning direction values to them; 4) generating the feature descriptors and searching for matching points using them.
According to the embodiment provided by the application, after the target video is divided by fusing a plurality of image features to obtain the first scene video frame list, the scene relevance among all the video frames in the target video is analyzed by combining the feature similarity of the video frames, so that the scene characteristics of the video frames are comprehensively and finely analyzed without being limited to the analysis result of a single feature, the accuracy of video segmentation processing is improved, and the problem of low accuracy of video segmentation processing in the related technology is solved.
As an optional scheme, sequentially performing feature extraction on each video frame in the target video to obtain an image feature set corresponding to each video frame includes:
s1, sequentially taking each video frame in the target video as a current video frame to execute the following feature extraction operations until all video frames in the target video are traversed:
s2, mapping the current video frame to a target color coding space to extract the parameter value of each pixel point in the current video frame mapped to each image color component parameter in the target color coding space, wherein the target color coding space comprises at least two image color component parameters;
and S3, determining an image feature set matched with the current video frame according to the parameter value of the image color component parameter of each pixel point.
Optionally, in this embodiment, determining, according to the parameter value of the image color component parameter of each pixel point, an image feature set matched with the current video frame includes: obtaining the mean value of the parameter values of the ith image color component parameters of each pixel point to obtain the ith image characteristic of the current video frame, wherein i is an integer which is greater than or equal to 1 and less than or equal to N, N is the number of the image color component parameters in the target color coding space, and N is a positive integer.
Optionally, in this embodiment, the target color coding space may be, but is not limited to, an HSV color space, where the following image color component parameters are included: hue, Saturation, lightness (Hue, Value, HSV for short). When the video frame is mapped to the target color coding space, each pixel point in the video frame is mapped to the target color coding space, so that the parameter value of each image color component parameter of each pixel point in the target color coding space is obtained. Further, the mean value of the ith image color component parameter is used as the ith image feature in the image feature set corresponding to the video frame.
The description is made with reference to the following example: the original format of each video frame in the target video may be, but is not limited to, YUV. When the video frames are mapped to the HSV color space, each video frame is sequentially taken as the current video frame. Suppose the height and width of the current video frame are denoted by M and N respectively, (i, j) denotes the position coordinate of any pixel point in the current video frame, and H, S and V denote the values of the three image color component parameters of that pixel point. The corresponding image feature set is computed by the following formulas:

H_avg = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} H(i, j)
S_avg = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} S(i, j)    (1)
V_avg = (1/(M·N)) Σ_{i=1}^{M} Σ_{j=1}^{N} V(i, j)

where H_avg denotes the mean value of the hue component of all pixel points in the current video frame, S_avg denotes the mean value of the saturation component, and V_avg denotes the mean value of the value (brightness) component.

Based on these calculation results, the three image features corresponding to the three image color component parameters of the current video frame are obtained and form the image feature set of the current video frame. The image feature set corresponding to every other video frame in the target video is obtained in the same manner.
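As a non-authoritative illustration of this step, the following minimal Python sketch (using OpenCV and NumPy; the helper name and the BGR input convention are assumptions, not part of the patent) computes the per-frame HSV mean features described above:

```python
import cv2

def hsv_feature_set(frame_bgr):
    """Compute (H_avg, S_avg, V_avg) for one decoded video frame.

    frame_bgr: an M x N x 3 uint8 image, e.g. as returned by cv2.VideoCapture.read().
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)        # map the frame into HSV color space
    h_avg, s_avg, v_avg = hsv.reshape(-1, 3).mean(axis=0)   # mean over all M*N pixel points
    return float(h_avg), float(s_avg), float(v_avg)
```

In practice the source frames would be decoded from the YUV code stream, and the color-conversion call would change accordingly.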
The full flow of the above video processing method is described below with reference to the steps shown in Fig. 3:
After the target video is obtained in step S302, step S304 is executed to extract the HSV features of each video frame, and the first scene video frame list is obtained through comparison. Then, in step S306-1, feature vectors are generated for the key video frame in the first scene video frame list and for the reference video frame immediately before it, and the feature vectors are compared to obtain a cosine distance. In addition, in step S306-2, the Sift operator is used to extract the Sift feature points of the key video frame and the reference video frame, the matching feature points are obtained by comparison, and then the first proportion of the matching feature points in the feature point set of the key video frame and the second proportion in the feature point set of the reference video frame are obtained. Finally, in step S308, the features are fused to decide whether to merge or retain each video frame sequence in the first scene video frame list, so as to obtain the updated second scene video frame list. The target video is then divided based on the video frame sequence of each finely divided scene in the second scene video frame list, ensuring the accuracy of the obtained video segments and, in turn, the uniformity of the picture quality of the video stream file obtained by re-merging.
According to the embodiment provided by the application, the parameter values of the image color component parameters corresponding to the pixel points of each video frame in the target video in the target color coding space are sequentially obtained, so that the image features in the image feature set corresponding to the video frame are obtained through calculation by using the parameter values, and therefore, the primary division of all the video frames in the target video is realized by using the image features, and the coarsely divided first scene video frame list is obtained.
As an optional scheme, dividing all video frames in the target video according to the image feature set to obtain the first scene video frame list includes:
s1, obtaining the mean value of each image feature in the image feature set, and taking the mean value of the image features as the target image features matched with the video frames;
s2, sequentially comparing the target image characteristics corresponding to the two adjacent video frames to obtain a comparison result;
and S3, dividing all the video frames according to the comparison result to obtain a first scene video frame list.
The description is made with reference to the following example: continuing with the above scenario, after the image feature set (H_avg, S_avg, V_avg) corresponding to the current video frame is obtained, the mean value of the image features of the current video frame (i.e., the HSV mean) can be calculated by the following formula:

HSV_avg = (H_avg + S_avg + V_avg) / 3    (2)

The image feature mean HSV_avg is then taken as the target image feature corresponding to the current video frame, and the target image features are used in the subsequent comparison to obtain the first scene video frame list.
Optionally, in this embodiment, sequentially comparing the target image features corresponding to the two adjacent video frames to obtain a comparison result includes: acquiring a feature difference value between the target image feature of the (j+1)-th video frame and the target image feature of the j-th video frame, where j is an integer greater than or equal to 1 and less than or equal to M-1, and M is the number of video frames in the target video; and comparing the feature difference value with a first threshold to obtain the comparison result.
Optionally, in this embodiment, dividing all the video frames according to the comparison result to obtain the first scene video frame list includes: when the comparison result indicates that the feature difference value is smaller than the first threshold, determining that the (j+1)-th video frame and the j-th video frame are in the same scene, and adding the (j+1)-th video frame to the scene video frame sequence in which the j-th video frame is located; and when the comparison result indicates that the feature difference value is greater than or equal to the first threshold, determining that the (j+1)-th video frame and the j-th video frame are not in the same scene, and creating a new scene video frame sequence for the (j+1)-th video frame.
The description is made with reference to the following example: after the target image feature HSV_avg of each video frame is calculated by the above formula, the target image features of two adjacent frames can be compared to obtain a feature difference value. The feature difference value is then compared with a first threshold (denoted by T_HSV) to determine whether to merge the two adjacent video frames into a video frame sequence of the same scene.
For example, if the feature difference value of the target image features of two adjacent frames (i.e., the (j+1)-th video frame and the j-th video frame) is smaller than the threshold T_HSV, the (j+1)-th video frame and the j-th video frame are considered to belong to the same scene, and both are stored in the scene frame sequence corresponding to that scene. If the feature difference value is larger than the threshold T_HSV, the two frames are considered not to belong to the same scene, and the (j+1)-th video frame is stored in the video frame sequence of a new scene; this continues until all video frames in the target video are traversed, yielding the initial first scene video frame list. The threshold T_HSV may be set to, but is not limited to, 25.
Assuming that the above operations are performed on 10 video frames in the target video, the process may be as follows:
obtaining the value of each image color component parameter of each pixel point in each video frame in HSV color space, and calculating the mean value to obtain an image feature set (H)avg,Savg,Vavg). Then, mean value calculation is carried out based on the image feature set so as to obtain target image features of all video frames, such as HSV (hue, saturation, value) in sequence1、HSV2、HSV3…HSV10. Then, the feature difference value of the target image features of two adjacent video frames is sequentially obtained and is compared with a first threshold value THSVAnd (6) carrying out comparison. Such as HSV comparison1-HSV2Difference sum of (1)HSVObtaining a difference value smaller than THSVIf it is determined that the 1 st video frame (frame id 1) and the 2 nd video frame (frame id 2) are the same scene, the two frames may be stored in the video frame list as a video frame sequence corresponding to the scene, such as the data recorded in the first item shown in table 1. Comparison of HSV2-HSV3Difference sum of (1)HSVTo obtain a difference value greater than THSVIf it is determined that the second video frame and the third video frame are not the same scene, the 3 rd video frame (frame id 3) is stored in the video frame list corresponding to the scene two, such as the data recorded in the second entry shown in table 1. And in the same way, obtaining the video frame sequence corresponding to each scene so as to obtain the first scene video frame list after the initial division.
TABLE 1

Scene    Scene frame sequence (frame IDs)
One      1, 2
Two      3
Three    4, 5, 6
Four     7, 8
Five     9, 10
According to the embodiment provided by the application, after the target image characteristics matched with each video frame are obtained, the characteristic difference value between two adjacent target image characteristics and the first threshold value are compared, and whether the two adjacent video frames are in the same scene or not is further determined according to the comparison result, so that the first scene video frame list is obtained through rapid division.
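A minimal sketch of this initial threshold-based division is given below; it reuses the hypothetical hsv_feature_set helper from the earlier sketch and the example threshold of 25 mentioned above, and the nested-list output mirrors Table 1:

```python
T_HSV = 25  # first threshold, set to the example value suggested above

def initial_scene_list(frames):
    """Coarsely divide a frame sequence into per-scene lists of frame IDs
    (the "first scene video frame list")."""
    scenes = []
    prev_hsv_avg = None
    for frame_id, frame in enumerate(frames, start=1):
        h_avg, s_avg, v_avg = hsv_feature_set(frame)
        hsv_avg = (h_avg + s_avg + v_avg) / 3.0       # target image feature HSV_avg
        if prev_hsv_avg is None or abs(hsv_avg - prev_hsv_avg) >= T_HSV:
            scenes.append([frame_id])                 # feature difference too large: new scene
        else:
            scenes[-1].append(frame_id)               # same scene as the previous frame
        prev_hsv_avg = hsv_avg
    return scenes

# The first frame ID in each inner list is the key video frame of that scene.
```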
As an alternative, sequentially obtaining the feature similarity between each key video frame and the reference video frame located before the key video frame includes:
s1, acquiring a key feature vector of the key video frame and a reference feature vector of the reference video frame;
s2, obtaining cosine distance between the key feature vector and the reference feature vector, wherein the feature similarity comprises the cosine distance;
s3, acquiring matching feature points in the key video frame and the reference video frame;
s4, acquiring a first proportion of the matched feature points in the key video frame and a second proportion of the matched feature points in the reference video frame, wherein the feature similarity comprises the first proportion and the second proportion.
Optionally, in this embodiment, the obtaining the key feature vector of the key video frame and the reference feature vector of the reference video frame includes: respectively preprocessing the key video frame and the reference video frame to obtain a candidate key video frame and a candidate reference video frame; inputting the candidate key video frame into a lightweight convolutional neural network to obtain a key characteristic vector, and inputting the candidate reference video frame into the lightweight convolutional neural network to obtain a reference characteristic vector, wherein the lightweight convolutional neural network is a neural network for generating a characteristic vector of an image, which is obtained after machine training is performed by utilizing a plurality of groups of sample image pairs and corresponding label information, each group of sample image pairs in the plurality of groups of sample image pairs comprises a first frame image in a first sample scene and a last frame image in a second sample scene, the second sample scene is adjacent to and before the first sample scene, and the label information comprises a scene label of the first frame image and a scene label of the last frame image.
Optionally, in this embodiment, the preprocessing may include, but is not limited to: performing format conversion and size adjustment on the key video frame and the reference video frame so as to meet the input requirements of the lightweight convolutional neural network. For example, if the lightweight convolutional neural network is a MobileNet V2 network, the video frames may be adjusted to three-channel RGB images of size 224 × 224, i.e., a candidate key video frame in RGB format of size 224 × 224 and a candidate reference video frame in RGB format of size 224 × 224 are obtained. In practical applications, the preprocessing can be adjusted according to the memory occupation limits and accuracy requirements of the neural network model actually used.
In addition, in this embodiment, the lightweight convolutional neural network may be, but is not limited to, a MobileNet V2 network; it may also be an Inception Net, PasNet or other network, all of which can achieve the same function.
Specifically, the lightweight convolutional neural network is illustrated here by taking a MobileNet V2 network as an example, whose network structure may be as shown in Fig. 4: one convolutional layer is followed by 17 depthwise separable convolution layers, and finally 2 convolutional layers and 1 pooling layer are connected. After processing through this network, a feature vector of dimension 1280 × 1 is generated.
In the embodiment of the present application, the MobileNet V2 network with the above structure performs the following operations on the key video frame F_n and the reference video frame F_{n-1}:
The key video frame F_n in the first scene video frame list is downsampled to a 224 × 224 image and converted into an RGB image, so as to obtain the candidate key video frame. The candidate key video frame is then used as the input of the MobileNet V2 network, yielding the key feature vector I_n of dimension 1280 × 1 matched with the key video frame F_n. The reference video frame F_{n-1} is processed in the same way to obtain the reference feature vector I_{n-1} of dimension 1280 × 1 matched with the reference video frame F_{n-1}. Finally, the cosine distance between the key feature vector I_n and the reference feature vector I_{n-1} is calculated by the following formula:

d_cos = (I_n · I_{n-1}) / (||I_n|| ||I_{n-1}||)    (3)

where d_cos is the cosine distance between the key feature vector I_n and the reference feature vector I_{n-1}, and its value may range from, but is not limited to, 0 to 1.
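For illustration only, the sketch below approximates this step with torchvision's MobileNetV2 backbone (ImageNet weights as a stand-in for the twin-trained network described later); the 224 × 224 resize, the 1280-dimensional feature and the cosine measure follow the description above, while the pooling and weight choices are assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Stand-in backbone; the patent's embodiment would use its own trained MobileNet V2.
backbone = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT).features.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),   # downsample the frame to 224 x 224
    transforms.ToTensor(),           # 3-channel RGB tensor in [0, 1]
])

@torch.no_grad()
def frame_embedding(frame_rgb):
    """Return a 1280-dimensional feature vector I for one RGB frame (H x W x 3 uint8)."""
    x = preprocess(frame_rgb).unsqueeze(0)      # 1 x 3 x 224 x 224
    fmap = backbone(x)                          # 1 x 1280 x 7 x 7 feature map
    return fmap.mean(dim=(2, 3)).squeeze(0)     # global average pooling -> 1280-d vector

def cosine_distance(i_n, i_n_minus_1):
    """d_cos of equation (3); the document uses "cosine distance" for this similarity
    value, which approaches 1 for frames of the same scene."""
    return F.cosine_similarity(i_n.unsqueeze(0), i_n_minus_1.unsqueeze(0)).item()
```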
In addition, in this embodiment, the Sift operator may be used, but is not limited to being used, to extract the feature points in the key video frame and the feature points in the reference video frame. The two sets are then compared to obtain the matching feature points, and the first proportion of the matching feature points in the key video frame and the second proportion of the matching feature points in the reference video frame are further obtained.
For example, assume that the number of feature points in the key video frame F_n is N_n, the number of feature points in the reference video frame F_{n-1} is N_{n-1}, and the number of matching feature points is M. The proportions can then be obtained by the following formulas:

p_n = M / N_n,    p_{n-1} = M / N_{n-1}    (4)

where p_n denotes the first proportion of the matching feature points in the key video frame F_n, and p_{n-1} denotes the second proportion of the matching feature points in the reference video frame F_{n-1}.
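An OpenCV sketch of this feature-point step is shown below; the brute-force matcher and the 0.7 ratio test are illustrative choices rather than values taken from the patent (the embodiment later mentions fast nearest-neighbour matching, which could be substituted):

```python
import cv2

sift = cv2.SIFT_create()

def match_proportions(key_frame_bgr, ref_frame_bgr, ratio=0.7):
    """Return (p_n, p_{n-1}): the share of matched Sift feature points in F_n and F_{n-1}."""
    gray_key = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
    gray_ref = cv2.cvtColor(ref_frame_bgr, cv2.COLOR_BGR2GRAY)
    kp_key, des_key = sift.detectAndCompute(gray_key, None)   # N_n feature points
    kp_ref, des_ref = sift.detectAndCompute(gray_ref, None)   # N_{n-1} feature points
    if not kp_key or not kp_ref:
        return 0.0, 0.0
    # Keep reliable matches with Lowe's ratio test over 2-nearest-neighbour matches.
    knn = cv2.BFMatcher().knnMatch(des_key, des_ref, k=2)
    m = sum(1 for pair in knn if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)
    return m / len(kp_key), m / len(kp_ref)                    # p_n = M/N_n, p_{n-1} = M/N_{n-1}
```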
It should be noted that the above manner is merely an example in this embodiment; for the two high-level features, i.e., the feature vector and the matching feature point proportion, methods such as logical rules, multiple kernel learning, or correlation-based multivariate statistics may also be used. In practical applications, the precision and recall can be tuned according to actual requirements.
According to the embodiment provided by the application, the feature vectors of the key video frame and the reference video frame are extracted through the convolutional neural network, the matched feature points of the key video frame and the reference video frame are obtained through the Sift operator, the first proportion and the second proportion are obtained through calculation, and therefore the feature similarity of the first proportion and the second proportion is obtained, and whether the first scene video frame list is updated or not is determined by utilizing the feature similarity. That is to say, the two high-level features of the feature vector and the proportion of the matched feature point are interpreted in a feature fusion mode to determine whether to recombine the video frames or not, and an updated second scene video frame list is obtained.
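To make the fusion step concrete, one possible merge decision is sketched below; the thresholds T_COS and T_P and the way the two cues are combined are invented for illustration, since the embodiment does not fix them at this point:

```python
T_COS = 0.85   # hypothetical threshold on the cosine distance d_cos
T_P = 0.5      # hypothetical threshold on the match proportions

def should_merge(d_cos, p_n, p_n_minus_1):
    """Example fused merging condition over the two high-level features."""
    return d_cos >= T_COS or (p_n >= T_P and p_n_minus_1 >= T_P)

def refine_scene_list(scene_list, frames):
    """Update the first scene video frame list into the second scene video frame list.
    Color-space conversions between the helpers are elided for brevity."""
    merged = [list(scene_list[0])]
    for scene in scene_list[1:]:
        key_id, ref_id = scene[0], merged[-1][-1]     # key frame and the frame just before it
        key_frame, ref_frame = frames[key_id - 1], frames[ref_id - 1]
        d = cosine_distance(frame_embedding(key_frame), frame_embedding(ref_frame))
        p_n, p_ref = match_proportions(key_frame, ref_frame)
        if should_merge(d, p_n, p_ref):
            merged[-1].extend(scene)                  # merge into the reference frame's scene
        else:
            merged.append(list(scene))                # keep the original division
    return merged
```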
Optionally, in this embodiment, before acquiring the target video to be processed, the method further includes: acquiring a plurality of sample videos, and extracting a plurality of groups of image pairs in each sample video; taking each group of image pairs as a current group of image pairs, and executing the following operations until reaching the convergence condition of the lightweight convolutional neural network: inputting a first frame image in a first sample scene in a current group of image pairs into a first training convolutional neural network to obtain a first characteristic vector, and inputting a last frame image in a second sample scene in the current group of image pairs into a second training convolutional neural network to obtain a second characteristic vector, wherein a twin network structure is used during training of the lightweight convolutional neural network, the twin network structure comprises the first training convolutional neural network and the second training convolutional neural network, and the first training convolutional neural network and the second training convolutional neural network share a training weight; obtaining a cosine distance between the first feature vector and the second feature vector, and taking the cosine distance between the first feature vector and the second feature vector as the feature distance; inputting the characteristic distance and the label information into a loss function to calculate to obtain a current loss value; obtaining a loss value difference value of a current loss value and a last loss value of the current loss value; and under the condition that the loss value difference indicates that the twin network structure reaches the convergence condition, taking the first training convolutional neural network or the second training convolutional neural network which finishes training at present as a lightweight convolutional neural network.
It should be noted that, in this embodiment, the lightweight convolutional neural network may be a MobileNet V2 network, and the training procedure of the MobileNet V2 network may be as follows:
and selecting a plurality of sample videos to cover various categories as much as possible. And then, pre-dividing the sample video by using an HSV feature extraction method provided in the embodiment of the application, and reserving a first frame image and a last frame image of each divided scene. Pairing the last frame image (reference video frame) of the last scene with the first frame image (key video frame) of the current scene to form a group of image pairs, and manually marking whether the images are scene labels of the same scene to form a data set for training. And then, dividing the data set, selecting 75% of data as a training set, and using the rest 25% of data as a testing set. Moreover, it needs to be ensured that the proportion of the positive example to the negative example in the training is about 1: 1.
A training network is then built. The network used in training is a twin (Siamese) network structure: as shown in Fig. 5, two MobileNet V2 networks with the same structure (namely, a first training convolutional neural network and a second training convolutional neural network) are built.
Specifically, for each image pair (e.g., image 1 and image 2) in the data set described above, two networks of MobileNet V2 (e.g., MobileNet V2-1 and MobileNet V2-2) are input, respectively, as in steps S502-1 and S502-2. Then, as shown in steps S504-1 and S504-2, two feature vectors (e.g., feature vector 1 and feature vector 2) are generated through the processing of the MobileNet V2 network, as shown in steps S506-1 and S506-2. Finally, in step S508, a cosine distance is calculated for the two feature vectors.
Meanwhile, it should be noted that the two MobileNet V2 networks in the twin structure share weights during training. That is, after each training, the weighting parameters in one of the networks (e.g., MobileNet V2-1) are fixed, and the weighting parameters in the other network (e.g., MobileNet V2-2) are updated by back-propagation. And then, directly synchronizing the updated weight parameters to the MobileNet V2-1 network, thereby realizing the synchronous training update of the two MobileNet V2 networks, and after the convergence condition is reached, taking any one of the two networks as the actual service.
Further, the loss function established during the training process may be as follows:

[Equation (5): the loss function L, computed over the N training image pairs, consists of a term based on the cosine distances d_cos,i and the labels y_i plus a weight-decay term.]

Here, N denotes the total number of image pairs in the data set used for training, and the subscript i denotes the sample currently processed. y_i denotes the label of the current sample (i.e., the i-th group of image pairs): when the two images of the image pair belong to the same scene, y_i is 1; when they do not, y_i is 0. d_cos,i denotes the cosine distance of the two images in the current sample, which approaches 1 when the two images belong to the same scene and approaches 0 when they do not.
Further, as can be seen from equation (5), the loss function is composed of two parts: the former part is based on the cosine distance, and the latter part is weight decay, which allows the network to learn smoother weights and improves its generalization capability.
In addition, in this embodiment, the difference between the loss values of two adjacent training iterations may be obtained during training, and the training is determined to have reached the convergence condition when this difference stays below a certain threshold for several consecutive iterations. The loss value of each iteration is the value L calculated by equation (5) above from the label information y_i and the cosine distance d_{cos,i}.
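A minimal sketch of the convergence check described here, assuming a recorded list of per-iteration loss values; the threshold `eps` and the patience of three consecutive iterations are illustrative assumptions:

```python
def has_converged(loss_history, eps=1e-4, patience=3):
    """Return True when the absolute difference between adjacent loss values
    has stayed below `eps` for `patience` consecutive iterations."""
    if len(loss_history) < patience + 1:
        return False
    recent_prev = loss_history[-patience - 1:-1]
    recent_next = loss_history[-patience:]
    return all(abs(a - b) < eps for a, b in zip(recent_prev, recent_next))
```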
After the training of the network is completed, the network performance needs to be tested on the pre-separated test set; indexes such as the accuracy and the recall of the network are calculated to evaluate whether the network meets the service requirements. If it meets the preset standard, the network is put into use; if not, the network is retrained.
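For the evaluation step, a hedged sketch of computing accuracy and recall on the held-out test set, assuming same-scene predictions are obtained by thresholding the cosine distance (the 0.5 decision threshold is an assumption, not taken from the patent):

```python
def evaluate(labels, cos_distances, threshold=0.5):
    """Accuracy and recall for same-scene detection on the test set."""
    preds = [1 if d > threshold else 0 for d in cos_distances]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / len(labels)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall
```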
According to the embodiment provided by the application, the key feature vector corresponding to the key video frame and the reference feature vector corresponding to the reference video frame are calculated by the lightweight convolutional neural network. Because the lightweight convolutional neural network is obtained by training with a twin network structure, the scene video frame list can be further finely divided based on these high-level features, which improves the accuracy of the video segmentation processing.
As an alternative, the obtaining the matching feature points in the key video frame and the reference video frame includes:
s1, converting the key video frame into a key video frame gray image and converting the reference video frame into a reference video frame gray image;
s2, extracting a key feature point set from the key video frame gray image and a reference feature point set from the reference video frame gray image by using a scale-invariant feature transform (SIFT) operator;
and S3, comparing the key characteristic point set with the reference characteristic point set to obtain the matching characteristic points.
In this embodiment, after the first scene video frame list is obtained, the SIFT feature operator may be used to obtain the matching feature points; the specific process may refer to the flow illustrated in fig. 6. Specifically, assume that the reference video frame F_{n-1} and the key video frame F_n currently to be processed are determined as in steps S602-1 and S602-2. The input video frames are then converted into gray-scale images respectively: a gray-scale image of the reference video frame is obtained as in step S604-1, and a gray-scale image of the key video frame is obtained as in step S604-2. Steps S606-1 and S606-2 are then executed to extract feature points with the SIFT operator, so as to obtain a key feature point set and a reference feature point set. The number of feature points in the key video frame F_n is recorded as N_n, and the number of feature points in the reference video frame F_{n-1} is recorded as N_{n-1}. After the comparison and matching in step S608, the number of matching feature points, denoted M, is obtained in step S610.
In the process of comparing and matching the feature points in step S608, a fast nearest-neighbour matching method may be used to determine the matching feature points in the two video frames. Specifically, assume that the key video frame F_n is taken as the reference; for each feature point Q_i extracted from the key video frame F_n, the following operations are performed:
The Euclidean distance between each feature point extracted from the reference video frame F_{n-1} and the feature point Q_i is obtained. Once the two nearest neighbours (denoted a and b, with a being the nearer of the two) are found, the Euclidean distance S_a between Q_i and feature point a and the Euclidean distance S_b between Q_i and feature point b are acquired. Then S_a is compared with T*S_b; if S_a < T*S_b, it is determined that a matched feature point exists in the reference video frame F_{n-1}, namely point a. The value of T may be 0.5.
It should be noted that the above is only an example; the reference video frame F_{n-1} may instead be taken as the reference, and the corresponding matching feature points are searched for in the key video frame F_n. The search manner may refer to the above process and is not described again here.
Further, suppose that the number of feature points in the key video frame F_n is recorded as N_n, the number of feature points in the reference video frame F_{n-1} is recorded as N_{n-1}, and the number of matching feature points is M. The proportions of the matching feature points among the feature points of the two frames, namely the first ratio and the second ratio, can then be calculated from these values, as illustrated in the sketch below.
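The following is a minimal OpenCV sketch of the SIFT extraction and the fast nearest-neighbour (FLANN) ratio test described above. The function name `match_ratios` and the FLANN index/search parameters are illustrative assumptions; the ratio threshold T = 0.5 follows the text:

```python
import cv2

def match_ratios(key_frame, ref_frame, T=0.5):
    """Count ratio-test matches between a key frame and its reference frame,
    returning (p_n, p_n_1): the share of matched points in each frame."""
    gray_key = cv2.cvtColor(key_frame, cv2.COLOR_BGR2GRAY)
    gray_ref = cv2.cvtColor(ref_frame, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    kp_key, des_key = sift.detectAndCompute(gray_key, None)   # N_n points
    kp_ref, des_ref = sift.detectAndCompute(gray_ref, None)   # N_{n-1} points
    if des_key is None or des_ref is None:
        return 0.0, 0.0

    # fast nearest-neighbour (FLANN) matching with the two closest candidates
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    knn = flann.knnMatch(des_key, des_ref, k=2)

    # ratio test: keep Q_i only if S_a < T * S_b
    M = sum(1 for pair in knn
            if len(pair) == 2 and pair[0].distance < T * pair[1].distance)

    p_n = M / len(kp_key)     # first ratio: share of matches in the key frame
    p_n_1 = M / len(kp_ref)   # second ratio: share of matches in the reference frame
    return p_n, p_n_1
```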
According to the embodiment provided by the application, the feature points in the key video frame and in the reference video frame are extracted by the SIFT feature operator, so that the first proportion of the matched feature points in the key video frame and the second proportion of the matched feature points in the reference video frame are calculated based on these feature points. The scene video frame list is thus further finely divided based on high-level features, which improves the accuracy of the video segmentation processing.
As an optional scheme, after sequentially acquiring feature similarity between each key video frame and a reference video frame located before the key video frame, the method further includes:
1) determining that the feature similarity reaches a merging condition under the condition that the cosine distance is greater than a second threshold;
2) determining that the feature similarity meets a merging condition under the condition that the cosine distance is smaller than or equal to a second threshold, the first ratio is larger than a third threshold and the second ratio is larger than a fourth threshold;
3) under the condition that the cosine distance is smaller than or equal to a second threshold value and the first ratio is smaller than or equal to a third threshold value, determining that the feature similarity does not reach a merging condition, and reserving a scene video frame sequence in a first scene where the key video frame is located;
4) and under the condition that the cosine distance is less than or equal to a second threshold value and the second ratio is less than or equal to a fourth threshold value, determining that the feature similarity does not reach a merging condition, and reserving the scene video frame sequence in the first scene where the key video frame is positioned.
The description is made with reference to the example shown in fig. 7. Suppose the cosine distance d_cos between the key video frame F_n and the reference video frame F_{n-1} is acquired. After extraction with the SIFT feature operator, the first ratio of matched feature points is p_n and the second ratio is p_{n-1}. The threshold compared with the cosine distance is T_cos, and the threshold compared with the first and second ratios is T_sift.
Specifically, when feature fusion is performed, the cosine distance d_cos between the key video frame F_n and the reference video frame F_{n-1} is acquired in step S702. Step S704 is then executed to determine whether d_cos is greater than the threshold T_cos. If d_cos > T_cos, the scene where the key video frame F_n is located and the scene where the reference video frame F_{n-1} is located are merged directly; that is, in step S710-2, the video frame sequence corresponding to the scene where the key video frame is located is merged into the video frame sequence corresponding to the scene where the reference video frame is located. If d_cos is not greater than T_cos, the determination continues: step S706 is executed to determine whether the first proportion p_n of the matched feature points in the feature point set of the key video frame is greater than the threshold T_sift. If the first ratio p_n is less than or equal to the threshold T_sift, the respective scene segmentation results of the key video frame and the reference video frame are retained in step S710-1.
If the first ratio p_n is greater than the threshold T_sift, step S708 is executed to determine whether the second proportion p_{n-1} of the matched feature points in the feature point set of the reference video frame is greater than the threshold T_sift. If the second ratio p_{n-1} is less than or equal to the threshold T_sift, the respective scene segmentation results of the key video frame and the reference video frame are retained in step S710-1. If the second ratio p_{n-1} is greater than the threshold T_sift, the video frame sequence corresponding to the scene where the key video frame is located is merged into the video frame sequence corresponding to the scene where the reference video frame is located in step S710-2.
The execution order of step S706 and step S708 may be exchanged; the above is only an example, and this embodiment is not limited thereto. That is, only when the first ratio p_n and the second ratio p_{n-1} are both greater than the threshold T_sift is the scene where the key video frame F_n is located merged with the scene where the reference video frame F_{n-1} is located, i.e., in step S710-2 the video frame sequence corresponding to the scene where the key video frame is located is merged into the video frame sequence corresponding to the scene where the reference video frame is located. If either ratio is less than or equal to the threshold, the scene division of the key video frame and the reference video frame in the first scene video frame list is retained. The threshold T_cos may be, but is not limited to, 0.8, and the threshold T_sift may be, but is not limited to, 0.3; this is only an example and is not limited in this embodiment. A minimal decision sketch follows below.
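The cascade above can be summarized in a few lines; this is a hedged sketch only, with the example thresholds 0.8 and 0.3 taken from the text and the function name `should_merge` assumed for illustration:

```python
def should_merge(d_cos, p_n, p_n_1, T_cos=0.8, T_sift=0.3):
    """Decide whether the key frame's scene is merged into the reference
    frame's scene, following the cascade of steps S704/S706/S708."""
    if d_cos > T_cos:                     # step S704: high-level features agree
        return True                       # step S710-2: merge the scenes
    if p_n > T_sift and p_n_1 > T_sift:   # steps S706/S708: both ratios pass
        return True                       # step S710-2: merge the scenes
    return False                          # step S710-1: keep both scene divisions
```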
According to the embodiment provided by the application, the cosine distance calculated from the feature vectors of the key video frame and the reference video frame, together with the first proportion and the second proportion of the matching feature points obtained from the feature points extracted by the SIFT feature operator, are used as high-level features to further update and adjust the first scene video frame list, so as to obtain a finely segmented second scene video frame list and ensure the accuracy of the video segmentation result.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided a video processing apparatus for implementing the above-described video processing method. As shown in fig. 8, the apparatus includes:
a first obtaining unit 802, configured to obtain a target video to be processed;
a first extraction unit 804, configured to perform feature extraction on each video frame in a target video in sequence to obtain an image feature set corresponding to each video frame, where the image feature set includes at least two image features of the video frame;
a dividing unit 806, configured to divide all video frames in a target video according to an image feature set to obtain a first scene video frame list, where scene video frame sequences respectively corresponding to multiple scenes included in the target video are recorded in the first scene video frame list, and a first video frame in each scene video frame sequence is a key video frame of a scene;
a second obtaining unit 808, configured to sequentially obtain feature similarities between each key video frame and a reference video frame located before the key video frame;
a merging and updating unit 810, configured to merge a scene video frame sequence in a first scene where the key video frame is located into a scene video frame sequence in a second scene where the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list, when the feature similarity meets a merging condition;
and a division processing unit 812, configured to perform division processing on the target video according to the second scene video frame list.
For a specific embodiment, reference may be made to the above-mentioned embodiment of the video processing method, and details are not described herein again in this embodiment.
According to the embodiment provided by the application, after the target video is divided by fusing a plurality of image features to obtain the first scene video frame list, the scene relevance among all the video frames in the target video is analyzed by combining the feature similarity of the video frames, so that the scene characteristics of the video frames are comprehensively and finely analyzed without being limited to the analysis result of a single feature, the accuracy of video segmentation processing is improved, and the problem of low accuracy of video segmentation processing in the related technology is solved.
According to another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the video processing method, where the electronic device may be a terminal device or a server shown in fig. 1. The present embodiment takes the electronic device as a server as an example for explanation. As shown in fig. 9, the electronic device comprises a memory 902 and a processor 904, the memory 902 having stored therein a computer program, the processor 904 being arranged to perform the steps of any of the above-described method embodiments by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a target video to be processed;
s2, sequentially extracting the features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frames;
s3, dividing all video frames in the target video according to the image feature set to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes contained in the target video are recorded in the first scene video frame list, and the first video frame in each scene video frame sequence is a key video frame of the scene;
s4, sequentially acquiring the feature similarity between each key video frame and a reference video frame positioned before the key video frame;
s5, merging the scene video frame sequence in the first scene where the key video frame is located into the scene video frame sequence in the second scene where the reference video frame is located under the condition that the feature similarity reaches the merging condition, so as to update the first scene video frame list into a second scene video frame list;
s6, the target video is divided according to the second scene video frame list.
Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 9 is only an illustration, and the electronic device may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 9 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 9, or have a different configuration from that shown in fig. 9.
The memory 902 may be used to store software programs and modules, such as program instructions/modules corresponding to the video processing method and apparatus in the embodiments of the present invention, and the processor 904 executes various functional applications and data processing by running the software programs and modules stored in the memory 902, so as to implement the video processing method described above. The memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 902 may further include memory located remotely from the processor 904, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 902 may be specifically, but not limited to, used for storing the target video and intermediate information in the processing process, such as information like a scene video frame list. As an example, as shown in fig. 9, the memory 902 may include, but is not limited to, a first obtaining unit 802, a first extracting unit 804, a dividing unit 806, a second obtaining unit 808, a merging updating unit 810, and a dividing processing unit 812 in the video processing apparatus. In addition, the video processing apparatus may further include, but is not limited to, other module units in the video processing apparatus, which is not described in this example again.
Optionally, the transmission device 906 is used for receiving or sending data via a network. Examples of the network may include wired networks and wireless networks. In one example, the transmission device 906 includes a network adapter (NIC) that can be connected to a router via a network cable and to other network devices so as to communicate with the internet or a local area network. In another example, the transmission device 906 is a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In addition, the electronic device further includes: and a connection bus 910 for connecting the module components in the electronic device.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system formed by connecting a plurality of nodes through network communication. The nodes may form a peer-to-peer (P2P) network, and any form of computing device, such as a server, a terminal, or another electronic device, can become a node in the blockchain system by joining the peer-to-peer network.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the video processing method. Wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a target video to be processed;
s2, sequentially extracting the features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frames;
s3, dividing all video frames in the target video according to the image feature set to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes contained in the target video are recorded in the first scene video frame list, and the first video frame in each scene video frame sequence is a key video frame of the scene;
s4, sequentially acquiring the feature similarity between each key video frame and a reference video frame positioned before the key video frame;
s5, merging the scene video frame sequence in the first scene where the key video frame is located into the scene video frame sequence in the second scene where the reference video frame is located under the condition that the feature similarity reaches the merging condition, so as to update the first scene video frame list into a second scene video frame list;
s6, the target video is divided according to the second scene video frame list.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (13)

1. A video processing method, comprising:
acquiring a target video to be processed;
sequentially extracting features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frames;
dividing all video frames in the target video according to the image feature set to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes contained in the target video are recorded in the first scene video frame list, and a first video frame in each scene video frame sequence is a key video frame of the scene;
sequentially acquiring the feature similarity between each key video frame and a reference video frame positioned in front of the key video frame;
under the condition that the feature similarity reaches a merging condition, merging a scene video frame sequence in a first scene where the key video frame is located into a scene video frame sequence in a second scene where the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list;
and carrying out segmentation processing on the target video according to the second scene video frame list.
2. The method of claim 1, wherein the sequentially performing feature extraction on the video frames in the target video to obtain an image feature set corresponding to each video frame comprises:
sequentially taking each video frame in the target video as a current video frame to execute the following feature extraction operations until all video frames in the target video are traversed:
mapping each pixel point in the current video frame into a target color coding space to obtain a parameter value of each image color component parameter of each pixel point in the target color coding space, wherein the target color coding space comprises at least two image color component parameters;
and determining the image feature set matched with the current video frame according to the parameter value of the image color component parameter of each pixel point.
3. The method of claim 2, wherein the determining the set of image features matching the current video frame according to the parameter values of the image color component parameters of the respective pixel points comprises:
and obtaining an average value of parameter values of ith image color component parameters of each pixel point to obtain ith image characteristics of the current video frame, wherein i is an integer which is greater than or equal to 1 and less than or equal to N, N is the number of the image color component parameters in the target color coding space, and N is a positive integer.
4. The method of claim 1, wherein the dividing all video frames in the target video according to the image feature set to obtain a first scene video frame list comprises:
acquiring the mean value of each image feature in the image feature set, and taking the mean value of the image features as a target image feature matched with the video frame;
sequentially comparing the target image characteristics corresponding to the two adjacent video frames to obtain a comparison result;
and dividing all video frames according to the comparison result to obtain the first scene video frame list.
5. The method of claim 4,
the sequentially comparing the target image characteristics corresponding to the two adjacent video frames to obtain a comparison result comprises: acquiring a characteristic difference value of a target image characteristic of a j +1 th video frame and a target image characteristic of a j video frame, wherein j is an integer which is greater than or equal to 1 and less than or equal to M-1, and M is the number of video frames in the target video; comparing the characteristic difference value with a first threshold value to obtain a comparison result;
the dividing all the video frames according to the comparison result to obtain the first scene video frame list comprises: under the condition that the comparison result indicates that the characteristic difference value is smaller than the first threshold value, determining that the j +1 th video frame and the j th video frame are the same scene, and adding the j +1 th video frame to a scene video frame sequence where the j th video frame is located; and under the condition that the comparison result indicates that the characteristic difference value is larger than or equal to the first threshold value, determining that the j +1 th video frame and the j th video frame are not in the same scene, and creating a new scene video frame sequence for the j +1 th video frame.
6. The method according to claim 1, wherein said sequentially obtaining feature similarities between each of said key video frames and a reference video frame preceding said key video frame comprises:
acquiring a key characteristic vector of the key video frame and a reference characteristic vector of the reference video frame;
obtaining a cosine distance between the key feature vector and the reference feature vector, wherein the feature similarity comprises the cosine distance;
acquiring matched feature points in the key video frame and the reference video frame;
and acquiring a first proportion of the matched feature points in the key video frame and a second proportion of the matched feature points in the reference video frame, wherein the feature similarity comprises the first proportion and the second proportion.
7. The method of claim 6, wherein the obtaining the key feature vector of the key video frame and the reference feature vector of the reference video frame comprises:
respectively preprocessing the key video frame and the reference video frame to obtain a candidate key video frame and a candidate reference video frame;
inputting the candidate key video frame into a lightweight convolutional neural network to obtain the key feature vector, and inputting the candidate reference video frame into the lightweight convolutional neural network to obtain the reference feature vector, wherein the lightweight convolutional neural network is a neural network for generating feature vectors of images, which is obtained after performing machine training by using a plurality of groups of sample image pairs and corresponding label information, each group of sample image pairs in the plurality of groups of sample image pairs comprises a first frame image in a first sample scene and a last frame image in a second sample scene in a sample video, the second sample scene is adjacent to the first sample scene and is positioned in front of the first sample scene, and the label information comprises a scene label of the first frame image and a scene label of the last frame image.
8. The method according to claim 7, further comprising, before said obtaining the target video to be processed:
acquiring a plurality of sample videos, and extracting the plurality of groups of image pairs in each sample video;
taking each group of image pairs as a current group of image pairs, and executing the following operations until reaching the convergence condition of the lightweight convolutional neural network:
inputting a first frame image in the first sample scene in the current group of image pairs into a first training convolutional neural network to obtain a first feature vector, and inputting a last frame image in the second sample scene in the current group of image pairs into a second training convolutional neural network to obtain a second feature vector, wherein a twin network structure is used in the training of the lightweight convolutional neural network, the twin network structure comprises the first training convolutional neural network and the second training convolutional neural network, and the first training convolutional neural network and the second training convolutional neural network share a training weight;
obtaining a cosine distance between the first feature vector and the second feature vector, and taking the cosine distance between the first feature vector and the second feature vector as a feature distance;
inputting the characteristic distance and the label information into a loss function to calculate to obtain a current loss value;
obtaining a loss value difference value of the current loss value and a last loss value of the current loss value;
and under the condition that the loss value difference value indicates that the twin network structure reaches the convergence condition, taking the first training convolutional neural network or the second training convolutional neural network which is currently trained as the lightweight convolutional neural network.
9. The method of claim 6, wherein the obtaining the matching feature points in the key video frame and the reference video frame comprises:
converting the key video frame into a key video frame gray image, and converting the reference video frame into a reference video frame gray image;
extracting a key feature point set from the key video frame gray image and a reference feature point set from the reference video frame gray image by adopting a scale-invariant feature transform operator;
and comparing the key characteristic point set with the reference characteristic point set to obtain the matching characteristic points.
10. The method according to claim 6, further comprising, after said sequentially obtaining feature similarities between each of said key video frames and a reference video frame preceding said key video frame:
determining that the feature similarity reaches the merging condition when the cosine distance is greater than a second threshold;
determining that the feature similarity reaches the merging condition under the condition that the cosine distance is less than or equal to the second threshold, the first proportion is greater than a third threshold, and the second proportion is greater than a fourth threshold;
if the cosine distance is less than or equal to the second threshold and the first proportion is less than or equal to the third threshold, determining that the feature similarity does not reach the merging condition, and reserving a scene video frame sequence in the first scene where the key video frame is located;
and under the condition that the cosine distance is less than or equal to the second threshold and the second proportion is less than or equal to the fourth threshold, determining that the feature similarity does not reach the merging condition, and reserving the scene video frame sequence in the first scene where the key video frame is positioned.
11. A video processing apparatus, comprising:
the first acquisition unit is used for acquiring a target video to be processed;
the first extraction unit is used for sequentially extracting features of each video frame in the target video to obtain an image feature set corresponding to each video frame, wherein the image feature set comprises at least two image features of the video frames;
the dividing unit is used for dividing all video frames in the target video according to the image feature set to obtain a first scene video frame list, wherein scene video frame sequences respectively corresponding to a plurality of scenes contained in the target video are recorded in the first scene video frame list, and a first video frame in each scene video frame sequence is a key video frame of the scene;
the second acquisition unit is used for sequentially acquiring the feature similarity between each key video frame and a reference video frame positioned in front of the key video frame;
a merging updating unit, configured to merge a scene video frame sequence in a first scene where the key video frame is located into a scene video frame sequence in a second scene where the reference video frame is located, so as to update the first scene video frame list into a second scene video frame list, when the feature similarity meets a merging condition;
and the segmentation processing unit is used for carrying out segmentation processing on the target video according to the second scene video frame list.
12. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 10.
13. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 10 by means of the computer program.
CN202010858714.1A 2020-08-24 2020-08-24 Video processing method and device, storage medium and electronic equipment Active CN111950653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010858714.1A CN111950653B (en) 2020-08-24 2020-08-24 Video processing method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010858714.1A CN111950653B (en) 2020-08-24 2020-08-24 Video processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111950653A true CN111950653A (en) 2020-11-17
CN111950653B CN111950653B (en) 2021-09-10

Family

ID=73360073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010858714.1A Active CN111950653B (en) 2020-08-24 2020-08-24 Video processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111950653B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227852A (en) * 2014-06-11 2016-01-06 南京理工大学 Based on the automatic explosion method of color weighting
CN108537134A (en) * 2018-03-16 2018-09-14 北京交通大学 A kind of video semanteme scene cut and mask method
US20200195934A1 (en) * 2018-12-14 2020-06-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for processing a video
CN110188782A (en) * 2019-06-11 2019-08-30 北京字节跳动网络技术有限公司 Image similarity determines method, apparatus, electronic equipment and readable storage medium storing program for executing
CN110798735A (en) * 2019-08-28 2020-02-14 腾讯科技(深圳)有限公司 Video processing method and device and electronic equipment
CN110598014A (en) * 2019-09-27 2019-12-20 腾讯科技(深圳)有限公司 Multimedia data processing method, device and storage medium
CN111079785A (en) * 2019-11-11 2020-04-28 深圳云天励飞技术有限公司 Image identification method and device and terminal equipment

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114731448A (en) * 2020-11-23 2022-07-08 深圳市大疆创新科技有限公司 Video similarity processing method and device, mobile terminal and storage medium
CN112258541A (en) * 2020-11-26 2021-01-22 携程计算机技术(上海)有限公司 Video boundary detection method, system, device and storage medium
CN113139419A (en) * 2020-12-28 2021-07-20 西安天和防务技术股份有限公司 Unmanned aerial vehicle detection method and device
CN113139419B (en) * 2020-12-28 2024-05-31 西安天和防务技术股份有限公司 Unmanned aerial vehicle detection method and device
CN113709559A (en) * 2021-03-05 2021-11-26 腾讯科技(深圳)有限公司 Video dividing method and device, computer equipment and storage medium
CN113709584A (en) * 2021-03-05 2021-11-26 腾讯科技(北京)有限公司 Video dividing method, device, server, terminal and storage medium
CN113709559B (en) * 2021-03-05 2023-06-30 腾讯科技(深圳)有限公司 Video dividing method, device, computer equipment and storage medium
WO2022198383A1 (en) * 2021-03-22 2022-09-29 Qualcomm Incorporated Methods and apparatus for saliency based frame color enhancement
CN115134656A (en) * 2021-03-26 2022-09-30 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and medium
CN113470123A (en) * 2021-05-08 2021-10-01 广东观止文化网络科技有限公司 Video toning method and device, storage medium and shooting equipment
CN113301408A (en) * 2021-05-21 2021-08-24 北京大米科技有限公司 Video data processing method and device, electronic equipment and readable storage medium
CN113301408B (en) * 2021-05-21 2023-01-10 北京大米科技有限公司 Video data processing method and device, electronic equipment and readable storage medium
CN113435328A (en) * 2021-06-25 2021-09-24 上海众源网络有限公司 Video clip processing method and device, electronic equipment and readable storage medium
CN113435328B (en) * 2021-06-25 2024-05-31 上海众源网络有限公司 Video clip processing method and device, electronic equipment and readable storage medium
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113610016A (en) * 2021-08-11 2021-11-05 人民中科(济南)智能技术有限公司 Training method, system, equipment and storage medium of video frame feature extraction model
CN113610016B (en) * 2021-08-11 2024-04-23 人民中科(济南)智能技术有限公司 Training method, system, equipment and storage medium for video frame feature extraction model
CN113807173A (en) * 2021-08-12 2021-12-17 北京工业大学 Construction and labeling method and application system of lane line data set
CN113779303A (en) * 2021-11-12 2021-12-10 腾讯科技(深圳)有限公司 Video set indexing method and device, storage medium and electronic equipment
WO2023207348A1 (en) * 2022-04-29 2023-11-02 华为技术有限公司 Video frame processing method and device, and video system, medium and chip
CN114827665A (en) * 2022-05-31 2022-07-29 北京奇艺世纪科技有限公司 Video analysis method, device, equipment and storage medium
CN114827665B (en) * 2022-05-31 2023-10-10 北京奇艺世纪科技有限公司 Video analysis method, device, equipment and storage medium
CN116543147A (en) * 2023-03-10 2023-08-04 武汉库柏特科技有限公司 Carotid ultrasound image segmentation method, device, equipment and storage medium
CN117278776A (en) * 2023-04-23 2023-12-22 青岛尘元科技信息有限公司 Multichannel video content real-time comparison method and device, equipment and storage medium
CN117115155A (en) * 2023-10-23 2023-11-24 江西拓世智能科技股份有限公司 Image analysis method and system based on AI live broadcast
CN117132926A (en) * 2023-10-27 2023-11-28 腾讯科技(深圳)有限公司 Video processing method, related device, equipment and storage medium
CN117132926B (en) * 2023-10-27 2024-02-09 腾讯科技(深圳)有限公司 Video processing method, related device, equipment and storage medium

Also Published As

Publication number Publication date
CN111950653B (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN111950653B (en) Video processing method and device, storage medium and electronic equipment
CN112565825B (en) Video data processing method, device, equipment and medium
CN108647245B (en) Multimedia resource matching method and device, storage medium and electronic device
CN103226589B (en) The compact global characteristics obtaining image describes method and the image search method of son
CN108182421B (en) Video segmentation method and device
Ma et al. No-reference retargeted image quality assessment based on pairwise rank learning
CN113762280A (en) Image category identification method, device and medium
CN106575280B (en) System and method for analyzing user-associated images to produce non-user generated labels and utilizing the generated labels
CN110807757A (en) Image quality evaluation method and device based on artificial intelligence and computer equipment
CN112001274A (en) Crowd density determination method, device, storage medium and processor
CN113779303B (en) Video set indexing method and device, storage medium and electronic equipment
CN105957124B (en) With the natural image color edit methods and device for repeating situation elements
CN110852940A (en) Image processing method and related equipment
CN114139013A (en) Image searching method and device, electronic equipment and computer readable storage medium
CN113569610A (en) Video content identification method and device, storage medium and electronic equipment
CN113051984A (en) Video copy detection method and apparatus, storage medium, and electronic apparatus
CN112561976A (en) Image dominant color feature extraction method, image retrieval method, storage medium and device
CN114627244A (en) Three-dimensional reconstruction method and device, electronic equipment and computer readable medium
CN117115070A (en) Image evaluation method, apparatus, device, storage medium, and program product
CN113822427A (en) Model training method, image matching device and storage medium
CN110321452B (en) Image retrieval method based on direction selection mechanism
EP2707836A1 (en) System and method for compact descriptor for visual search
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
CN116958730A (en) Training method and device of image recognition model, storage medium and electronic equipment
CN110826545A (en) Video category identification method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20221201

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518100

Patentee after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.