CN113923504A - Video preview moving picture generation method and device - Google Patents

Video preview moving picture generation method and device

Info

Publication number
CN113923504A
CN113923504A (application CN202111454801.1A)
Authority
CN
China
Prior art keywords
video
determining
image
preview
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111454801.1A
Other languages
Chinese (zh)
Other versions
CN113923504B (en)
Inventor
何永继
丁建栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202111454801.1A
Publication of CN113923504A
Application granted
Publication of CN113923504B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video preview moving picture processing method and device. The method comprises: dividing a video to be processed into a plurality of video segments; scoring the quality of each video frame based on its subtitle data, audio data, and image data; determining the key frame of each video segment according to the quality scores of its video frames; and generating a preview moving picture of the video to be processed from the key frames of the plurality of video segments. With this processing approach, key video frames are selected per video segment, each video frame is evaluated comprehensively on the basis of its multi-modal data, and the generated preview moving picture contains key frames from multiple video segments, so the preview moving picture covers the video content more comprehensively and its key frames are more representative of the video.

Description

Video preview moving picture generation method and device
Technical Field
The application relates to the technical field of image processing, and in particular to a method, an apparatus, a system, and an electronic device for generating a video preview moving picture.
Background
Short videos are characterized by concise content and dense points of appeal for viewers. Publishing video clips on the Internet in various forms has become a common marketing and promotion means for enterprises. Among the vast number of short videos, attracting users' attention and generating clicks is crucial for short-video distribution and traffic acquisition. A common approach is to extract a representative segment from the content of a short video and use it as the cover image when the video is published, attracting users' attention and thereby improving the video's exposure and click-through rate.
At present, video summarization techniques are mainly used to generate preview moving pictures for short videos. One approach is clustering: the video is first segmented, the segments are grouped with a clustering algorithm, and a representative frame is selected from each cluster and the frames are spliced together. Another approach is key-frame text summarization: the video is divided into a number of key frames, which may be selected according to user preference, and the subtitles of the key frames are extracted and combined into a video summary. A third approach is deep learning: moving targets in the video are detected, their motion trajectories are obtained with target tracking, the targets are classified with deep learning, and the moving targets are then composited into a video snapshot.
However, in the course of implementing the invention, the inventors found that the existing schemes have at least the following problems: 1) the clustering approach cannot select the key information in a video; it merely retains frames of different types and cannot extract the video clip of a particular shot that interests the user; 2) the text summarization approach only produces a text summary, which has limited appeal to users; 3) the deep learning approach suits moving-target retrieval in the surveillance field and is not suitable for short-video summarization. In summary, the conventional technology cannot generate a highly representative short-video preview moving picture.
Disclosure of Invention
The application provides a video preview moving picture processing method to solve the problem that the prior art cannot generate a sufficiently representative video preview moving picture. The application further provides a video preview moving picture processing apparatus and system, and an electronic device.
The application provides a video preview motion picture processing method, which comprises the following steps:
acquiring the video to be processed;
dividing the video to be processed into a plurality of video segments;
acquiring caption text, image data and audio data corresponding to the video segment;
determining the quality scores of video frames in the video segment corresponding to the dimensions of subtitles, images and audio according to the subtitle texts, the image data and the audio data;
determining a key frame of the video segment according to the quality score of the video frame;
and generating a preview moving picture of the video to be processed according to a plurality of key frames corresponding to the plurality of video segments.
Optionally, the dividing the video to be processed into a plurality of video segments includes:
and dividing the video to be processed into a plurality of shot fragments.
Optionally, the generating a preview motion picture of a video to be processed according to a plurality of key frames corresponding to the plurality of video segments includes:
forming a key frame segment of the shot segment according to the adjacent video frames of the key frame;
and according to the shot sequence, forming the key frame fragments into the preview motion picture.
Optionally, the determining a key frame of the video segment according to the quality score of the video frame includes:
determining the number of key frames of the shot according to the duration of the shot;
and selecting key frames from the shot fragments according to the quality scores and the number of the key frames.
Optionally, the generating a preview motion picture of a video to be processed according to a plurality of key frames corresponding to the plurality of video segments includes:
encoding the key frame in a different encoding format than the video to be processed;
and generating the preview motion picture according to the coded key frame.
Optionally, the method further includes:
and issuing the video to be processed and the preview moving picture.
Optionally, the determining, according to the subtitle text, the image data, and the audio data, a quality score of a video frame in the video segment corresponding to a subtitle dimension, an image dimension, and an audio dimension includes:
determining an image quality score, a video main body expression degree and an audio score of the video frame according to the subtitle text, the image data and the audio data corresponding to the video frame;
and determining the quality score according to the image quality score, the video subject expression degree and the audio score.
Optionally, the method further includes:
determining at least one video main body according to the subtitle text;
and determining the weight of the video main body according to the occurrence frequency of the video main body in the caption text.
Optionally, the determining the quality score according to the image quality score, the video subject expression level, and the audio score includes:
determining a plurality of sets of weight combinations of a first weight corresponding to the image quality score, a second weight corresponding to the video subject expressiveness, and a third weight corresponding to the audio score;
determining the quality score under each set of weight combinations;
determining key frames under each group of weight combination according to the quality scores under each group of weight combination;
and generating a plurality of preview motion pictures of the video to be processed according to the key frames under the plurality of groups of weight combinations.
Optionally, the method further includes:
dividing the plurality of preview moving pictures into a plurality of classes through a clustering algorithm;
and selecting a target preview moving picture from each class.
Optionally, the audio score is determined as follows:
determining an energy value of the audio data;
determining the audio score according to the energy value.
Optionally, the video subject expression level is determined by the following method:
acquiring the area ratio and/or the image position of a video main body in a video frame image;
acquiring the occurrence frequency of a video main body in the subtitle text;
and determining the expression degree according to the area ratio, the image position and/or the occurrence frequency.
Optionally, the image quality score is determined by the following method:
determining video main body weights corresponding to all color areas of the video frame image according to the image data;
and determining the image aesthetic degree as the image quality score according to at least the weight.
Optionally, the method further includes:
acquiring an audio clip corresponding to the video frames, wherein a plurality of video frames correspond to one audio clip;
the determining, according to the subtitle text, the image data, and the audio data, a quality score of a video frame in the video segment corresponding to a subtitle dimension, an image dimension, and an audio dimension includes:
and determining the quality score of the video frame according to the subtitle text, the image data and the audio clip corresponding to the video frame.
The present application also provides a video preview moving picture processing apparatus, including:
the video acquisition unit is used for acquiring the video to be processed;
the video dividing unit is used for dividing the video to be processed into a plurality of video segments;
the multi-mode data acquisition unit is used for acquiring subtitle texts, image data and audio data corresponding to the video segments;
the video frame scoring unit is used for determining the quality scores of the video frames in the video segments corresponding to the dimensions of the subtitles, the images and the audio according to the subtitle texts, the image data and the audio data;
a key frame determining unit, configured to determine a key frame of the video segment according to the quality score of the video frame;
and the dynamic image generating unit is used for generating a preview dynamic image of the video to be processed according to the plurality of key frames corresponding to the plurality of video segments.
The present application further provides an electronic device, comprising:
a processor and a memory, the memory storing a program for implementing the above method; after the device is powered on, the processor runs the program for the method.
The present application also provides a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the method has the following advantages:
according to the video preview moving picture processing method provided by the embodiment of the application, a video to be processed is divided into a plurality of video segments; performing quality scoring on the video frame based on subtitle data, audio data and image data of the video frame; determining key frames of all video segments according to the quality scores of the video frames; and generating a preview moving picture of the video to be processed according to a plurality of key frames corresponding to a plurality of video segments. By adopting the processing mode, the key video frames are selected by taking the video segments as units, each video frame is comprehensively considered based on the multi-mode data of the video frames, and the generated preview dynamic image comprises the key frames of a plurality of video segments, so that the content of the preview dynamic image is more comprehensive, and the representativeness of the key frames in the preview dynamic image to the video is stronger.
Drawings
Fig. 1 is a schematic view of an application scenario of an embodiment of a video preview moving picture processing method provided by the present application;
FIG. 2 is a schematic flowchart of an embodiment of a video preview moving picture processing method provided by the present application;
fig. 3 is a schematic flowchart of an embodiment of a video preview moving picture processing method provided in the present application;
FIG. 4 is a key frame processing diagram of an embodiment of a video preview motion picture processing method provided by the present application;
fig. 5 is a schematic diagram of a short video release according to an embodiment of a video preview moving picture processing method provided by the present application;
fig. 6 is a schematic diagram of an embodiment of a video preview motion picture processing apparatus provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; it is therefore not limited to the specific implementations disclosed below.
The application provides a video preview motion picture generation method, a video preview motion picture generation device and video preview motion picture generation equipment. In order to facilitate intuitive understanding of the video preview motion picture processing method provided by the present application, an application scenario thereof is briefly described below, and then various schemes are described in detail one by one in each embodiment.
Please refer to fig. 1, which is a schematic view of an application scenario of an embodiment of the video preview moving picture processing method provided in the present application. In this embodiment, the video to be processed is a short video. The main publishing workflow is as follows: the owner of the short video produces a video according to a creative idea, publishes it to a website such as a live-streaming platform, an e-commerce platform, a social platform, a video platform, or a short-video platform, and can then track the publishing effect. A consumer user watches short videos through a user device, and usually finds interesting short videos by browsing the preview images displayed on the platform. With the method provided by the application, representative segments are automatically extracted from the content of a short video and used as the cover moving picture when the video is published, attracting users' attention and improving the click-through rate.
First embodiment
Please refer to fig. 2, which is a flowchart illustrating an embodiment of a video preview motion picture generating method according to the present application. In this embodiment, the method may include the steps of:
step S201: and acquiring a video to be processed.
The video to be processed includes, but is not limited to, a short video; it may also be a longer video. The short video may be a product-promotion short video, an advertising short video, a teaching short video, a drama short video, a comedy short video, and the like.
Taking the product-promotion short video as an example, many kinds of goods can be sold through such videos, ranging from daily necessities such as rice, oil, and salt to big-ticket items such as automobile insurance.
Taking the advertising short video as an example, to better show the characteristics of a product, a merchant or manufacturer may choose short video for presentation. Although its interactivity and immediacy are not as high as live streaming, it saves part of the consultation process compared with static pictures, which helps close the final order. Purchasing suitable product promotion slots in combination with short videos therefore helps increase the reach of a product and facilitates the final conversion.
In this embodiment, the owner user of the short video produces a video according to the video creative, and generates a preview moving picture for the short video through the method.
Step S203: and dividing the video to be processed into a plurality of video segments.
According to the method provided by the embodiment of the application, the video to be processed is divided into a plurality of video segments, and the video segments are taken as units to select representative frames for splicing, so that the previewing motion picture is obtained.
In one example, step S203 can be implemented as follows: the video to be processed is divided into a plurality of shot segments. In terms of content composition, a video consists of video clips of multiple scene shots. A shot is a continuous picture in the video; the video clip of one shot may comprise multiple video frames, and frames within the same shot generally change little and have similar sound effects. By dividing the video into multiple shot segments, the generated preview moving picture contains a representative frame of every shot, making its content more comprehensive. In specific implementation, a shot segmentation algorithm can be used to divide the video to be processed into different shot segments.
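For illustration only (this sketch is not part of the patent disclosure), shot segmentation as mentioned above can be approximated with a simple histogram-difference cut detector; the bin count and threshold here are arbitrary assumptions, and a production system would use a dedicated shot-boundary algorithm:

```python
def detect_shot_boundaries(frames, bins=32, threshold=0.5):
    """Split a frame sequence into shot segments by thresholding the
    L1 distance between normalized gray-level histograms of
    consecutive frames.

    frames: list of frames, each a flat list of pixel values in 0..255.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    def hist(frame):
        h = [0] * bins
        for p in frame:
            h[p * bins // 256] += 1
        n = float(len(frame))
        return [c / n for c in h]

    hists = [hist(f) for f in frames]
    cuts = [0]
    for i in range(1, len(hists)):
        # A large histogram jump suggests a cut between two scene shots.
        diff = sum(abs(a - b) for a, b in zip(hists[i], hists[i - 1]))
        if diff > threshold:
            cuts.append(i)
    cuts.append(len(frames))
    return [(cuts[k], cuts[k + 1]) for k in range(len(cuts) - 1)]
```

For example, five dark frames followed by five bright frames would be split into two shot segments, `(0, 5)` and `(5, 10)`.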
In specific implementation, the video can be divided according to the time length, for example, each video segment is 30 seconds; the video may also be divided into a fixed number of video segments, such as into 10 video segments.
Step S205: and acquiring subtitle text, image data and audio data corresponding to the video segment.
A video is composed of still pictures, which are referred to as video frames. Each video frame has a corresponding image, subtitle, and audio, collectively called the multi-modal data of the video frame. According to the method provided by the embodiment of the application, each video frame is evaluated comprehensively on the basis of its multi-modal data, and the key frame of each video segment is selected according to the video frame quality obtained from this multi-dimensional evaluation, so that the key frames better express the key information of the short video and are more representative of their video segments.
The video to be processed is usually an audio-video file containing both images and sound; the audio information in the video can be extracted with an audio extractor. Since audio extraction is mature prior art, it is not described here.
A video frame is a single instant while audio comes in segments, so the audio and video frames are usually aligned in time. In one example, the audio data corresponding to a video frame can be obtained as follows: acquire the audio segment corresponding to a video frame of the video to be processed, where multiple video frames correspond to one audio segment. Since each video frame is scored separately while the audio is segmented, the timestamp of a video frame can be mapped to an audio segment, with multiple consecutive video frames sharing the same segment.
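The frame-to-audio-segment alignment described above amounts to a timestamp mapping. A minimal sketch (the function name and fixed-length segmentation are assumptions for illustration, not taken from the disclosure):

```python
def audio_segment_for_frame(frame_index, fps, segment_duration):
    """Map a video frame to the index of the audio segment covering it.

    Each video frame is a single instant while audio is scored per
    segment, so many consecutive frames share one audio segment.
    """
    frame_time = frame_index / fps           # timestamp of the frame in seconds
    return int(frame_time // segment_duration)
```

At 25 fps with 2-second audio segments, frames 0 through 49 all map to segment 0, and frame 50 is the first to map to segment 1.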
In one example, the subtitle text can be obtained as follows: recognize the first subtitle text in the video frame, i.e., the subtitles displayed in the video frame picture; and take the speech transcription text of the audio as the second subtitle text. In specific implementation, optical character recognition (OCR) can be used to extract the subtitles from the video frame picture, and automatic speech recognition (ASR) can be used to extract the speech transcription text of the video to be processed as subtitle text. In this way, more comprehensive subtitle information can be obtained.
Step S207: and determining the quality scores of the video frames in the video segment corresponding to the dimensions of the subtitles, the images and the audio according to the subtitle texts, the image data and the audio data.
The method provided by the embodiment of the application can comprehensively consider the influence of sound effects, images, and subtitles in the video, and score the video frames accordingly to obtain their quality scores. In specific implementation, the video frames may be scored according to the video frame images and the subtitle text; or according to the video frame images and the audio data; or according to the video frame images, the audio data, and the subtitle text.
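As a sketch of the multi-dimensional scoring (mirroring the first/second/third weights mentioned in the disclosure's optional claims; the particular weight values and function name below are illustrative assumptions, not taken from the patent):

```python
def frame_quality_score(image_score, subject_expr, audio_score,
                        weights=(0.4, 0.4, 0.2)):
    """Combine the three per-dimension scores of a video frame into a
    single quality score as a weighted sum.

    image_score:  image quality (aesthetics) score
    subject_expr: video subject expression degree
    audio_score:  audio score of the frame's audio segment
    weights:      one set of the weight combinations the disclosure
                  later evaluates (values here are arbitrary).
    """
    w1, w2, w3 = weights
    return w1 * image_score + w2 * subject_expr + w3 * audio_score
```

The disclosure's variant that evaluates several weight combinations would simply call this once per combination and keep a key frame per combination.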
In one example, step S207 may include the following sub-steps:
step S2071: and determining the image quality score, the video main body expression degree and the audio score of the video frame according to the image data, the audio data and the subtitle text corresponding to the video frame.
In practical applications, a short video usually has a video subject, i.e., the main object the video focuses on; this is especially evident in product-promotion short videos. For example, in a product-promotion short video, a product or brand is typically identified in the subtitles.
The expression degree of the video subject comprises the expression degree of the video subject in the video frame image, and can be determined by adopting the following steps: 1) acquiring the area ratio and/or the image position of a video main body in a video frame image; 2) and determining the expression degree according to the area ratio and/or the image position.
1) The area ratio and/or the image position of the video main body in the video frame image are/is obtained.
When determining the expression degree of the video subject, the area ratio of the subject in the video frame image and the position of the subject in the image can be detected. An area ratio that is too large or too small (the latter being easily overlooked) affects the quality of the video frame image, and the position of the subject in the image also affects its expression degree: the closer to the center, the better the expression, and vice versa. Furthermore, if the subject lies at the edge of the image or extends beyond it, the effect is even worse.
2) And determining the expression degree according to the area ratio and/or the image position.
In practical application, the expression degree can be determined according to the area ratio, or the expression degree can be determined according to the image position, or the expression degree can be determined according to the area ratio and the image position together.
In particular, the expression degree of a subject can be determined as follows. First, calculate the area ratio of the subject region in the video frame image (denoted si). Then, calculate the minimum distances between the subject edge and the four borders of the video frame image, namely the heights h1, h2 and the widths w1, w2, and compute εh = min(h1, h2)/max(h1, h2) and εw = min(w1, w2)/max(w1, w2) as parameter values of the expression degree of the video subject. Here h1 may represent the distance between the upper edge of the subject and the upper border of the image, h2 the distance between the lower edge of the subject and the lower border, w1 the distance between the left edge of the subject and the left border, and w2 the distance between the right edge of the subject and the right border; εh represents the vertical relative position of the subject in the image and εw its horizontal relative position, together expressing how far the subject is offset from the center of the picture. Finally, compute εh · εw · si as the value of the expression degree of the video subject. Evidently, the further the subject deviates from the center of the picture and the smaller its area, the lower the expression degree of the video subject.
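The εh · εw · si computation above can be sketched directly from a subject bounding box (the function name, box convention, and the guard for a box touching both borders are assumptions made for illustration):

```python
def subject_expressiveness(img_w, img_h, box):
    """Compute the subject expression degree eps_h * eps_w * s_i.

    box = (x0, y0, x1, y1): bounding box of the detected subject,
    in pixels within an img_w x img_h frame. s_i is the subject's
    area ratio; eps_h and eps_w measure how centered it is.
    """
    x0, y0, x1, y1 = box
    s_i = ((x1 - x0) * (y1 - y0)) / float(img_w * img_h)  # area ratio si
    h1, h2 = y0, img_h - y1          # distances to top/bottom borders
    w1, w2 = x0, img_w - x1          # distances to left/right borders
    eps_h = min(h1, h2) / max(h1, h2) if max(h1, h2) > 0 else 1.0
    eps_w = min(w1, w2) / max(w1, w2) if max(w1, w2) > 0 else 1.0
    return eps_h * eps_w * s_i
```

A perfectly centered box scores its full area ratio, while a box flush against a border scores 0, matching the intuition that an off-center, small subject is poorly expressed.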
In this embodiment, before determining the expression level of the video subject, the following steps may be further included: 1) determining a main body of a video to be processed; 2) a subject in a video frame image is identified.
1) The subject of the video to be processed is determined.
In specific implementation, the video subject can be designated by a user, and the subject can be automatically identified through an algorithm.
Subtitles are an important information source for short videos and usually contain subject information, such as the brand name and garment name appearing in the subtitles of a product-promotion short video for a certain clothing brand. According to the method provided by the embodiment of the application, the subject information of the short video is extracted from the subtitle text, and different weights can be given to video frames according to the relation between the subjects and the video frame images, so that important video frames can be extracted effectively.
In specific implementation, the entity words in all subtitles of the video to be processed can be extracted and their occurrence counts tallied; then the most frequent entity words are selected, e.g., those ranked in the top three by occurrence count, and each is given a different weight θi, where θi can be the number of occurrences of that entity word in all subtitles divided by the total number of occurrences of the entity words in all subtitles.
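A minimal sketch of the θi weighting (illustrative only; normalizing over the selected top-k words' occurrences, rather than over all entity words, is an assumption, and entity-word extraction itself is left to an NLP tool):

```python
from collections import Counter

def subject_weights(entity_words, top_k=3):
    """Derive subject weights theta_i from the entity words extracted
    from all subtitles: keep the top_k most frequent words and weight
    each by its share of the selected words' total occurrences.

    entity_words: flat list of entity-word occurrences.
    Returns {word: theta}.
    """
    counts = Counter(entity_words)
    top = counts.most_common(top_k)
    total = sum(c for _, c in top)
    return {word: c / total for word, c in top}
```

For instance, with occurrences `["coat", "coat", "coat", "brand", "brand", "shoe"]` the weight of "coat" is 3/6 = 0.5.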
2) A subject in a video frame image is identified.
In specific implementation, the image of the subject can be preset, and the image of the subject can be identified from the video frame image through an image matching technology. In addition, an object detection model can be obtained from training data through learning of a machine learning algorithm, and the model can detect a main body image from the video frame image and judge the class of the main body, such as an automobile, a hair drier, clothes and the like.
For a video frame, the image quality score is a score of the frame image that focuses on its aesthetics, which can be reflected in color contrast, color richness, and picture quality.
In one example, the image quality score comprises an image quality score associated with the subject; the subject-related image quality score may be determined using the steps of: determining video main body weight corresponding to each color area of the video frame image; and determining the image aesthetic degree as an image quality score according to at least the weight.
In determining the aesthetic degree, the following may be used: for each color region, determining the aesthetic degree of that region according to the video subject weight corresponding to the region, the area ratio of the region, and the color difference between the region and its surrounding regions; and taking the sum of the aesthetic degrees of the color regions as the image quality score of the video frame.
In this embodiment, the image is first divided into regions by color, and for each color region the area ratio (denoted aj) and the color difference δj between the region and its adjacent regions are calculated, for example using the hue (H), saturation (S), and value (V) parameters of the HSV color space. Then the subject regions of the video frame are detected; each is checked against the subjects extracted from the subtitles and assigned the corresponding weight θi. Finally, the value Σ{[1 + Σ(θi | j ∈ i)] · aj · δj} is calculated and used as the aesthetic value of the video frame. Here i denotes the ith subject; (θi | j ∈ i) means θi is taken if the ith subject is present in the jth color region and 0 otherwise, and Σ(θi | j ∈ i) sums the weights of the subjects present in the jth color region. It can be seen that, for any color region, the larger its area, the larger its color difference from the surrounding regions, and the larger the total subject weight it carries, the stronger its association with the video subject, the more prominent the subject image is relative to its surroundings, and the higher the region's score. Therefore, the higher the sum of the region scores for a video frame image, the more strongly the frame is associated with the video subject, the more the subject stands out against the background, and the higher the frame's aesthetic degree.
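Taking the color segmentation and the subject detection as given, the per-frame aesthetic sum can be sketched as follows; the region representation (a dict per color region) is an illustrative assumption, not specified by the text:

```python
def frame_aesthetics(regions, subject_weights):
    """regions: one dict per color region with 'area_ratio' (aj),
    'color_diff' (δj) and 'subjects' (set of subject ids detected inside).
    subject_weights: {subject_id: θi} from the subtitle analysis.
    Returns Σ{[1 + Σ(θi | j ∈ i)] · aj · δj}."""
    score = 0.0
    for r in regions:
        theta_sum = sum(subject_weights.get(s, 0.0) for s in r["subjects"])
        score += (1 + theta_sum) * r["area_ratio"] * r["color_diff"]
    return score
```

A region that contains a weighted subject contributes (1 + θi)·aj·δj instead of aj·δj, so frames showing the subtitle-derived subjects score higher.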
In this embodiment, the effect of audio on the key information of the video is considered. Generally, in a short video, abrupt changes in the audio mark important information: whether the change is a sudden increase or a sudden decrease, it represents a clear departure from the preceding segment. Some short videos have obvious fluctuations in their audio; in particular, the climax or the parts the creator specially emphasizes are generally given special sound effects, so the video frames corresponding to these sound effects are more important than other frames. The method provided by the embodiment of the application uses this audio mutation information to capture the highlights of a short video.
In one example, the audio score may be determined using the following steps: determining an energy value of the audio data; from the energy values, an audio score is determined.
In specific implementation, characteristic mutation points of the audio can be detected and the audio sequence divided into segments at those points. The energy value Ei of each segment and the energy value E of the whole audio are then calculated, for example taking the average amplitude of the audio within the ith segment as Ei and the average energy over the whole audio as E. For each segment, |Ei − E|/E is computed; the larger this value, the more likely the segment contains a characteristic mutation point. The multiple video frames corresponding to an audio segment may share that segment's score.
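Given the per-segment energies (the mutation-point detection itself is outside this sketch), the |Ei − E|/E score can be computed as follows; the function name is an assumption:

```python
def audio_scores(segment_energies):
    """segment_energies: mean amplitude Ei of each audio segment.
    Returns |Ei - E| / E per segment, where E is the whole-audio average."""
    e = sum(segment_energies) / len(segment_energies)  # whole-audio energy E
    return [abs(ei - e) / e for ei in segment_energies]
```

All video frames that fall inside a segment would then share that segment's score.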
Step S2073: and determining the quality score according to the image quality score, the video subject expression degree and the audio score.
The image quality score, the video subject expression degree, and the audio score are combined to determine the quality score of the video frame. In specific implementation, a weight can be set for each dimension's score, and the scores weighted and summed to obtain the quality score of the video frame. For example, the quality score may be calculated using the following formula:
φ = α·Σ{(θi | j ∈ i)·aj·δj} + β·εh·εw·si + γ·|Ei − E|/E
where α is the weight of the image quality score, β the weight of the video subject expression degree, and γ the weight of the audio score; the values of α, β, and γ can be adjusted.
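A minimal sketch of this weighted sum; evaluating several (α, β, γ) sets at once reflects the adjustable weights, and the particular coefficient values are illustrative:

```python
WEIGHT_COMBOS = [(1, 1, 1), (1, 2, 4), (4, 2, 1)]  # example (α, β, γ) settings

def quality_scores(image_score, expression, audio_score, combos=WEIGHT_COMBOS):
    """φ = α·(image aesthetics) + β·(subject expression) + γ·(audio score),
    computed for every weight combination."""
    return [a * image_score + b * expression + g * audio_score
            for a, b, g in combos]
```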
The method provided by this embodiment, through steps S2071 to S2073 above, makes the generated video summary conform more closely to the subject of the short video with higher image quality, while the audio mutations allow the highlights of the short video to be captured, so the summary is more representative of the short video.
Step S209: and determining the key frame of the video segment according to the quality score of the video frame.
In the embodiment, a video to be processed is divided into a plurality of shot segments; accordingly, from the video frame scores, the key frames for each shot can be determined.
In one example, step S209 may include the following sub-steps: determining the number of key frames of the shot according to the duration of the shot; and selecting one or more key frames from the shot segment according to the video frame scores and the number of the key frames. Thus, for a longer shot, multiple key frames can be selected to more fully express the key content of the shot.
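The per-shot key frame selection can be sketched as follows. The exact duration-to-count rule is not given in the text, so the rule used here (one key frame per 5 seconds, capped at 5) is an illustrative assumption:

```python
def select_key_frames(frame_scores, shot_seconds):
    """frame_scores: [(frame_index, quality_score)] for one shot.
    Picks more key frames for longer shots, then the highest-scoring frames."""
    n = max(1, min(5, int(shot_seconds // 5) + 1))  # assumed duration rule
    ranked = sorted(frame_scores, key=lambda fs: fs[1], reverse=True)
    return sorted(idx for idx, _ in ranked[:n])     # chronological order
```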
Step S211: and generating a preview moving picture of the video to be processed according to a plurality of key frames corresponding to a plurality of video segments.
As shown in fig. 4, in the method provided in the embodiment of the present application, the influence of sound effect, image and subtitle in the short video is comprehensively considered, and thus, the segments of the video are comprehensively scored, a plurality of key frames can be selected from the video to be processed according to the video frame scores, and the key frames form the preview moving picture.
In one example, step S211 may include the following sub-steps: forming a key frame segment of the shot from the video frames adjacent to the key frame, for example selecting 5 frames before and 5 frames after the key frame to form the segment; and combining the key frame segments into the preview moving picture in shot order. This processing makes the transitions between shots smoother and can improve the visual effect of the preview moving picture.
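Assembling the preview from key frame segments can be sketched as follows, with frames identified by index; the padding of 5 frames on each side follows the example above, and the function names are assumptions:

```python
def key_frame_segment(key_idx, shot_start, shot_end, pad=5):
    """Indices of the key frame plus up to `pad` frames on each side,
    clipped to the shot boundaries [shot_start, shot_end]."""
    return list(range(max(shot_start, key_idx - pad),
                      min(shot_end, key_idx + pad) + 1))

def build_preview(segments_in_shot_order):
    """Splice the key frame segments in shot order into one frame list."""
    preview = []
    for seg in segments_in_shot_order:
        preview.extend(seg)
    return preview
```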
In one example, step S2073 may be implemented as follows: determining a plurality of sets of weight combinations of a first weight corresponding to the image quality score, a second weight corresponding to the video subject expression degree, and a third weight corresponding to the audio score; and determining the quality score under each set of weight combinations. The key frames under each set of weight combinations can then be determined according to the quality scores under that set, and a plurality of preview moving pictures of the video to be processed generated from the key frames under the plurality of sets of weight combinations. In this way, several preview moving pictures can be generated, and the one with the best effect can be selected from them.
In one example, the plurality of preview moving pictures are divided into a plurality of classes through a clustering algorithm, and a target preview moving picture is selected from the classes.
In specific implementation, multiple preview moving pictures can be generated according to different emphases (image aesthetics, subject expression, or audio). To generate preview moving pictures with different emphases, n sets of values can be taken for (α, β, γ), such as (1, 1, 1), (1, 2, 4), and (4, 2, 1), each coefficient set expressing a different emphasis. Fig. 3 shows the processing for each set of values, which may include the following steps: 1) divide the video to be processed into segments, each belonging to a single scene shot; the opening and closing shots can usually be removed; 2) score the video frames of each shot, i.e., calculate the key frame scores within each shot according to the values of α, β, and γ, and select the top five or top three key frames depending on the shot duration; 3) select some frames (about 1 s in duration) before and after each selected key frame to form a key frame segment; 4) splice the key frame segments to form a preview moving picture. In this way multiple preview moving pictures are obtained; K-means clustering (into m groups) can then be performed on the key frame information, keeping the moving picture closest to each cluster center, so that m moving pictures are obtained and displayed to the user.
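The K-means selection step above can be sketched as follows, with each candidate preview represented by a numeric feature vector built from its key frame information (the feature extraction itself, and the function name, are assumptions):

```python
import random

def kmeans_pick(feature_vecs, m, iters=20, seed=0):
    """Minimal K-means over per-candidate feature vectors; returns the index
    of the candidate closest to each of the m cluster centres."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(feature_vecs, m)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        groups = [[] for _ in range(m)]
        for v in feature_vecs:  # assign each candidate to its nearest centre
            groups[min(range(m), key=lambda k: dist2(v, centers[k]))].append(v)
        for k, g in enumerate(groups):
            if g:  # recompute centre as the mean of its members
                centers[k] = [sum(col) / len(g) for col in zip(*g)]
    picks = {min(range(len(feature_vecs)),
                 key=lambda i: dist2(feature_vecs[i], centers[k]))
             for k in range(m)}
    return sorted(picks)
```

The m picked candidates (one per class) are the moving pictures shown to the user.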
In specific implementation, a plurality of preview motion pictures can be displayed for a video owner, and a target preview motion picture is selected manually to be published.
In one example, step S211 can be implemented as follows: and coding a plurality of key frames in a coding format different from the video to be processed to form the preview motion picture. For example, the encoding format of the video to be processed is MPEG, a plurality of key frames are synthesized into a GIF moving picture, and the encoding format of the preview moving picture is GIF. Therefore, the file size of the preview moving picture can be effectively reduced, and storage resources and network resources are saved.
In one example, the method may further comprise the steps of: and issuing a video to be processed and previewing a motion picture. As shown in fig. 5, after the preview moving picture is generated in the above manner, the video to be processed and the corresponding preview moving picture can be launched to a specified website, such as a live broadcast platform, an e-commerce platform, a social platform, a video platform, a short video platform, and the like, and then the video launching effect can be tracked.
In specific implementation, a plurality of issued target preview motion pictures can be tracked, and the weight of each dimension score can be adjusted according to the click rate of each preview motion picture.
As can be seen from the foregoing embodiments, the video preview moving picture processing method provided in the embodiments of the present application divides a video to be processed into a plurality of video segments; scores the quality of each video frame based on the subtitle data, audio data, and image data of the frame; determines the key frames of each video segment according to the quality scores; and generates a preview moving picture of the video to be processed from the key frames corresponding to the plurality of video segments. With this processing, key video frames are selected per video segment, each video frame is evaluated comprehensively based on its multi-modal data, and the generated preview moving picture comprises key frames from multiple video segments, so its content is more comprehensive and its key frames are more representative of the video.
Second embodiment
In the foregoing embodiment, a video preview moving picture processing method is provided, and correspondingly, the present application further provides a video preview moving picture processing apparatus. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application further provides a video preview moving picture processing apparatus, which may be a virtual apparatus implemented by computer software to implement the above method. The components of the device are functional modules established for implementing the steps of the method, and the device is a functional module framework for implementing the solution. The device can be deployed at a user terminal, a server side or a cloud side, or part of the functional modules are deployed at the user terminal and part of the functional modules are deployed at the server side or the cloud side. The user terminal includes but is not limited to: personal computers, smart phones, tablet computers, and the like.
The device may comprise the following functional modules:
the video acquisition unit is used for acquiring a video to be processed; the video dividing unit is used for dividing the video to be processed into a plurality of video segments; the multi-mode data acquisition unit is used for acquiring subtitle texts, image data and audio data corresponding to the video segments; the video frame scoring unit is used for determining the quality scores of the video frames in the video segments corresponding to the dimensions of the subtitles, the images and the audio according to the subtitle texts, the image data and the audio data; the key frame determining unit is used for determining the key frame of the video segment according to the quality score of the video frame; and the dynamic image generating unit is used for generating a preview dynamic image of the video to be processed according to a plurality of key frames corresponding to a plurality of video segments.
In one example, the apparatus is deployed in a cloud. A user may specify a video to be processed through a terminal device; the video may be stored at the local end, the server end, or the cloud, and a preview moving picture generation request for the video is sent to the cloud. After receiving the request, the cloud generates a preview moving picture of the video through the functional modules and may return it to the user terminal, which displays the preview moving picture for the user to watch. The user can decide whether to distribute the preview moving picture to a website based on the evaluation of its effect. For example, the user may publish a video to a certain video platform in advance and then specify the URL address of the video; the cloud can read the video from the video platform according to the URL address, generate a preview moving picture for it, and return the preview moving picture to the user terminal.
In another example, the user terminal has abundant computing and storage resources, and the apparatus is deployed at the user terminal. The user may send an instruction to the apparatus to generate a preview moving picture for a specified video to be processed; the preview moving picture is generated through the functional modules and may be displayed at the user terminal for the user to watch. The user can decide whether to simultaneously distribute the video and the preview moving picture to a certain website according to the evaluation of the moving picture's effect.
Optionally, the video dividing unit is specifically configured to divide the video to be processed into a plurality of shot segments.
Optionally, the motion picture generating unit includes: a key frame segment obtaining subunit, configured to form a key frame segment of the shot segment according to adjacent video frames of the key frame; and the key frame segment splicing subunit is used for forming the preview moving picture by the key frame segments according to the shot sequence.
Optionally, the key frame determining unit includes: a key frame number determining subunit, configured to determine, according to the duration of the shot, the number of key frames of the shot; and the key frame selecting subunit is used for selecting key frames from the shot fragments according to the quality scores and the number of the key frames.
Optionally, the motion picture generating unit is specifically configured to encode the plurality of key frames in a different encoding format from the video to be processed, so as to form a preview motion picture.
Optionally, the video preview moving picture processing apparatus further includes: and the issuing unit is used for issuing the video to be processed and the preview moving picture.
In this embodiment, the publishing unit is deployed at the user terminal, and the user sends the video and the corresponding preview moving picture to a website server through the publishing unit, publishing them to a website such as a live broadcast platform, an e-commerce platform, a social platform, a video platform, a short video platform, and the like. In specific implementation, the video can be issued through the publishing unit first, and the preview moving picture issued through the publishing unit after it is generated.

Optionally, the video frame scoring unit includes: the multi-dimensional scoring subunit, used for determining an image quality score, a video subject expression degree, and an audio score of the video frame according to the subtitle text, the image data, and the audio data corresponding to the video frame; and the comprehensive scoring subunit, used for determining the quality score according to the image quality score, the video subject expression degree, and the audio score.
Optionally, the comprehensive scoring subunit is specifically configured to: determine a plurality of sets of weight combinations of a first weight corresponding to the image quality score, a second weight corresponding to the video subject expression degree, and a third weight corresponding to the audio score; determine the quality score under each set of weight combinations; determine the key frames under each set of weight combinations according to the quality scores under that set; and generate a plurality of preview moving pictures of the video to be processed according to the key frames under the plurality of sets of weight combinations.
Optionally, the video preview moving picture processing apparatus includes: the preview moving picture classifying unit is used for classifying the preview moving pictures into a plurality of classes through a clustering algorithm; and the preview moving picture selecting unit is used for selecting the target preview moving picture from various moving pictures.
In one example, a plurality of preview images of the video are generated through the cloud, the preview images are displayed through the user terminal, and the user selects the preview image which is interested in the user terminal and publishes the preview image to the website.
Optionally, the audio score is determined as follows: determining an energy value of the audio data; determining the audio score according to the energy value.
Optionally, the video subject expression level is determined as follows: acquiring the area ratio and/or the image position of a video main body in a video frame image; acquiring the occurrence frequency of a video main body in a subtitle text; and determining the expression degree according to the area ratio, the image position and/or the occurrence frequency.
Optionally, the image quality score is determined as follows: determining video main body weights corresponding to all color areas of the video frame image according to the image data; and determining the image aesthetic degree as the image quality score at least according to the weight.
Optionally, the video preview moving picture processing apparatus further includes: the audio corresponding unit is used for acquiring an audio segment corresponding to the video frames, and the plurality of video frames correspond to one audio segment;
the video frame scoring unit is specifically configured to determine the quality score of the video frame according to the subtitle text, the image data, and the audio segment corresponding to the video frame.
Third embodiment
In the foregoing embodiment, a video preview moving picture processing method is provided, and correspondingly, the present application further provides an electronic device. The apparatus corresponds to an embodiment of the method described above. Since the apparatus embodiment is basically similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment. The device examples described below are merely illustrative.
The present application additionally provides an electronic device comprising: a processor; and a memory for storing a program for implementing the video preview moving picture processing method according to any one of the above, the terminal being powered on and the processor running the program of the method.
Although the present application has been described with reference to the preferred embodiments, they are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of protection of the present application should be determined by the claims that follow.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (14)

1. A video preview moving picture processing method, comprising:
acquiring a video to be processed;
dividing the video to be processed into a plurality of video segments;
acquiring caption text, image data and audio data corresponding to the video segment;
determining the quality scores of video frames in the video segment corresponding to the dimensions of subtitles, images and audio according to the subtitle texts, the image data and the audio data;
determining a key frame of the video segment according to the quality score of the video frame;
and generating a preview moving picture of the video to be processed according to a plurality of key frames corresponding to the plurality of video segments.
2. The method of claim 1,
the dividing the video to be processed into a plurality of video segments comprises:
and dividing the video to be processed into a plurality of shot fragments.
3. The method of claim 2,
the generating a preview motion picture of the video to be processed according to the plurality of key frames corresponding to the plurality of video segments comprises:
forming a key frame segment of the shot segment according to the adjacent video frames of the key frame;
and according to the shot sequence, forming the key frame fragments into the preview motion picture.
4. The method of claim 2,
the determining a key frame of the video segment according to the quality score of the video frame comprises:
determining the number of key frames of the shot according to the duration of the shot;
and selecting key frames from the shot fragments according to the quality scores and the number of the key frames.
5. The method according to claim 1, wherein the generating a preview motion picture of the video to be processed according to the plurality of key frames corresponding to the plurality of video segments comprises:
and coding the plurality of key frames in a coding format different from the video to be processed to form the preview motion picture.
6. The method of claim 1, further comprising:
and issuing the video to be processed and the preview moving picture.
7. The method of claim 1,
the determining, according to the subtitle text, the image data, and the audio data, a quality score of a video frame in the video segment corresponding to a subtitle dimension, an image dimension, and an audio dimension includes:
determining an image quality score, a video main body expression degree and an audio score of the video frame according to the subtitle text, the image data and the audio data corresponding to the video frame;
and determining the quality score according to the image quality score, the video subject expression degree and the audio score.
8. The method of claim 7, further comprising:
determining at least one video main body according to the subtitle text;
and determining the weight of the video main body according to the occurrence frequency of the video main body in the caption text.
9. The method of claim 7,
determining the quality score according to the image quality score, the video subject expression level and the audio score comprises:
determining a plurality of sets of weight combinations of a first weight corresponding to the image quality score, a second weight corresponding to the video subject expressiveness, and a third weight corresponding to the audio score;
determining the quality score under each set of weight combinations;
determining key frames under each group of weight combination according to the quality scores under each group of weight combination;
and generating a plurality of preview motion pictures of the video to be processed according to the key frames under the plurality of groups of weight combinations.
10. The method of claim 9, further comprising:
dividing the plurality of preview moving pictures into a plurality of classes through a clustering algorithm;
and selecting a target preview moving picture from the plurality of classes.
11. The method of claim 7,
the audio score is determined by the following method:
determining an energy value of the audio data;
determining the audio score according to the energy value.
12. The method of claim 7,
the video main body expression degree is determined by adopting the following method:
acquiring the area ratio and/or the image position of a video main body in a video frame image;
and determining the expression degree according to the area ratio and/or the image position.
13. The method of claim 7,
the image quality score is determined in the following way:
determining video main body weights corresponding to all color areas of the video frame image according to the image data;
and determining the image aesthetic degree as the image quality score according to at least the weight.
14. A video preview moving picture processing apparatus, comprising:
the video acquisition unit is used for acquiring a video to be processed;
the video dividing unit is used for dividing the video to be processed into a plurality of video segments;
the multi-mode data acquisition unit is used for acquiring subtitle texts, image data and audio data corresponding to the video segments;
the video frame scoring unit is used for determining the quality scores of the video frames in the video segments corresponding to the dimensions of the subtitles, the images and the audio according to the subtitle texts, the image data and the audio data;
a key frame determining unit, configured to determine a key frame of the video segment according to the quality score of the video frame;
and the dynamic image generating unit is used for generating a preview dynamic image of the video to be processed according to the plurality of key frames corresponding to the plurality of video segments.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111454801.1A CN113923504B (en) 2021-12-02 2021-12-02 Video preview moving picture generation method and device

Publications (2)

Publication Number Publication Date
CN113923504A true CN113923504A (en) 2022-01-11
CN113923504B CN113923504B (en) 2022-03-08

Family

ID=79248535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111454801.1A Active CN113923504B (en) 2021-12-02 2021-12-02 Video preview moving picture generation method and device

Country Status (1)

Country Link
CN (1) CN113923504B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955873A (en) * 2023-09-18 2023-10-27 北京佰信蓝图科技股份有限公司 Method for rapidly displaying massive dynamic planar vector data on browser
CN117097954A (en) * 2023-09-13 2023-11-21 北京饼干科技有限公司 Video processing method, device, medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070147504A1 (en) * 2005-12-23 2007-06-28 Qualcomm Incorporated Selecting key frames from video frames
CN109587581A (en) * 2017-09-29 2019-04-05 阿里巴巴集团控股有限公司 Video breviary generation method and video breviary generating means
CN112418012A (en) * 2020-11-09 2021-02-26 武汉大学 Video abstract generation method based on space-time attention model
CN112653918A (en) * 2020-12-15 2021-04-13 咪咕文化科技有限公司 Preview video generation method and device, electronic equipment and storage medium
CN113626641A (en) * 2021-08-11 2021-11-09 南开大学 Method for generating video abstract based on multi-mode data and aesthetic principle through neural network

Also Published As

Publication number Publication date
CN113923504B (en) 2022-03-08

Similar Documents

Publication Publication Date Title
CN108307229B (en) Video and audio data processing method and device
US11057457B2 (en) Television key phrase detection
US20170311014A1 (en) Social Networking System Targeted Message Synchronization
US9554184B2 (en) Method and apparatus for increasing user engagement with video advertisements and content by summarization
US8804999B2 (en) Video recommendation system and method thereof
US8126763B2 (en) Automatic generation of trailers containing product placements
CN113923504B (en) Video preview moving picture generation method and device
US20140337126A1 (en) Timed comments for media
WO2016028813A1 (en) Dynamically targeted ad augmentation in video
US10721519B2 (en) Automatic generation of network pages from extracted media content
CN111836118B (en) Video processing method, device, server and storage medium
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN105657514A (en) Method and apparatus for playing video key information on mobile device browser
KR102560609B1 (en) Video generation method and server performing thereof
US11741996B1 (en) Method and system for generating synthetic video advertisements
Hong et al. Movie2comics: a feast of multimedia artwork
CN110418148B (en) Video generation method, video generation device and readable storage medium
US11657850B2 (en) Virtual product placement
CN114845149A (en) Editing method of video clip, video recommendation method, device, equipment and medium
KR102560610B1 (en) Reference video data recommend method for video creation and apparatus performing thereof
CN116389849A (en) Video generation method, device, equipment and storage medium
CN112561549A (en) Advertisement generation method, advertisement delivery method, advertisement generation device and advertisement delivery device
WO2024104286A1 (en) Video processing method and apparatus, electronic device, and storage medium
KR20240059602A (en) Video recommendation method and apparatus performing thereof
CN118014659A (en) Electronic propaganda product generation and playing method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240201

Address after: Room 553, 5th Floor, Building 3, No. 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311121

Patentee after: Hangzhou Alibaba Cloud Feitian Information Technology Co.,Ltd.

Country or region after: China

Address before: 311121 Room 516, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: Alibaba Dharma Institute (Hangzhou) Technology Co.,Ltd.

Country or region before: China