CN115604503A - Video generation method, electronic device, and storage medium - Google Patents

Video generation method, electronic device, and storage medium

Info

Publication number
CN115604503A
Authority
CN
China
Prior art keywords
video
audio
time point
segment
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110719751.9A
Other languages
Chinese (zh)
Inventor
马天泽
马超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110719751.9A priority Critical patent/CN115604503A/en
Publication of CN115604503A publication Critical patent/CN115604503A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/233 Processing of audio elementary streams (server side)
    • H04N 21/23418 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics (server side)
    • H04N 21/251 Learning process for intelligent management, e.g. learning user preferences for recommending movies (server side)
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams (client side)
    • H04N 21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream (client side)
    • H04N 21/4668 Learning process for intelligent management, e.g. learning user preferences, for recommending content, e.g. movies (client side)
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments, by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises: dividing a video to be processed into a plurality of video segments; calculating a recommendation index of each video segment, wherein the recommendation index is based on at least one of a first matching degree, an influence index, a second matching degree, and a number of jump actions; and selecting a target video segment from the plurality of video segments according to the recommendation index, wherein the target video segment is a highlight segment of the video to be processed. The video generation method provided by the disclosure analyzes the video to be processed from multiple dimensions, so that highlight segments can be accurately clipped from the video to be processed. Through multi-dimensional analysis, the method is applicable to different types of video content, such as sports game videos, TV series, and movies.

Description

Video generation method, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of video technologies, and in particular, to a video generation method, an electronic device, and a storage medium.
Background
Short video refers to short-form video, generally video with a duration of less than 5 minutes distributed via new internet media. With the rise of the mobile internet, short videos have become more and more popular with users. How to convert long-form video content, such as movies and television shows, into short-form video content has therefore become a research focus in the industry.
In the related art, highlights in videos are generally located by motion recognition. For example, for a video of a sports game, highlights are located by identifying actions such as goals and shots. However, motion recognition is not applicable to other videos, such as movies and television shows. For example, motion recognition alone will miss segments whose appeal lies in highlight dialogue.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a video generation method, an electronic device, and a storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a video generation method, including:
segmenting a video to be processed to obtain a plurality of video segments;
calculating a recommendation index of each video segment based on attribute information of the video segment, wherein the attribute information comprises at least one of the following information:
a first degree of match between a picture classification of the video segment and a video classification of the video to be processed;
an influence index of a target object included in the video segment;
a second degree of match between the audio classification of the video segment and the video classification of the video to be processed;
a number of jump actions by which a user moves the playing progress into the time period of the video segment when watching the video to be processed;
determining a target video segment from the plurality of video segments according to the recommendation index for each of the video segments.
In some embodiments, the segmenting the video to be processed to obtain a plurality of video segments includes:
extracting the audio in the video to be processed to obtain an audio file;
determining audio types to which audio frames in the audio file belong, wherein the audio types comprise human voice audio and non-human voice audio;
segmenting the audio file based on a boundary between an audio frame belonging to the human voice audio and an audio frame belonging to the non-human voice audio in the audio file to obtain audio segments comprising continuous human voices;
based on the audio segment, the video segment is obtained.
In some embodiments, the method further comprises:
determining a time point corresponding to a transition video frame in the video to be processed;
the deriving the video segment based on the audio segment comprises:
for each of the audio segments, performing the steps of:
based on the starting time point of the audio segment, searching a first target time point in the time points corresponding to the transition video frame, and determining the first target time point as a new starting time point of the audio segment, wherein the first target time point is a time point which is earlier than the starting time point in the time points corresponding to the transition video frame and is closest to the starting time point;
based on the ending time point of the audio segment, searching a second target time point in the time points corresponding to the transition video frame, and determining the second target time point as a new ending time point of the audio segment, wherein the second target time point is a time point which is later than the ending time point and closest to the ending time point in the time points corresponding to the transition video frame;
obtaining a new audio segment based on the new start time point and the new end time point;
based on the new audio segment, the video segment is obtained.
In some embodiments, the determining the second target time point as a new ending time point of the audio segment comprises:
calculating the time difference between the end time point corresponding to the audio segment and the second target time point;
determining the second target time point as a new ending time point of the audio segment under the condition that the time difference is greater than or equal to a preset time threshold;
and under the condition that the time difference is smaller than the preset time threshold, sequentially searching a new second target time point in the time point corresponding to the transition video frame by using the ending time point of each audio segment subsequent to the audio segment, and determining the new second target time point as the new ending time point of the audio segment under the condition that the time difference between the searched new second target time point and the ending time point corresponding to the audio segment for searching the new second target time point is larger than or equal to the preset time threshold.
In some embodiments, the first degree of matching is obtained by:
obtaining the video classification to which the video to be processed belongs;
taking the image frame of the video segment as the input of a first classification model to obtain the picture classification of the video segment;
and determining the first matching degree based on the picture classification to which the video segment belongs and the video classification to which the video to be processed belongs.
In some embodiments, the impact index is obtained by:
performing image recognition on a video picture of the video segment, and determining a target object included in the video segment;
and calculating the influence index of the target object included in the video segment based on preset scores corresponding to different target objects.
In some embodiments, the second degree of match is obtained by:
acquiring a video classification to which the video to be processed belongs;
taking the background sound of the video segment as the input of a second classification model to obtain the audio classification of the background sound of the video segment;
and determining the second matching degree based on the audio classification to which the background sound of the video segment belongs and the video classification to which the video to be processed belongs.
In some embodiments, the number of jump actions is obtained by:
acquiring a play record of the video to be processed;
and determining the number of times that the user moves the playing progress to the time period of the video segment when watching the video to be processed based on the playing record, and taking the number of times as the number of jumping actions.
According to a second aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute instructions stored in the memory to implement the steps of the video generation method according to the first aspect of the present disclosure.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video generation method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: the video to be processed is divided into a plurality of video segments, and a recommendation index of each video segment is calculated, so that a target video segment is selected from the plurality of video segments according to the recommendation index. The video to be processed can thus be analyzed from multiple dimensions, so that highlight segments are accurately clipped from it. Through multi-dimensional analysis, the method is applicable to different types of video content, such as sports game videos, TV series, and movies.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a video generation method in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating segmentation of a video to be processed in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the partitioning of a video to be processed in accordance with an exemplary embodiment;
FIG. 4 is a schematic flow diagram illustrating segmentation of a video to be processed based on audio segmentation in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating the principle of segmenting a video to be processed based on audio segmentation in accordance with an exemplary embodiment;
FIG. 6 is a schematic flow diagram illustrating the derivation of an audio segment in accordance with an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating the structure of a deep learning model in accordance with an exemplary embodiment;
FIG. 8 is a block diagram of a video generation apparatus, according to an example embodiment;
fig. 9 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating a video generation method according to an exemplary embodiment. The method can be applied to an electronic device, such as a terminal or a server. As shown in fig. 1, the video generation method can include the following steps.
In step 110, the video to be processed is segmented to obtain a plurality of video segments.
Here, the video to be processed refers to a long video that needs to be edited, such as a movie or a television episode. A plurality of video segments are obtained by segmenting the video to be processed. The video to be processed may be segmented at preset time intervals; it may be segmented at transition video frames found by detecting its pictures; or its audio may be detected and the video segmented at the boundaries between human voice audio frames and non-human voice audio frames.
In some embodiments, the video to be processed may be pre-processed before being segmented. The pre-processing may include removing the opening and ending portions of the video to be processed, such as the title and credit sequences, to reduce the amount of computation.
In step 120, a recommendation index of each of the video segments is calculated based on attribute information of the video segments, wherein the attribute information includes at least one of the following information:
a first degree of match between a picture classification of the video segment and a video classification of the video to be processed; an influence index of a target object included in the video segment; a second degree of match between the audio classification of the video segment and the video classification of the video to be processed; and a number of jump actions by which a user moves the playing progress into the time period of the video segment when watching the video to be processed.
here, the attribute information of the video segment may include one or more of a first matching degree, an influence index, a second matching degree, and a number of jump actions.
The first matching degree is the similarity between a picture classification determined from the video pictures of the video segment and the video classification of the video to be processed. The picture classification represents the type of the video segment, such as action, suspense, disaster, war, and the like, and the video classification refers to the type of the video to be processed, such as action, suspense, disaster, war, and the like. For example, the video pictures of the video segment are analyzed, for example by analyzing elements such as hue, motion, and speech, to determine the picture classification of the video segment; the similarity between the determined picture classification and the video classification of the video to be processed is then calculated to obtain the first matching degree.
The influence index of a target object included in a video segment refers to the combined influence of the actors appearing in the video segment. The influence index of a target object can be determined according to the popularity of the actor: actors with different popularity correspond to different influence indexes, and popularity is positively correlated with the influence index. For example, a first-tier actor has a greater influence index than a second-tier actor. When many well-known actors appear in one video segment, that segment is likely to be a highlight of the video to be processed.
It should be understood that the target object of the present disclosure may be a human, an animated figure, an animal, etc. For different types of videos to be processed, the target objects included in the videos to be processed can be set according to the actual situation of the videos to be processed.
The second matching degree is the similarity between an audio classification determined from the audio of the video segment and the video classification of the video to be processed. The audio classification represents the type of the video segment, such as action, suspense, disaster, war, and the like, and the video classification refers to the type of the video to be processed, such as action, suspense, disaster, war, and the like. For example, the audio of the video segment is analyzed, for example by analyzing elements such as the background music, to determine the audio classification of the video segment; the similarity between the determined audio classification and the video classification of the video to be processed is then calculated to obtain the second matching degree.
The number of jump actions refers to the number of seek actions by which users move the playing progress into the time period of the video segment when watching the video to be processed. Generally, when watching a video, a user skips boring segments and jumps to the exciting moments. The number of jump actions can therefore characterize how exciting a video segment is.
It should be noted that the recommendation index of a video segment may be calculated from one or more kinds of the attribute information of the video segment. When the attribute information includes only one of the first matching degree, the influence index, the second matching degree, and the number of jump actions, that one item is used as the recommendation index. When the attribute information includes more than one of the first matching degree, the influence index, the second matching degree, and the number of jump actions, the recommendation index may be calculated as a weighted combination. For example, when the attribute information includes the first matching degree, the influence index, the second matching degree, and the number of jump actions, the recommendation index may be calculated by the following formula:
Score = α·R_video + β·R_audio + θ·R_star + γ·R_seek
where Score is the recommendation index, R_video is the first matching degree, R_audio is the second matching degree, R_star is the influence index, R_seek is the number of jump actions, and α, β, θ, and γ are weight values. The values of α, β, θ, and γ can be set according to different strategies for clipping highlights.
In step 130, a target video segment is determined from the plurality of video segments according to the recommendation index for each of the video segments.
Here, after the recommendation index of each video segment is determined, the plurality of video segments may be sorted by recommendation index, and the target video segments selected according to the sorting result. For example, the video segments may be sorted in descending order of recommendation index and the segments at preset positions in the sorted result taken as target video segments, for instance the top 10 video segments.
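A minimal sketch of this selection step, assuming the segments and their recommendation indices are held in parallel lists (names are illustrative):

```python
def select_target_segments(segments, scores, top_n: int = 10):
    """Sort segments by recommendation index (descending) and keep the top_n."""
    ranked = sorted(zip(segments, scores), key=lambda pair: pair[1], reverse=True)
    return [segment for segment, _ in ranked[:top_n]]
```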
It is worth noting that the target video segments are the highlight segments corresponding to the video to be processed: the higher the recommendation index of a video segment, the more exciting the segment is.
Therefore, in the embodiments of the disclosure, the video to be processed is divided into a plurality of video segments, the recommendation index of each video segment is calculated, and the target video segments are selected from the plurality of video segments according to the recommendation index. The video to be processed can be analyzed from multiple dimensions, so that highlight segments are accurately clipped from it. Through multi-dimensional analysis, the method is applicable to different types of video content, such as sports game videos, TV series, and movies.
Fig. 2 is a flow diagram illustrating segmentation of a video to be processed according to an example embodiment. As shown in fig. 2, in some implementation embodiments, the step 110 of segmenting the video to be processed to obtain a plurality of video segments may include the following steps:
in step 111, the audio in the video to be processed is extracted to obtain an audio file.
Here, the audio file is an audio stream separated from the video to be processed, and the time axis of the audio file corresponds one-to-one to the time axis of the video to be processed.
It should be understood that the extraction of audio from the video to be processed may be performed by a specific algorithm or software, such as an audio converter, etc., and the process of obtaining the audio file will not be described in detail herein.
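For illustration only, one common way to separate the audio track is the ffmpeg command-line tool (an assumption; the disclosure does not name a specific tool). The 16 kHz mono 16-bit output is chosen here to suit the VAD step described below:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract the audio stream as 16-bit mono PCM; its timeline matches the video's."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                    # drop the video stream
         "-acodec", "pcm_s16le",   # 16-bit PCM samples
         "-ac", "1",               # mono
         "-ar", str(sample_rate),  # resample, e.g. to 16 kHz
         wav_path],
        check=True,
    )
```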
In step 112, an audio category to which an audio frame in the audio file belongs is determined, wherein the audio category includes human voice audio and non-human voice audio.
Here, an audio frame refers to one frame of audio obtained according to the encoding format of the audio file; the frame length differs between encoding formats. For example, in the AMR (Adaptive Multi-Rate) format, every 20 ms of audio is one audio frame. In the MP3 (Moving Picture Experts Group Audio Layer III) format, the number of audio frames is determined by the file size and the frame length; the length of each frame may be fixed or variable, depending on the bit rate. Each audio frame consists of a frame header and a data payload, where the frame header records the bit rate, sampling rate, version, and other information, and the frames are independent of one another. The audio categories include human voice audio and non-human voice audio. Human voice audio refers to speech signals uttered by people in the audio file, such as spoken dialogue in a movie or television show; non-human voice audio refers to background sound in the audio file, such as background music in a movie or television show.
Each audio frame in the audio file can be detected to determine the audio category to which it belongs. The audio category of an audio frame may be determined by a VAD (Voice Activity Detection) algorithm, which detects whether the audio frame contains human voice.
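As an illustration, a frame-level VAD pass can be sketched with the open-source webrtcvad package (one possible VAD implementation, not necessarily the one used here); it expects 16-bit mono PCM at 8/16/32/48 kHz and 10/20/30 ms frames:

```python
import wave
import webrtcvad  # pip install webrtcvad

def label_frames(wav_path: str, frame_ms: int = 20, mode: int = 2):
    """Return (start_time_s, is_voice) labels, one per audio frame."""
    vad = webrtcvad.Vad(mode)            # aggressiveness 0 (least) .. 3 (most)
    with wave.open(wav_path, "rb") as wf:
        sample_rate = wf.getframerate()  # file assumed 16-bit mono, 8/16/32/48 kHz
        pcm = wf.readframes(wf.getnframes())
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    labels = []
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[offset:offset + frame_bytes]
        start_s = offset / 2 / sample_rate
        labels.append((start_s, vad.is_speech(frame, sample_rate)))
    return labels
```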
In step 113, the audio file is segmented based on a boundary between the audio frame belonging to the human voice audio and the audio frame belonging to the non-human voice audio in the audio file, so as to obtain an audio segment including continuous human voice.
Here, the audio file is divided into a plurality of segments based on the boundaries between human voice audio and non-human voice audio, and the segments containing continuous human voice are kept as audio segments. A boundary refers to the dividing point between an audio frame belonging to human voice audio and an audio frame belonging to non-human voice audio. For example, if the first to fifth audio frames belong to human voice audio and the sixth and seventh audio frames belong to non-human voice audio, the file is split between the fifth and sixth audio frames to obtain an audio segment comprising the first to fifth audio frames.
In step 114, the video segment is obtained based on the audio segment.
Here, the video to be processed may be segmented based on the critical time of the audio segmentation to obtain a plurality of video segments. The critical time of the audio segments refers to the starting time point and the ending time point of the audio segments, and since the audio segments are consistent with the time axis of the video to be processed, after the audio segments are obtained, the video to be processed can be segmented based on the critical time of each audio segment, so that the video segments corresponding to the audio segments are obtained.
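A minimal sketch, under the assumptions of the previous sketches, of grouping the VAD labels into continuous-voice audio segments and cutting the video at the same critical times; the ffmpeg stream-copy cut is illustrative and only lands on keyframes:

```python
import subprocess

def voiced_segments(labels, frame_ms: int = 20):
    """Group consecutive voiced frames into (start_s, end_s) audio segments."""
    segments, start = [], None
    for t, is_voice in labels:
        if is_voice and start is None:
            start = t                           # a run of human voice begins
        elif not is_voice and start is not None:
            segments.append((start, t))         # boundary: voice -> non-voice
            start = None
    if start is not None:                       # file ends while voice is active
        segments.append((start, labels[-1][0] + frame_ms / 1000))
    return segments

def cut_video_segment(video_path: str, start_s: float, end_s: float, out_path: str) -> None:
    # The audio and video timelines coincide, so the same window cuts the video.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ss", str(start_s),
                    "-to", str(end_s), "-c", "copy", out_path], check=True)
```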
The above embodiment will be described in detail with reference to fig. 3.
Fig. 3 is a schematic diagram illustrating a principle of segmenting a video to be processed according to an exemplary embodiment. As shown in fig. 3, audio is extracted from the video to be processed, and an audio file is obtained, where the lengths of time axes of the video to be processed and the audio file are the same, and the time points of the audio frame and the video frame are in one-to-one correspondence. The audio file is then segmented resulting in audio segments comprising consecutive human voices, as shown by the grey squares in fig. 3. Then, the video to be processed is segmented based on the critical time of the audio segment, and a corresponding video segment, such as a black square in fig. 3, is obtained. It should be appreciated that the final segmentation results in a video segment as indicated by the black squares in fig. 3.
FIG. 4 is a flowchart illustrating segmentation of a video to be processed based on audio segmentation, according to an exemplary embodiment. As shown in fig. 4, the video generating method may further include:
in step 210, a time point corresponding to a transition video frame in the video to be processed is determined.
Here, a transition video frame refers to a video frame at which the video to be processed switches from one scene to another. A histogram in the HSV color space can be computed pixel by pixel for each video frame image of the video to be processed, and whether a video frame is a transition video frame is judged from the histograms of two adjacent frames: within the same scene the histograms of adjacent frames change little, whereas across different scenes they change greatly.
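One possible sketch of this histogram comparison using OpenCV; the bin counts and the correlation threshold are illustrative guesses, not values given by the disclosure:

```python
import cv2

def transition_time_points(video_path: str, threshold: float = 0.6):
    """Return time points (s) where adjacent frames' HSV histograms differ sharply."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    transitions, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, [16, 4, 4],
                            [0, 180, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between adjacent frames suggests a scene change.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                transitions.append(frame_idx / fps)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return transitions
```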
In step 114, obtaining the video segment based on the audio segment may include the following steps:
in step 211, based on the start time point of the audio segment, a first target time point is searched for in the time points corresponding to the transition video frame, and the first target time point is determined as a new start time point of the audio segment, where the first target time point is a time point that is earlier than the start time point and closest to the start time point in the time points corresponding to the transition video frame.
Here, the start time point of an audio segment refers to a time point corresponding to a first frame audio frame or a first frame video frame in a video segment corresponding to the audio segment.
In step 212, based on the ending time point of the audio segment, a second target time point is searched for from the time points corresponding to the transition video frames, and the second target time point is determined as a new ending time point of the audio segment, where the second target time point is a time point which is later than the ending time point and closest to the ending time point in the time points corresponding to the transition video frames.
Here, the ending time point of an audio segment refers to a time point corresponding to a last frame of an audio frame or a last frame of a video frame in a video segment corresponding to the audio segment.
In step 213, a new audio segment is derived based on the new start time point and the new end time point.
Here, the new audio segment is the segment whose start and end are the found first target time point and the found second target time point, respectively. For example, if the start and end time points of the original audio segment are [t1, t2], the new audio segment is [t1', t2'], where t1' is the first target time point found for t1 and t2' is the second target time point found for t2.
In step 214, the video segment is derived based on the new audio segment.
Here, the video segment is obtained based on the new audio segment, and the video to be processed is also segmented based on the critical moment of the new video segment. The specific segmentation method has been described in detail in the above embodiments, and is not described herein again.
Fig. 5 is a schematic diagram illustrating the principle of segmenting a video to be processed based on audio segments according to an exemplary embodiment. As shown in fig. 5, the time points corresponding to the transition video frames identified in the video to be processed are, in order, A, B, C, D, and E. Based on the start time point of the first audio segment 10 in fig. 5, a first target time point is searched for among the time points corresponding to the transition video frames and time point A is found, so time point A is used as the new start time point of the first audio segment 10. Then, based on the end time point of the first audio segment 10, a second target time point is searched for among the time points corresponding to the transition video frames and time point B is found, so time point B is used as the new end time point of the first audio segment 10. For the second audio segment 20, the new start time point is time point C and the new end time point is time point D. For the third audio segment 30, the new start time point is time point E, and the new end time point is the original end time point of the third audio segment 30. The resulting new audio segments are shown as grey squares in fig. 5, and the video to be processed is then segmented based on the new audio segments to obtain the video segments, shown as black squares in fig. 5.
It should be understood that, when searching for a time point, if a first target time point meeting the condition is not found, the original start time point of the audio segment is taken as the first target time point, and if a second target time point meeting the condition is not found, the original end time point of the audio segment is taken as the second target time point.
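A minimal sketch of this boundary adjustment, keeping the original boundary when no suitable transition time point exists (as described above); `transitions` is assumed to be a sorted list of transition time points:

```python
import bisect

def snap_to_transitions(segment, transitions):
    """Snap (start_s, end_s) to the surrounding transition time points."""
    start, end = segment
    i = bisect.bisect_left(transitions, start)
    new_start = transitions[i - 1] if i > 0 else start         # first target time point
    j = bisect.bisect_right(transitions, end)
    new_end = transitions[j] if j < len(transitions) else end  # second target time point
    return new_start, new_end
```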
Therefore, by segmenting the video to be processed using both the VAD detection result and the transition video frame detection result, shot integrity can be ensured while each video segment still retains its highlight content. For example, between two adjacent audio segments there is a speech-silent gap. In movie and television drama videos, when the plot changes, such a gap may appear between two plot sections because the interval between character dialogues is long. Therefore, to ensure the shot integrity of the video segments corresponding to the audio segments obtained from the VAD detection result, the start time point and end time point of each audio segment are re-determined in combination with the transition video frame detection result, so that the finally obtained video segments maintain shot integrity.
In some realizable embodiments, the step 212 of determining the second target time point as a new ending time point of the audio segment comprises:
calculating the time difference between the end time point corresponding to the audio segment and the second target time point;
determining the second target time point as a new ending time point of the audio segment under the condition that the time difference is greater than or equal to a preset time threshold;
and under the condition that the time difference is smaller than the preset time threshold, sequentially searching a new second target time point in the time point corresponding to the transition video frame by using the ending time point of each audio segment subsequent to the audio segment, and determining the new second target time point as the new ending time point of the audio segment under the condition that the time difference between the searched new second target time point and the ending time point corresponding to the audio segment for searching the new second target time point is larger than or equal to the preset time threshold.
Here, when a second target time point is found based on the end time point of the first audio segment, the time difference between that second target time point and the end time point of the first audio segment is calculated. When the time difference is greater than or equal to the preset time threshold, that second target time point is taken as the new end time point of the first audio segment. When the time difference is smaller than the preset time threshold, a new second target time point is searched for using the end time point of the second audio segment: if the time difference between this new second target time point and the end time point of the second audio segment is greater than or equal to the preset time threshold, the new second target time point is taken as the end time point of the first audio segment; if it is still smaller than the preset time threshold, a further second target time point is searched for using the end time point of the third audio segment, and so on.
FIG. 6 is a schematic diagram illustrating a flow of obtaining audio segments according to an exemplary embodiment. As shown in fig. 6, the audio segments may be obtained by:
in step 310, the index i =0 of the audio segment is initialized, the index j =0 of the new audio segment result, and the total number of audio segments is N.
And step 320, judging whether i satisfies that i is less than N.
When i < N, step 330 is executed to find a first target time point among the time points of the transition video frames with the start time point of the ith audio segment, and take the first target time point as the start time point of the jth new audio segment.
And when the i is larger than or equal to N, executing the step 380 and outputting a result.
And step 340, searching a second target time point in the time point of the transition video frame by using the end time point of the ith audio segment, and calculating the time difference delta t between the end time point of the ith audio segment and the second target time point.
And step 350, judging whether the time difference delta T meets the condition that delta T is less than T. Wherein, T can be set according to practical situations, such as 5 seconds.
When Δ T < T, step 351 is performed, let i = i +1.
And step 352, judging whether i satisfies that i is less than N, and executing the step 340 when i is less than N.
When Δ T ≧ T, step 360 is executed, taking the second target time point as the end time point of the jth new audio segment.
Step 370, take the jth new audio segment as the result, let i = i +1, j = j +1, and return to execute step 320.
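A sketch of the FIG. 6 flow under stated assumptions: segments are sorted (start_s, end_s) pairs, `transitions` is the sorted list of transition time points, T defaults to the 5-second example above, and a trailing segment that never reaches the threshold is simply dropped (a point the flowchart leaves open):

```python
import bisect

def build_new_audio_segments(audio_segments, transitions, t_threshold: float = 5.0):
    """Build new audio segments per FIG. 6: the transition-aligned end point must lie
    at least t_threshold after the VAD-detected end point, otherwise later segments'
    end points are tried in turn."""

    def first_target(t):   # closest transition earlier than t, else t itself
        i = bisect.bisect_left(transitions, t)
        return transitions[i - 1] if i > 0 else t

    def second_target(t):  # closest transition later than t, else t itself
        j = bisect.bisect_right(transitions, t)
        return transitions[j] if j < len(transitions) else t

    results, i, n = [], 0, len(audio_segments)
    while i < n:                                            # step 320
        new_start = first_target(audio_segments[i][0])      # step 330
        new_end = None
        while i < n:                                        # step 352
            end = audio_segments[i][1]
            candidate = second_target(end)                  # step 340
            if candidate - end >= t_threshold:              # steps 350 / 360
                new_end, i = candidate, i + 1
                break
            i += 1                                          # step 351: try next segment
        if new_end is None:
            break                                           # no qualifying end point left
        results.append((new_start, new_end))                # step 370
    return results
```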
In some implementations, the first degree of matching is obtained by:
obtaining the video classification to which the video to be processed belongs;
taking the image frame of the video segment as the input of a first classification model to obtain the picture classification of the video segment;
and determining the first matching degree based on the picture classification to which the video segment belongs and the video classification to which the video to be processed belongs.
Here, the meaning of the video classification has been described in detail in the above embodiments, and is not described in detail here. In some embodiments, the video classification of the video to be processed may be obtained by obtaining the video classification tag of the video to be processed in the video website.
The first classification model may be a neural network model obtained by training with a training sample, where the training sample may include a video segment and a label for labeling a picture classification to which the video segment belongs. The first classification model may analyze image frames of the video segment, and may analyze color, texture, hue, and the like of the image frames to obtain a picture classification to which each image frame belongs, and then determine the picture classification of the video segment according to the picture classifications to which all the image frames belong. In some embodiments, each video segment may be classified based on a VideoTag model to determine the picture classification to which the video segment belongs. And then, calculating to obtain a first matching degree based on the similarity between the picture classification and the video classification of the video to be processed.
It should be understood that the greater the first matching degree, the closer the video segment is to the video classification of the video to be processed, and the more likely the video segment is a highlight of the video to be processed.
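One possible (assumed) reading of the first matching degree, treating the first classification model's per-frame output as a probability distribution over picture classes and scoring the class of the video to be processed:

```python
import numpy as np

def first_matching_degree(frame_probs: np.ndarray, class_names, video_class: str) -> float:
    """Mean probability the frame-level classifier assigns to the video's own class.

    frame_probs: (num_frames, num_classes) softmax outputs of the first
    classification model (assumed); class_names: class labels in column order.
    """
    class_idx = list(class_names).index(video_class)
    return float(frame_probs[:, class_idx].mean())
```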
In some implementations, the impact index is obtained by:
performing image recognition on a video picture of the video segment, and determining a target object included in the video segment;
and calculating the influence index of the target object included in the video segment based on preset scores corresponding to different target objects.
Here, image recognition may be performed on the video pictures of the video segment by a trained neural network model to determine the target objects included in the pictures. Then, based on preset scores corresponding to different target objects, the influence index of the target objects included in the video segment is calculated. The influence index of a target object can be determined according to the popularity of the actor: actors with different popularity correspond to different influence indexes, and popularity is positively correlated with the influence index. For example, a first-tier actor has a greater influence index than a second-tier actor.
It should be understood that, when the influence index of the target object appearing in the video segment is larger, the probability that the video segment is a highlight corresponding to the video to be processed is higher.
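A minimal sketch, assuming an image-recognition step has already produced the list of recognized target objects and that the preset scores form a simple lookup table (both names below are hypothetical):

```python
# Hypothetical preset scores; in practice they would be set per actor popularity.
PRESET_SCORES = {"actor_a": 1.0, "actor_b": 0.6, "actor_c": 0.3}

def influence_index(detected_objects) -> float:
    """Sum the preset scores of the distinct target objects in the segment."""
    return sum(PRESET_SCORES.get(name, 0.0) for name in set(detected_objects))
```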
In some implementations, the second degree of matching is obtained by:
obtaining the video classification to which the video to be processed belongs;
taking the background sound of the video segment as the input of a second classification model to obtain the audio classification to which the background sound of the video segment belongs;
and determining the second matching degree based on the audio classification to which the background sound of the video segment belongs and the video classification to which the video to be processed belongs.
Here, the meaning of the video classification has been described in detail in the above embodiments, and is not described again here. In some embodiments, the video classification of the video to be processed may be obtained by obtaining the video classification tag of the video to be processed in the video website.
The second classification model may be a neural network model obtained by training with a training sample, where the training sample may include a background sound and a label for labeling an audio classification to which the video segment belongs. The trained second classification model may classify background sounds of the video segments to determine audio classifications to which the video segments belong. And then calculating the similarity between the determined audio classification and the video classification of the video to be processed to obtain a second matching degree. It should be understood that the greater the second degree of match, the greater the likelihood that the video segment is a highlight corresponding to the video to be processed.
The second classification model may be a deep learning model AudioClassifier. FIG. 7 is a schematic diagram illustrating the structure of the deep learning model according to an exemplary embodiment. As shown in fig. 7, the background sound of the video segment is input to the AudioClassifier and passes, in sequence, through a VGGish feature extraction layer, a 128-dimensional fully connected layer (Full_Connect_Layer), a ReLU layer, a Dropout layer, a 64-dimensional fully connected layer (Full_Connect_Layer), a ReLU layer, a Dropout layer, a K-dimensional fully connected layer (Full_Connect_Layer), and a Softmax classification layer, to obtain the classification result of the background sound.
Wherein, the input background sound can pass through the VGGish feature extraction layer to obtain a 128-dimensional high-level feature vector. And then 4 layers of full connection layers are added to deepen the network so that the network can extract better audio feature representation, and the parameter K in the last layer of full connection layer is the number of audio categories to be classified actually. Finally, classification is realized through a Softmax layer, and a classification result of the background sound is obtained.
It should be understood that the deep learning model AudioClassifier is a machine learning model trained on audio samples collected based on K audio classes to be classified. In the training process, the loss function can adopt a cross entropy loss function, and the optimizer can adopt an Adam optimizer.
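A sketch of the classification head in PyTorch, assuming 128-dimensional VGGish embeddings as input (for example from the torchvggish package); the dropout rate is an assumption, and training would typically apply a cross-entropy loss to the raw logits with an Adam optimizer, as noted above:

```python
import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    """Fully connected head over VGGish embeddings, following the FIG. 7 layout."""

    def __init__(self, num_classes: int, dropout: float = 0.5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, num_classes),   # K-dimensional fully connected layer
        )

    def forward(self, vggish_embedding: torch.Tensor) -> torch.Tensor:
        # Softmax yields the background-sound class probabilities.
        return torch.softmax(self.head(vggish_embedding), dim=-1)
```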
In some implementations, the number of jump actions is obtained by:
acquiring a play record of the video to be processed;
and determining the number of times that the user moves the playing progress to the time period of the video segment when watching the video to be processed based on the playing record, and taking the number of times as the number of jumping actions.
Here, the number of skip actions refers to the number of seek actions of moving the playing progress to a time period in which the video segment is located when the user watches the video to be processed. Generally, when watching a video, a user chooses to skip a segment of boredom and watch it at a highlight point in time. Thus, the number of jump actions can characterize the highlight of the video segment. Therefore, the jump action times are determined according to the playing records of the video to be processed by acquiring the playing records of the video to be processed.
The playing record of the video to be processed can be obtained on a video website, and the playing record comprises the playing times of the video to be processed and the jumping action times recorded in each playing process.
It should be understood that the greater the number of jump actions, the greater the likelihood of the video segment being a highlight corresponding to the video to be processed.
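A minimal sketch, assuming the play records have already been reduced to a flat list of seek target times in seconds (the actual record format is not specified here):

```python
def jump_action_count(seek_times, segment_start: float, segment_end: float) -> int:
    """Count seek actions whose target time falls inside the segment's time period."""
    return sum(1 for t in seek_times if segment_start <= t < segment_end)
```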
Fig. 8 is a block diagram illustrating a video generation apparatus according to an example embodiment. Referring to fig. 8, the apparatus includes a segmentation module 121, a score calculation module 122, and a determination module 123.
The segmentation module 121 is configured to segment a video to be processed to obtain a plurality of video segments;
the score calculating module 122 calculates a recommendation index for each of the video segments based on the attribute information of the video segment, wherein the attribute information includes at least one of the following information:
a first degree of match between a picture classification of the video segment and a video classification of the video to be processed;
an influence index of a target object included in the video segment;
a second degree of match between the audio classification of the video segment and the video classification of the video to be processed;
when the user watches the video to be processed, moving the playing progress to the jumping action times of the time period of the video segment;
the determining module 123 is configured to determine a target video segment from the plurality of video segments according to the recommendation index for each of the video segments.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video generation method provided by the present disclosure.
Fig. 9 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 9, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the video generation method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 806 provide power to the various components of the electronic device 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described video generation method.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the video generation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the video generation method described above when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A video generation method, comprising:
segmenting a video to be processed to obtain a plurality of video segments;
calculating a recommendation index of each video segment based on attribute information of the video segment, wherein the attribute information comprises at least one of the following information:
a first degree of match between a picture classification of the video segment and a video classification of the video to be processed;
an influence index of a target object included in the video segment;
a second degree of match between the audio classification of the video segment and the video classification of the video to be processed;
a number of jump actions of moving a playing progress to a time period of the video segment when a user watches the video to be processed; and
determining a target video segment from the plurality of video segments according to the recommendation index for each of the video segments.
2. The video generation method of claim 1, wherein the segmenting the video to be processed to obtain the plurality of video segments comprises:
extracting the audio in the video to be processed to obtain an audio file;
determining audio types to which audio frames in the audio file belong, wherein the audio types comprise human voice audio and non-human voice audio;
segmenting the audio file based on a boundary between an audio frame belonging to the human voice audio and an audio frame belonging to the non-human voice audio in the audio file to obtain audio segments comprising continuous human voices;
obtaining the video segment based on the audio segment.
3. The video generation method of claim 2, wherein the method further comprises:
determining time points corresponding to transition video frames in the video to be processed;
and wherein the obtaining the video segment based on the audio segment comprises:
for each of the audio segments, performing the steps of:
based on the start time point of the audio segment, searching for a first target time point among the time points corresponding to the transition video frames, and determining the first target time point as a new start time point of the audio segment, wherein the first target time point is, among the time points corresponding to the transition video frames, the time point that is earlier than and closest to the start time point;
based on the end time point of the audio segment, searching for a second target time point among the time points corresponding to the transition video frames, and determining the second target time point as a new end time point of the audio segment, wherein the second target time point is, among the time points corresponding to the transition video frames, the time point that is later than and closest to the end time point;
obtaining a new audio segment based on the new start time point and the new end time point;
obtaining the video segment based on the new audio segment.
4. The video generation method according to claim 3, wherein the determining the second target time point as a new end time point of the audio segment comprises:
calculating a time difference between the end time point of the audio segment and the second target time point;
determining the second target time point as the new end time point of the audio segment when the time difference is greater than or equal to a preset time threshold;
and when the time difference is smaller than the preset time threshold, sequentially searching for a new second target time point among the time points corresponding to the transition video frames by using the end time point of each audio segment subsequent to the audio segment, and determining the new second target time point as the new end time point of the audio segment when the time difference between the found new second target time point and the end time point of the audio segment used for searching for the new second target time point is greater than or equal to the preset time threshold.
5. The video generation method of claim 1, wherein the first matching degree is obtained by:
obtaining the video classification to which the video to be processed belongs;
taking the image frame of the video segment as the input of a first classification model to obtain the picture classification of the video segment;
and determining the first matching degree based on the picture classification to which the video segment belongs and the video classification to which the video to be processed belongs.
6. The video generation method of claim 1, wherein the influence index is obtained by:
performing image recognition on a video picture of the video segment, and determining a target object included in the video segment;
and calculating the influence index of the target object included in the video segment based on preset scores corresponding to different target objects.
7. The video generation method of claim 1, wherein the second matching degree is obtained by:
obtaining the video classification to which the video to be processed belongs;
taking the background sound of the video segment as the input of a second classification model to obtain the audio classification of the background sound of the video segment;
and determining the second matching degree based on the audio classification to which the background sound of the video segment belongs and the video classification to which the video to be processed belongs.
8. The video generation method according to claim 1, wherein the number of jump actions is obtained by:
acquiring a play record of the video to be processed;
and determining, based on the play record, the number of times that the user moves the playing progress to the time period of the video segment when watching the video to be processed, and taking the number of times as the number of jump actions.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute instructions stored in the memory to implement the steps of the video generation method of any of claims 1 to 8.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the video generation method of any of claims 1 to 8.
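
For readers who find the claim language dense, the following Python sketches are purely illustrative; they are not part of the claims or the specification. This first sketch corresponds to claim 1: each video segment is scored from its attribute information, and the highest-scoring segment is chosen as the target. The VideoSegment fields, the weights, and the weighted-sum scoring rule are all assumptions; the claims only require that a recommendation index be computed from at least one of the listed attributes.

```python
# Illustrative only; not part of the patent. Names, weights, and normalization are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class VideoSegment:
    start: float                 # segment start time, in seconds
    end: float                   # segment end time, in seconds
    first_match_degree: float    # picture classification vs. video classification, in [0, 1]
    influence_index: float       # influence of the recognized target objects
    second_match_degree: float   # audio classification vs. video classification, in [0, 1]
    jump_count: int              # jump actions that moved the playing progress into this segment

def recommendation_index(seg: VideoSegment, weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Combine the attribute information of claim 1 into one score (a weighted sum is
    only one plausible rule; the claims do not prescribe the combination)."""
    w1, w2, w3, w4 = weights
    return (w1 * seg.first_match_degree
            + w2 * seg.influence_index
            + w3 * seg.second_match_degree
            + w4 * min(seg.jump_count / 10.0, 1.0))  # crude normalization of the jump count

def select_target_segment(segments: List[VideoSegment]) -> VideoSegment:
    """Determine the target video segment as the one with the highest recommendation index."""
    return max(segments, key=recommendation_index)

if __name__ == "__main__":
    segments = [
        VideoSegment(0.0, 12.5, 0.8, 0.4, 0.7, 3),
        VideoSegment(12.5, 30.0, 0.5, 0.9, 0.6, 10),
    ]
    best = select_target_segment(segments)
    print(f"target segment: {best.start:.1f}s - {best.end:.1f}s")
```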
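A minimal sketch of the audio-based segmentation in claim 2, assuming that some voice activity detector (not specified by the claims) has already labelled each audio frame as human voice or non-human voice. The code only splits the label sequence at the voice/non-voice boundaries into segments of continuous human voice.

```python
# Illustrative only; the per-frame labels are assumed to come from an unspecified detector.
from typing import List, Tuple

def voice_segments(is_voice: List[bool], frame_duration: float) -> List[Tuple[float, float]]:
    """Return (start_time, end_time) pairs, in seconds, for each run of
    consecutive frames labelled as human voice."""
    segments = []
    run_start = None
    for i, voiced in enumerate(is_voice):
        if voiced and run_start is None:
            run_start = i                       # a voiced run begins at this frame
        elif not voiced and run_start is not None:
            segments.append((run_start * frame_duration, i * frame_duration))
            run_start = None                    # the run ended at the previous frame
    if run_start is not None:                   # the audio ends while still voiced
        segments.append((run_start * frame_duration, len(is_voice) * frame_duration))
    return segments

if __name__ == "__main__":
    # Frame labels: 1 = human voice, 0 = non-human voice (music, silence, noise, ...)
    labels = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0]
    print(voice_segments([bool(x) for x in labels], frame_duration=0.5))
    # -> [(1.0, 2.5), (3.5, 4.5)]
```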
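A sketch of the boundary alignment in claim 3: the start of an audio segment is moved back to the closest earlier transition time point and its end is moved forward to the closest later one, so that the resulting video segment begins and ends on scene transitions. The sorted transition_times list is assumed to come from a separate transition-detection step.

```python
# Illustrative only; transition detection itself is not shown.
import bisect
from typing import List, Tuple

def snap_to_transitions(segment: Tuple[float, float],
                        transition_times: List[float]) -> Tuple[float, float]:
    start, end = segment
    # first target time point: the latest transition time that is strictly earlier than the start
    i = bisect.bisect_left(transition_times, start)
    new_start = transition_times[i - 1] if i > 0 else start
    # second target time point: the earliest transition time that is strictly later than the end
    j = bisect.bisect_right(transition_times, end)
    new_end = transition_times[j] if j < len(transition_times) else end
    return new_start, new_end

if __name__ == "__main__":
    transitions = [0.0, 4.2, 9.8, 15.0, 21.3]
    print(snap_to_transitions((5.0, 12.0), transitions))   # -> (4.2, 15.0)
```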
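A sketch of the refinement in claim 4, under one reading of the claim: if the transition time point found for a segment's end lies less than a preset threshold beyond that end point, the end points of the following audio segments are tried in order, and the first transition point that lies at least the threshold beyond the end point used to find it becomes the new end. In effect, a segment that would otherwise end too soon absorbs its successors. The concrete numbers are placeholders.

```python
# Illustrative only; this encodes one possible interpretation of claim 4, not the patent's code.
import bisect
from typing import List, Tuple

def new_end_time(seg_index: int,
                 audio_segments: List[Tuple[float, float]],
                 transition_times: List[float],
                 threshold: float) -> float:
    for k in range(seg_index, len(audio_segments)):
        end_k = audio_segments[k][1]
        j = bisect.bisect_right(transition_times, end_k)
        if j == len(transition_times):
            break                                    # no transition later than this end point
        candidate = transition_times[j]              # closest transition time after end_k
        if candidate - end_k >= threshold:
            return candidate                         # accept as the new end of segment seg_index
    return audio_segments[seg_index][1]              # keep the original end point

if __name__ == "__main__":
    segments = [(1.0, 4.0), (4.0, 7.5), (7.5, 12.0)]
    transitions = [0.0, 4.3, 8.0, 13.0]
    print(new_end_time(0, segments, transitions, threshold=1.0))   # -> 13.0
```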
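The first and second matching degrees of claims 5 and 7 share the same shape: a classification model labels the segment (from its image frames or from its background sound), and the result is compared with the classification of the whole video. The claims do not say how the comparison is scored; the sketch below assumes each model outputs class probabilities and takes the probability assigned to the video's class as the matching degree.

```python
# Illustrative only; the classification models and the scoring rule are assumptions.
from typing import Dict

def matching_degree(segment_class_probs: Dict[str, float], video_class: str) -> float:
    """Probability mass the segment-level classifier assigns to the whole video's class."""
    return segment_class_probs.get(video_class, 0.0)

if __name__ == "__main__":
    video_class = "sports"
    # Output of a hypothetical first classification model on the segment's image frames
    picture_probs = {"sports": 0.72, "news": 0.18, "music": 0.10}
    # Output of a hypothetical second classification model on the segment's background sound
    audio_probs = {"sports": 0.35, "news": 0.05, "music": 0.60}
    print("first matching degree:", matching_degree(picture_probs, video_class))   # 0.72
    print("second matching degree:", matching_degree(audio_probs, video_class))    # 0.35
```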
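A sketch of the influence index in claim 6: image recognition (not shown) determines which target objects appear in the segment, and each object contributes a preset score. The object names, the scores, and the use of a simple sum are illustrative assumptions.

```python
# Illustrative only; object names and preset scores are placeholders.
from typing import Dict, List

PRESET_SCORES: Dict[str, float] = {
    "lead_actor": 1.0,
    "supporting_actor": 0.6,
    "mascot": 0.3,
}

def influence_index(detected_objects: List[str],
                    preset_scores: Dict[str, float] = PRESET_SCORES) -> float:
    """Sum the preset scores of the recognized target objects (unknown objects score 0)."""
    return sum(preset_scores.get(obj, 0.0) for obj in detected_objects)

if __name__ == "__main__":
    print(influence_index(["lead_actor", "mascot"]))   # -> 1.3
```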
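Finally, a sketch of claim 8: the number of jump actions is read off the play records by counting how many seek operations landed inside the segment's time period. The record format (a list of seek target positions in seconds) is an assumption.

```python
# Illustrative only; the play-record format is assumed, not specified by the claims.
from typing import List, Tuple

def jump_count(seek_targets: List[float], segment: Tuple[float, float]) -> int:
    """Number of jump actions whose target position falls within the segment."""
    start, end = segment
    return sum(1 for t in seek_targets if start <= t < end)

if __name__ == "__main__":
    seeks = [3.0, 14.2, 15.8, 47.0, 16.1]     # jump targets taken from a play record
    print(jump_count(seeks, (12.5, 30.0)))    # -> 3
```
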
CN202110719751.9A 2021-06-28 2021-06-28 Video generation method, electronic device, and storage medium Pending CN115604503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110719751.9A CN115604503A (en) 2021-06-28 2021-06-28 Video generation method, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110719751.9A CN115604503A (en) 2021-06-28 2021-06-28 Video generation method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN115604503A true CN115604503A (en) 2023-01-13

Family

ID=84840612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110719751.9A Pending CN115604503A (en) 2021-06-28 2021-06-28 Video generation method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN115604503A (en)

Similar Documents

Publication Publication Date Title
CN109359636B (en) Video classification method, device and server
US10108709B1 (en) Systems and methods for queryable graph representations of videos
CN111258435B (en) Comment method and device for multimedia resources, electronic equipment and storage medium
KR102290419B1 (en) Method and Appratus For Creating Photo Story based on Visual Context Analysis of Digital Contents
CN111583907B (en) Information processing method, device and storage medium
US8750681B2 (en) Electronic apparatus, content recommendation method, and program therefor
US10410679B2 (en) Producing video bits for space time video summary
US8442389B2 (en) Electronic apparatus, reproduction control system, reproduction control method, and program therefor
US9646227B2 (en) Computerized machine learning of interesting video sections
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
CN110381366B (en) Automatic event reporting method, system, server and storage medium
CN111797820B (en) Video data processing method and device, electronic equipment and storage medium
US20240314372A1 (en) Machine learning based media content annotation
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN103200463A (en) Method and device for generating video summary
Moreira et al. Multimodal data fusion for sensitive scene localization
CN111583919B (en) Information processing method, device and storage medium
CN110557659A (en) Video recommendation method and device, server and storage medium
CN110930984A (en) Voice processing method and device and electronic equipment
CN112150457A (en) Video detection method, device and computer readable storage medium
CN115359409A (en) Video splitting method and device, computer equipment and storage medium
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN116567351B (en) Video processing method, device, equipment and medium
CN112992148A (en) Method and device for recognizing voice in video
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination