CN116980645A - Method, device, computer equipment and storage medium for generating summary video clips

Method, device, computer equipment and storage medium for generating summary video clips

Info

Publication number
CN116980645A
CN116980645A
Authority
CN
China
Prior art keywords
video
segment
target
played
segments
Prior art date
Legal status
Pending
Application number
CN202310077114.5A
Other languages
Chinese (zh)
Inventor
陈小帅
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310077114.5A
Publication of CN116980645A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25891Management of end-user data being end-user preferences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Graphics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application relates to a method, a device, a computer device, a storage medium and a computer program product for generating summary video clips. The method comprises the following steps: when there is a video to be played for a target object, acquiring a plurality of video clips for the video to be played; determining, from the plurality of video clips, a video clip set that matches selection bias information of the target object, and synthesizing video clips based on the video clip set to obtain at least two composite video clips; performing play completion evaluation on the at least two composite video clips respectively based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip; and determining, from the at least two composite video clips, a summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips. The method can improve the utilization rate of video resources.

Description

Method, device, computer equipment and storage medium for generating summary video clips
Technical Field
The present application relates to the field of computer technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for generating a summary video clip.
Background
With the development of computer technology, video applications have emerged, which are mainly used for playing video data such as TV dramas and movies.
Currently, when video data such as a TV drama or a movie is played in a video application, an opening sequence is usually played when playback starts and an ending sequence is played when playback finishes.
However, these opening and ending sequences are generally prefabricated from the content of the video data, so many viewers simply skip them, which leads to low utilization of the video resources at the opening and ending positions.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product for generating summary video clips that can improve the utilization of video resources.
In a first aspect, the present application provides a method for generating a summary video clip. The method comprises the following steps:
when there is a video to be played for a target object, acquiring a plurality of video clips for the video to be played;
determining, from the plurality of video clips, a video clip set that matches selection bias information of the target object, and synthesizing video clips based on the video clip set to obtain at least two composite video clips;
performing play completion evaluation on the at least two composite video clips respectively based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip;
and determining, from the at least two composite video clips, a summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips.
In one embodiment, determining, from the plurality of video clips, the video clip set that matches the selection bias information of the target object comprises:
constructing a vector representation for each of the plurality of video clips, and constructing a vector representation of the selection bias information of the target object;
fusing the vector representation of each video clip with the vector representation of the selection bias information to obtain a fused vector representation for each video clip;
and determining, based on the respective fused vector representations of the plurality of video clips, the video clip set that matches the selection bias information of the target object.
In one embodiment, constructing a vector representation for each of the plurality of video clips comprises:
for each target video clip in the plurality of video clips, extracting video frames from the target video clip to obtain multiple target video frames corresponding to the target video clip;
constructing a vector representation for each of the target video frames, and constructing a vector representation of the descriptive content of the target video clip;
and performing multi-modal fusion on the vector representations of the target video frames and the vector representation of the descriptive content of the target video clip to obtain the vector representation of the target video clip.
In one embodiment, constructing a vector representation for each of the target video frames comprises:
for each target video frame, segmenting the target video frame to obtain a plurality of video frame region blocks;
performing vector conversion on the video frame region blocks to obtain a region block vector for each video frame region block;
and encoding based on the region block vectors of the video frame region blocks to obtain the vector representation of the target video frame.
In one embodiment, the summary video clip generation method further comprises:
when the video to be played needs to be played, playing the summary video clip and the video to be played in the playing order of the two that is specified by the type of the summary video clip.
In a second aspect, the application further provides a device for generating summary video clips. The device comprises:
a video clip acquisition module, configured to acquire a plurality of video clips for a video to be played when there is a video to be played for a target object;
a video clip synthesis module, configured to determine, from the plurality of video clips, a video clip set that matches selection bias information of the target object, and to synthesize video clips based on the video clip set to obtain at least two composite video clips;
a video clip evaluation module, configured to perform play completion evaluation on the at least two composite video clips based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip;
and a video clip selection module, configured to determine, from the at least two composite video clips, a summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the following steps:
when there is a video to be played for a target object, acquiring a plurality of video clips for the video to be played;
determining, from the plurality of video clips, a video clip set that matches selection bias information of the target object, and synthesizing video clips based on the video clip set to obtain at least two composite video clips;
performing play completion evaluation on the at least two composite video clips respectively based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip;
and determining, from the at least two composite video clips, a summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips.
In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the following steps:
when there is a video to be played for a target object, acquiring a plurality of video clips for the video to be played;
determining, from the plurality of video clips, a video clip set that matches selection bias information of the target object, and synthesizing video clips based on the video clip set to obtain at least two composite video clips;
performing play completion evaluation on the at least two composite video clips respectively based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip;
and determining, from the at least two composite video clips, a summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
when there is a video to be played for a target object, acquiring a plurality of video clips for the video to be played;
determining, from the plurality of video clips, a video clip set that matches selection bias information of the target object, and synthesizing video clips based on the video clip set to obtain at least two composite video clips;
performing play completion evaluation on the at least two composite video clips respectively based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip;
and determining, from the at least two composite video clips, a summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips.
According to the above method, device, computer device, storage medium and computer program product for generating summary video clips, when there is a video to be played for the target object, a plurality of video clips for the video to be played are acquired, and a video clip set that matches the selection bias information of the target object is determined from them, so that the selection of the video clip set takes the selection bias information into account. Video clips are then synthesized based on the video clip set to obtain at least two composite video clips, which serve as candidate summary video clips.
Drawings
FIG. 1 is an application environment diagram of a method for generating summarized video clips in one embodiment;
FIG. 2 is a flowchart of a method for generating a summarized video clip according to one embodiment;
FIG. 3 is a schematic diagram of constructing a vector representation of selection bias information for a target object in one embodiment;
FIG. 4 is a schematic diagram of a matching degree calculation model in one embodiment;
FIG. 5 is a schematic diagram of constructing a vector representation of descriptive content of a target video segment in one embodiment;
FIG. 6 is a schematic diagram of a video clip representation model in one embodiment;
FIG. 7 is a schematic diagram of a video frame depth representation model in one embodiment;
FIG. 8 is a schematic diagram of constructing a first vector representation of a video to be played in one embodiment;
FIG. 9 is a schematic diagram of constructing a second vector representation of a target composite video segment in one embodiment;
FIG. 10 is a diagram of a full play probability assessment model in one embodiment;
FIG. 11 is a schematic diagram of a continued play probability assessment model in one embodiment;
FIG. 12 is a diagram of determining the matching degree between summary sub-clips and the selection bias information of the target object according to one embodiment;
FIG. 13 is a schematic diagram of a video segment quality assessment model in one embodiment;
FIG. 14 is a schematic flow diagram of constructing a personalized opening and ending clip for a target object in one embodiment;
FIG. 15 is a schematic diagram of a play interest level calculation model in one embodiment;
FIG. 16 is a schematic diagram of a playback intent model of a target object versus a composite clip in one embodiment;
FIG. 17 is a schematic diagram of a continuous play probability estimation model according to one embodiment;
FIG. 18 is a schematic diagram of a playback intent model of a target object versus a dynamic trailer in one embodiment;
FIG. 19 is a schematic diagram of a playback intent model for a next video set in one embodiment;
FIG. 20 is a block diagram showing the structure of a summary video clip generation apparatus according to one embodiment;
FIG. 21 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The solution provided by the embodiments of the application relates to the technical field of artificial intelligence. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The application mainly relates to natural language processing technology. Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The method for generating summary video clips provided by the embodiments of the application can be applied to the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 via a network and is the terminal used by the target object. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104, or placed on a cloud or another server. When there is a video to be played for the target object, the server 104 may acquire a plurality of video clips for the video to be played, determine a video clip set that matches the selection bias information of the target object from the plurality of video clips, synthesize video clips based on the video clip set to obtain at least two composite video clips, perform play completion evaluation on the at least two composite video clips based on the selection bias information and the video to be played to obtain an evaluation result for each composite video clip, determine the summary video clip of the target object for the video to be played from the at least two composite video clips based on the respective evaluation results, and feed the summary video clip back to the terminal 102 for playing. The terminal 102 may be, but is not limited to, a desktop computer, a notebook computer, a smart phone, a tablet computer, an Internet of Things device or a portable wearable device; the Internet of Things device may be a smart speaker, a smart television, a smart air conditioner, a smart vehicle device, or the like, and the portable wearable device may be a smart watch, a smart bracelet, a headset, or the like. The server 104 may be implemented as a stand-alone server, a server cluster composed of a plurality of servers, or a cloud server.
In one embodiment, as shown in FIG. 2, a method for generating a summary video clip is provided. The method may be executed by a terminal or a server alone, or by the terminal and the server in cooperation. In the embodiments of the present application, the method is described, by way of example, as being executed by the server, and comprises the following steps:
step 202, when a video to be played of a target object exists, acquiring a plurality of video clips aiming at the video to be played.
The target object refers to an object to be watched with the video to be played. For example, the target object may specifically refer to an account number that is about to watch the video to be played. For another example, the target object may specifically refer to a user who is about to watch the video to be played. The video to be played refers to the video to be played. For example, the video to be played may specifically refer to a certain episode of a tv show to be played. For another example, the video to be played may specifically refer to a movie to be played.
The video clips refer to clips obtained by splitting video data, and the plurality of video clips for the video to be played refer to a plurality of video clips associated with the video to be played. For example, the plurality of video clips for the video to be played may be clips split from the video to be played. For another example, the plurality of video clips for the video to be played may be clips split from the video associated with the video to be played. For example, when the video to be played is a certain episode in a television series, the plurality of video clips for the video to be played may be clips split from the episode, clips split from episodes with play orders before the play order of the episode, or clips split from episodes with play orders after the play order of the episode.
Specifically, when there is a video to be played for the target object, the server acquires a plurality of video clips for the video to be played.
In a specific application, when the target object uses a video application in the terminal, any video data can be selected as the video to be played. After the target object selects the video to be played, the server that communicates with the terminal over the network can confirm that there is a video to be played for the target object and can then acquire a plurality of video clips for the video to be played. For example, when the target object watches a TV drama using the video application in the terminal, any episode of the drama may be selected as the video to be played.
In a specific application, when the target object uses the video application in the terminal and the video application is playing certain video data, if there is video data whose play order immediately follows that of the video being played and the current video data is about to finish, the server may take the immediately following video data as the video to be played, thereby confirming that there is a video to be played for the target object. For example, when the target object is watching a TV drama and the N-th episode is being played, if an (N+1)-th episode exists and the N-th episode is about to finish, the (N+1)-th episode may be taken as the video to be played, where N is a positive integer greater than or equal to 1.
In a specific application, when there is a video to be played for the target object, the server determines the video set to which the video to be played belongs and the position of the video to be played in that video set, and then selects a plurality of video clips for the video to be played from pre-generated candidate video clips of the video set. The video set to which the video to be played belongs is the set that includes the video to be played. For example, when the video to be played is an episode of a TV drama, the video set may be the complete set of episodes of that drama; when the video to be played is a part of a movie, the video set may be the complete movie that includes that part. The position of the video to be played in the video set describes the order of the video to be played within the set; for example, when the video set is the complete set of a TV drama, the position describes which episode of the drama the video to be played is.
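For illustration only, the selection of candidate clips around the position of the video to be played might look like the following Python sketch; the data model, field names and the `window` parameter are assumptions made for exposition and are not part of the application.

```python
# Illustrative sketch only; the data model and the `window` parameter are assumed.
from dataclasses import dataclass

@dataclass
class CandidateClip:
    clip_id: str
    episode_index: int   # position, within the video set, of the episode the clip was split from
    duration_s: float

def clips_for_video_to_play(all_clips, episode_index, window=1):
    """Pick pre-generated candidate clips whose source episode lies within
    `window` episodes of the video to be played: the episode itself, earlier
    episodes and later episodes, as described above."""
    return [c for c in all_clips if abs(c.episode_index - episode_index) <= window]
```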
Step 204, determine, from the plurality of video clips, a video clip set that matches the selection bias information of the target object, and synthesize video clips based on the video clip set to obtain at least two composite video clips.
The selection bias information is information indicating selection bias of the target object. For example, the selection bias information may specifically refer to an interest tag that characterizes a selection bias of the target object.
Specifically, the server performs selection bias mining based on the selection bias information of the target object, calculates the matching degree between each of the plurality of video clips and the selection bias information, determines, based on these matching degrees, a video clip set that matches the selection bias information of the target object from the plurality of video clips, and synthesizes video clips based on the video clip set to obtain at least two composite video clips.
In a specific application, when calculating the matching degrees between the plurality of video clips and the selection bias information, the server may first construct a vector representation for each of the plurality of video clips and a vector representation of the selection bias information of the target object, and then calculate the matching degrees based on these vector representations. The vector representation of a video clip is a vector used to characterize that video clip, and the vector representation of the selection bias information is a vector used to characterize the selection bias information.
In a specific application, the server may sort the plurality of video clips by their matching degrees with the selection bias information and select the N video clips with the highest matching degrees as the video clip set, where N can be configured according to the actual application scenario. In a specific application, the server can also directly select the video clips whose matching degree with the selection bias information is greater than a matching degree threshold to obtain the video clip set; the matching degree threshold can likewise be configured according to the actual application scenario.
In a specific application, when synthesizing video clips, the server samples video clips based on the number of video clips in the video clip set, and splices the at least two video clips selected in each sampling into one composite video clip. In a specific application, assuming the number of video clips in the video clip set is r×w, the server may sample r times and select w video clips in each sampling to synthesize a composite video clip, where r and w are positive integers greater than 1.
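A minimal Python sketch of this selection-and-sampling step is given below; the function name, the `n_keep`, `r` and `w` parameters, and the use of random sampling without replacement are assumptions for exposition, not the application's concrete implementation.

```python
import random

def build_composite_clips(clips, match_scores, n_keep, r, w, seed=0):
    """Keep the n_keep clips that best match the selection bias information
    (here n_keep is assumed to equal r * w), then sample r groups of w clips;
    each group is later spliced into one composite (candidate summary) clip."""
    ranked = sorted(zip(clips, match_scores), key=lambda p: p[1], reverse=True)
    pool = [clip for clip, _ in ranked[:n_keep]]        # the matched video clip set
    rng = random.Random(seed)
    composites = []
    for _ in range(r):
        composites.append(rng.sample(pool, k=w))        # w clips per sampling
    return composites
```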
Step 206, based on the selection bias information and the video to be played, perform play completion evaluation on the at least two composite video clips respectively, to obtain an evaluation result for each composite video clip.
Specifically, different composite video clips lead to different probabilities that the video to be played will continue to be played. Therefore, when performing play completion evaluation on the at least two composite video clips, for each target composite video clip the server may evaluate, based on the selection bias information and the video to be played, both whether the target composite video clip will be played to the end and whether the video to be played will continue to be played after the target composite video clip is played, and obtain the evaluation result of the target composite video clip by combining the full-play probability (whether the target composite video clip is played completely) and the continued-play probability (whether the video to be played continues to be played when the target composite video clip has been played).
In a specific application, whether the target composite video clip is played to the end depends mainly on the selection bias information and the target composite video clip itself, so the server can evaluate the full-play probability directly from the selection bias information and the target composite video clip. Whether the video to be played continues to be played after the target composite video clip depends on the selection bias information, the target composite video clip and the video to be played itself, so the server evaluates the continued-play probability by combining all three.
Step 208, determine, from the at least two composite video clips, the summary video clip of the target object for the video to be played based on the respective evaluation results of the at least two composite video clips.
The summary video clip is the video clip that is selected from the at least two composite video clips and played together with the video to be played. For example, the summary video clip may be played before the video to be played; when the video to be played is an episode of a TV drama, the summary video clip may serve as the opening of that episode. The summary video clip may instead be played after the video to be played; in the TV drama example, it may serve as the ending of that episode. The summary video clip may also be played while the video to be played is being played, for example when the video to be played reaches a specified play progress. The specified play progress can be configured according to the actual application scenario; for example, it may be 50%, i.e. when the video to be played reaches 50%, its playback is temporarily interrupted, the summary video clip is played, and after the summary video clip finishes, playback of the video to be played continues.
Specifically, the server sorts the at least two composite video clips based on their respective evaluation results, and determines the summary video clip of the target object for the video to be played from the at least two composite video clips according to the sorting result.
In a specific application, the evaluation result may be an evaluation score; the server may sort the at least two composite video clips by their evaluation scores and determine the composite video clip with the highest score as the summary video clip of the target object for the video to be played.
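As a purely illustrative sketch, the evaluation scores could be formed by combining the full-play probability and the continued-play probability described in step 206 and taking the highest-scoring composite clip; the weighted-sum combination and the `alpha` parameter are assumptions, since the application does not fix a particular combination rule.

```python
def pick_summary_clip(composites, p_full_play, p_continue_play, alpha=0.5):
    """Combine, for each composite clip, the probability that it is watched to
    the end with the probability that the main video is then played, and return
    the highest-scoring composite clip as the summary video clip."""
    scores = [alpha * pf + (1 - alpha) * pc
              for pf, pc in zip(p_full_play, p_continue_play)]
    best = max(range(len(composites)), key=lambda i: scores[i])
    return composites[best], scores[best]
```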
According to the above method for generating summary video clips, when there is a video to be played for the target object, a plurality of video clips for the video to be played are acquired, and a video clip set that matches the selection bias information of the target object is determined from them, so that the video clip set is selected with the selection bias information taken into account. Video clips are synthesized based on the video clip set to obtain at least two composite video clips, which serve as candidate summary video clips. Play completion evaluation is then performed on the composite video clips by combining the selection bias information and the video to be played, and the summary video clip of the target object is determined from the composite video clips based on their respective evaluation results. Since the summary video clip is dynamically constructed for the target object from video clips that match the target object's selection bias information, the diversity of summary video clips is enriched, the summary video clip changes as the selection bias information of the target object changes, and the target object is less likely to skip it as a repeated, prefabricated opening or ending, which improves the utilization of video resources.
In one embodiment, determining, from the plurality of video clips, the video clip set that matches the selection bias information of the target object comprises:
constructing a vector representation for each of the plurality of video clips, and constructing a vector representation of the selection bias information of the target object;
fusing the vector representation of each video clip with the vector representation of the selection bias information to obtain a fused vector representation for each video clip;
and determining, based on the respective fused vector representations of the plurality of video clips, the video clip set that matches the selection bias information of the target object.
The vector representation of a video clip is a vector used to characterize that video clip, and the vector representation of the selection bias information is a vector used to characterize the selection bias information.
Specifically, for each target video clip in the plurality of video clips, the server extracts video frames from the target video clip to obtain multiple target video frames, constructs a vector representation for each target video frame, and constructs the vector representation of the target video clip based on these frame vector representations. Meanwhile, the server constructs the vector representation of the selection bias information of the target object by text encoding. After constructing the vector representations of the video clips and of the selection bias information, the server fuses the vector representation of each video clip with the vector representation of the selection bias information to obtain a fused vector representation for each video clip; the fused vector representation fully combines the features of the video clip and the features of the selection bias information.
In a specific application, the vector representation of the selection bias information of the target object may be constructed by a pre-trained text encoding model, which may be trained according to the actual application scenario. For example, the pre-trained text encoding model may be a text Transformer-Encoder model, i.e. the Encoder part of the Transformer model. In a specific application, as shown in FIG. 3, the selection bias information may be N interest tags (interest tag 1, interest tag 2, ..., interest tag N in FIG. 3). When constructing the vector representation of the selection bias information of the target object, the server may input the N interest tags into the pre-trained text Transformer-Encoder model to obtain the vector representation of the selection bias information, which includes a vector representation for each of the N interest tags.
In a specific application, the server may fuse the vector representation of each video clip with the vector representation of the selection bias information based on an attention mechanism, to obtain the fused vector representation of each video clip. The attention mechanism is an allocation mechanism whose core idea is to highlight the important features of the attended objects and to reallocate resources, i.e. weights, according to the importance of those objects: the relevance between the attended objects is found from the original data, and their important features are then highlighted. In this embodiment, the relevance between the video clips and the selection bias information is found from their vector representations, so that interactive fusion highlights the important features of the representations, and the video clip set that matches the selection bias information of the target object can be determined from these features. In a specific application, the server may fuse the vector representations of the video clips with the vector representation of the selection bias information through a pre-trained Cross-Attention layer to obtain the fused vector representations.
In a specific application, the server calculates the matching degree between each of the plurality of video clips and the selection bias information based on their fused vector representations, and determines the video clip set that matches the selection bias information of the target object from the plurality of video clips based on these matching degrees. In a specific application, the server may input the fused vector representation of each video clip into a pre-trained fully connected layer to calculate the matching degree between that video clip and the selection bias information.
In a specific application, the matching degree between each of the plurality of video clips and the selection bias information can be calculated by a matching degree calculation model as shown in FIG. 4. The matching degree calculation model includes a text Transformer-Encoder model, a video clip representation model for constructing the vector representation of the target video clip, a Cross-Attention layer, and a fully connected layer. For each target video clip, the server inputs the target video clip into the video clip representation model to construct its vector representation, inputs the selection bias information of the target object into the text Transformer-Encoder model to construct the vector representation of the selection bias information, fuses the two vector representations through the Cross-Attention layer to obtain the fused vector representation of the target video clip, and inputs the fused vector representation into the fully connected layer, which outputs the matching degree between the target video clip and the selection bias information.
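The following PyTorch-style sketch illustrates the general shape of such a matching head; the dimensions, the use of `nn.MultiheadAttention` for the Cross-Attention layer and the sigmoid output are assumptions for exposition, not the concrete model of FIG. 4.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Sketch of the matching-degree computation of FIG. 4: the clip vector
    attends over the interest-tag vectors (cross-attention) and a fully
    connected layer maps the fused vector to a matching degree."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, 1)

    def forward(self, clip_vec, tag_vecs):
        # clip_vec: (B, 1, dim) vector representation of one video clip
        # tag_vecs: (B, T, dim) vector representations of the T interest tags
        fused, _ = self.cross_attn(query=clip_vec, key=tag_vecs, value=tag_vecs)
        return torch.sigmoid(self.fc(fused.squeeze(1)))  # matching degree in (0, 1)

# usage sketch
head = MatchingHead()
degree = head(torch.randn(2, 1, 256), torch.randn(2, 5, 256))  # shape (2, 1)
```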
In this embodiment, by first constructing the vector representations and then fusing them, fused vector representations that highlight the important features of the video clips and of the selection bias information are obtained through interactive fusion, and the video clip set that matches the selection bias information of the target object can then be determined based on these fused vector representations.
In one embodiment, constructing a vector representation for each of the plurality of video clips comprises:
for each target video clip in the plurality of video clips, extracting video frames from the target video clip to obtain multiple target video frames corresponding to the target video clip;
constructing a vector representation for each of the target video frames, and constructing a vector representation of the descriptive content of the target video clip;
and performing multi-modal fusion on the vector representations of the target video frames and the vector representation of the descriptive content of the target video clip to obtain the vector representation of the target video clip.
The descriptive content of the target video clip is content used to describe the target video clip. For example, it may be content extracted from the target video clip, such as subtitles extracted from the clip. Multi-modal fusion refers to integrating information of different modalities into one stable multi-modal representation; in this embodiment, it means integrating the vector representations of the target video frames and the vector representation of the descriptive content into the vector representation of the target video clip.
Specifically, for each target video clip in the plurality of video clips, the server extracts video frames from the target video clip to obtain multiple target video frames, constructs a vector representation for each target video frame, constructs a vector representation of the descriptive content of the target video clip, and performs multi-modal fusion on the frame vector representations and the descriptive-content vector representation to obtain the vector representation of the target video clip.
In a specific application, the video frames may be extracted by uniform frame extraction, i.e. at equal time intervals, and the number of extracted frames may be the same for each video clip. The server may construct the vector representations of the target video frames through a pre-trained video frame depth representation model, and construct the vector representation of the descriptive content through a pre-trained text encoding model. The pre-trained video frame depth representation model can be configured according to the actual application scenario; for example, it may be a ViT (Vision Transformer) model. The pre-trained text encoding model can also be configured according to the actual application scenario; for example, it may again be a text Transformer-Encoder model, i.e. the Encoder part of the Transformer model.
In a specific application, as shown in FIG. 5, the descriptive content of the target video clip may be N words (word 1, word 2, ..., word N in FIG. 5). When constructing the vector representation of the descriptive content of the target video clip, the server may input the N words into the pre-trained text Transformer-Encoder model to obtain the vector representation of the descriptive content, which includes a vector representation for each of the N words (the word 1 vector representation, word 2 vector representation, ..., word N vector representation shown in FIG. 5).
In a specific application, the server may perform multi-modal fusion on the vector representations of the target video frames and the vector representation of the descriptive content based on an attention mechanism, to obtain the vector representation of the target video clip. By using the attention mechanism for multi-modal fusion, the relevance between the target video frames and the descriptive content of the target video clip can be found from their vector representations, and the important features of the representations are highlighted through interactive fusion.
In a specific application, the server may perform the multi-modal fusion through a pre-trained multi-modal Transformer-Encoder model to obtain the vector representation of the target video clip.
In a specific application, the vector representation of the target video clip can be constructed by a video clip representation model as shown in FIG. 6, which includes a text Transformer-Encoder model, a ViT model and a multi-modal Transformer-Encoder model. For each target video clip in the plurality of video clips, the server extracts video frames from the target video clip to obtain multiple target video frames (N frames are shown in FIG. 6), constructs the vector representation of each target video frame through the ViT model, constructs the vector representation of the descriptive content of the target video clip through the text Transformer-Encoder model (as shown in FIG. 6, the descriptive content includes N words and the constructed representation includes a vector representation for each word), and performs multi-modal fusion on the frame vector representations and the descriptive-content vector representation through the multi-modal Transformer-Encoder model to obtain the vector representation of the target video clip. During the multi-modal fusion, a SEP token is used to separate the vector representations of the target video frames from the vector representation of the descriptive content.
In this embodiment, for each target video clip, performing multi-modal fusion on the vector representations of the target video frames and the vector representation of the descriptive content combines representations of different modalities, so that a vector representation of the target video clip capturing global information can be obtained.
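A hedged PyTorch sketch of this kind of multi-modal fusion is shown below; the dimensions, pooling strategy and the learned [SEP] embedding are assumptions, and the frame and word vectors are presumed to come from upstream ViT and text encoders as described above.

```python
import torch
import torch.nn as nn

class ClipRepresentation(nn.Module):
    """Sketch of the video clip representation of FIG. 6: frame vectors (e.g.
    from a ViT) and caption word vectors (e.g. from a text encoder) are
    concatenated around a learned [SEP] embedding, fused by a Transformer
    encoder, and mean-pooled into the clip vector."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.sep = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_vecs, word_vecs):
        # frame_vecs: (B, F, dim) frame vectors, word_vecs: (B, W, dim) word vectors
        sep = self.sep.expand(frame_vecs.size(0), -1, -1)
        tokens = torch.cat([frame_vecs, sep, word_vecs], dim=1)
        return self.encoder(tokens).mean(dim=1)          # (B, dim) clip vector
```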
In one embodiment, constructing a vector representation for each of the target video frames comprises:
for each target video frame, segmenting the target video frame to obtain a plurality of video frame region blocks;
performing vector conversion on the video frame region blocks to obtain a region block vector for each video frame region block;
and encoding based on the region block vectors of the video frame region blocks to obtain the vector representation of the target video frame.
Specifically, for each target video frame, the server segments the target video frame to obtain a plurality of video frame region blocks, performs vector conversion on the video frame region blocks using a vector conversion function to obtain a region block vector for each video frame region block, and encodes based on these region block vectors to obtain the vector representation of the target video frame.
In a specific application, the server may segment the target video frame according to a preconfigured number of region blocks, which can be set according to the actual application scenario. Before segmenting the target video frame, in order to reduce the amount of data processing, the server can map the pixel values of the pixels in the target video frame to a preset range, which can also be configured according to the actual application scenario. For example, the preset range may be [0, 1], and mapping the pixel values to this range can be understood as normalizing the pixel values.
In a specific application, the vector conversion function can be configured according to the actual application scenario. For example, it may be a flatten function, which flattens a matrix into a vector; in this embodiment, the matrix formed by the pixel values of a video frame region block is flattened to obtain the region block vector of that region block.
In a specific application, the server may construct the vector representations of the target video frames through a pre-trained video frame depth representation model, which can be configured according to the actual application scenario. In a specific application, the pre-trained video frame depth representation model may be as shown in FIG. 7, and includes an Embedded Patches layer, a Transformer-Encoder layer and an MLP Head layer. For each target video frame, the server normalizes the pixel values of the target video frame in the Embedded Patches layer, segments the frame into a plurality of video frame region blocks (nine region blocks are shown in FIG. 7), performs vector conversion on the region blocks with the vector conversion function (i.e. the Linear Projection of Flattened Patches in FIG. 7) to obtain the region block vector of each block, inputs the region block vectors into the Transformer-Encoder layer, and passes the vector processed by the Transformer-Encoder layer through the MLP Head (a fully connected layer) to obtain the vector representation of the target video frame; that is, the output of the fully connected layer is the vector representation of the target video frame.
In a specific application, the structure of the Transformer-Encoder layer in this embodiment may be as shown in FIG. 7, and includes L (denoted by Lx in FIG. 7) Encoder modules, each of which includes two Norm layers, a Multi-Head Attention layer and an MLP (multi-layer perceptron). When training the pre-trained video frame depth representation model, the training data acquired by the server may be sample data for a classification task; as shown in FIG. 7, the pre-trained video frame depth representation model is in fact a classification model for the target video frame, but in this embodiment the purpose is mainly to obtain the vector representation of the target video frame output by the fully connected layer, rather than to classify the target video frame.
In this embodiment, for each target video frame, an accurate vector representation of the target video frame can be constructed by first segmenting the frame, then performing vector conversion, and finally encoding.
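For illustration, a ViT-style frame representation along the lines of FIG. 7 might be sketched as follows; the patch size, dimensions and pooling are assumptions, and the sketch omits the position embeddings and classification token of a full ViT.

```python
import torch
import torch.nn as nn

class FrameRepresentation(nn.Module):
    """ViT-style sketch of FIG. 7: the frame is normalised to [0, 1], cut into
    region blocks, each block is flattened and linearly projected, and a
    Transformer encoder plus an MLP head produce the frame vector.
    Assumes H and W are multiples of the patch size."""
    def __init__(self, patch=16, dim=256, heads=4, layers=2):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)    # linear projection of flattened patches
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)                  # stands in for the MLP Head

    def forward(self, frame):
        # frame: (B, 3, H, W) tensor of uint8 pixel values
        x = frame.float() / 255.0                        # map pixel values to [0, 1]
        B, C, H, W = x.shape
        p = self.patch
        x = x.unfold(2, p, p).unfold(3, p, p)            # (B, C, H//p, W//p, p, p) region blocks
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)   # flatten each block
        tokens = self.proj(x)                            # region block vectors
        return self.head(self.encoder(tokens).mean(dim=1))  # frame vector, shape (B, dim)
```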
In one embodiment, performing play completion evaluation on the at least two composite video clips respectively based on the selection bias information and the video to be played, to obtain an evaluation result for each composite video clip, comprises:
acquiring the vector representation of the selection bias information, and constructing a first vector representation of the video to be played;
for each target composite video clip in the at least two composite video clips, constructing a second vector representation of the target composite video clip;
and performing play completion evaluation on the target composite video clip based on the vector representation of the selection bias information, the first vector representation and the second vector representation, to obtain the evaluation result of the target composite video clip.
Wherein the first vector representation refers to a vector for characterizing a video to be played. The second vector representation refers to a vector used to characterize the target composite video segment.
Specifically, the server obtains the vector representation of the selection bias information and constructs a first vector representation of the video to be played based on a plurality of video segments to be played in the video to be played. For each target composite video segment in the at least two composite video segments, the server constructs a second vector representation of the target composite video segment based on the plurality of composite sub-segments in the target composite video segment. Based on the vector representation of the selection bias information and the second vector representation, the server evaluates whether the target composite video segment will be played completely; based on the vector representation of the selection bias information, the first vector representation and the second vector representation, the server evaluates whether the video to be played will continue to be played in the case of playing the target composite video segment. The complete play probability of whether the target composite video segment is played completely and the continued play probability of whether the video to be played continues to be played are then combined to obtain the corresponding evaluation result of the target composite video segment.
In this embodiment, by obtaining the vector representation of the selection bias information, a first vector representation of the video to be played is constructed, and a second vector representation of the target composite video segment is constructed for each target composite video segment in the at least two composite video segments, so that the playing completion degree evaluation of the target composite video segment can be implemented by using the vector representation, and a corresponding evaluation result of the target composite video segment is obtained.
In one embodiment, constructing a first vector representation of a video to be played includes:
selecting a plurality of video clips to be played from the video to be played, and respectively constructing respective vector representations of the video clips to be played;
and fusing the vector representations corresponding to the video clips to be played to obtain a first vector representation of the video to be played.
The corresponding vector representation of the video clip to be played refers to a vector for representing the video clip to be played.
Specifically, the server selects a plurality of video segments to be played from the videos to be played, constructs respective vector representations of the video segments to be played, fuses the respective vector representations of the video segments to be played based on an attention mechanism, and obtains a first vector representation of the video to be played. In this embodiment, by fusing the respective vector representations of the plurality of video clips to be played based on the attention mechanism, it is possible to find the relevance between the plurality of video clips to be played based on the respective vector representations of the plurality of video clips to be played, so as to highlight some important features in their vector representations through interactive fusion.
In a specific application, the server disassembles the video to be played to obtain a plurality of disassembled video segments corresponding to the video to be played, and then selects a plurality of video clips to be played from the plurality of disassembled video segments, wherein the number of selected video clips to be played can be configured according to the actual application scenario. The server can respectively construct the vector representations of the plurality of video clips to be played through a pre-trained to-be-played video clip representation model, and the to-be-played video clip representation model can be trained according to the actual application scenario. In a specific application, the to-be-played video clip representation model may be as shown in FIG. 6, that is, for the target video clip and the video clip to be played, the same video clip representation model may be used to construct the corresponding vector representations respectively. The server can fuse the respective vector representations of the plurality of video clips to be played through a pre-trained multimodal Transformer-Encoder model to obtain the first vector representation of the video to be played.
In a specific application, the first vector representation of the video to be played may be constructed by a to-be-played video depth representation model as shown in FIG. 8, where the to-be-played video depth representation model includes a plurality of video clip representation models (L are illustrated in FIG. 8, i.e. L video clips to be played are selected) and a multimodal Transformer-Encoder model. The server selects L video clips to be played from the video to be played, constructs the corresponding vector representations of the L video clips to be played (video clip 1 vector representation, video clip 2 vector representation, ... video clip L vector representation) through the L video clip representation models, and fuses the corresponding vector representations of the L video clips to be played through the multimodal Transformer-Encoder model to obtain the first vector representation of the video to be played, where the first vector representation comprises the fused vector representations of the L video clips to be played.
In this embodiment, a plurality of video segments to be played are selected from the video to be played, so as to respectively construct respective vector representations of the plurality of video segments to be played, and the respective vector representations of the plurality of video segments to be played are fused, so that a first vector representation of the video to be played capturing global information can be obtained by combining the respective vector representations of the plurality of video segments to be played.
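A minimal sketch of this fusion step is given below, assuming the multimodal Transformer-Encoder can be approximated by a standard PyTorch TransformerEncoder over the per-segment vectors; the dimensions, the number of segments, and all names are illustrative. The same structure can be reused for fusing the composite sub-segment vectors into the second vector representation described next.

```python
import torch
import torch.nn as nn

class SegmentVectorFusion(nn.Module):
    """Fuse the vector representations of several video segments via self-attention (illustrative)."""

    def __init__(self, dim=768, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, segment_vectors):        # (B, L, dim): L vector representations of segments to be played
        # Attention lets the segment vectors interact so that shared important features are highlighted.
        return self.fusion(segment_vectors)    # fused per-segment vectors forming the first vector representation


segments = torch.rand(1, 5, 768)               # e.g. L = 5 selected video segments to be played
first_vector_representation = SegmentVectorFusion()(segments)   # shape (1, 5, 768)
```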
In one embodiment, a target composite video clip includes a plurality of composite sub-clips; constructing a second vector representation of the target composite video segment includes:
respectively constructing the vector representations of the plurality of composite sub-segments;
and fusing the vector representations corresponding to the composite sub-segments to obtain the second vector representation of the target composite video segment.
The corresponding vector representation of a composite sub-segment refers to a vector used to characterize that composite sub-segment.
Specifically, the server respectively constructs the vector representations of the plurality of composite sub-segments, and fuses the respective vector representations of the plurality of composite sub-segments based on the attention mechanism to obtain the second vector representation of the target composite video segment. In this embodiment, by fusing the respective vector representations of the plurality of composite sub-segments based on the attention mechanism, it is possible to find the relevance between the plurality of composite sub-segments based on their respective vector representations, so as to highlight some important features in their vector representations through interactive fusion.
In a specific application, the server can respectively construct the vector representations of the plurality of composite sub-segments through a pre-trained composite sub-segment representation model, and the composite sub-segment representation model can be trained according to the actual application scenario. In a specific application, the composite sub-segment representation model may be as shown in FIG. 6, i.e. for the target video segment, the video segment to be played and the composite sub-segment, the same video segment representation model may be used to construct the corresponding vector representations respectively. The server may fuse the respective vector representations of the plurality of composite sub-segments through a pre-trained multimodal Transformer-Encoder model to obtain the second vector representation of the target composite video segment.
In one specific application, the second vector representation of the target composite video segment may be constructed by a target composite video segment depth representation model as shown in FIG. 9, which includes a plurality of video segment representation models (w are illustrated in FIG. 9, i.e. the number of composite sub-segments included in the target composite video segment is w) and a multimodal Transformer-Encoder model. The server respectively constructs the vector representations of the w composite sub-segments (video segment 1 representation, ... video segment s representation, ... video segment w representation) through the w video segment representation models, and fuses the respective vector representations of the w composite sub-segments through the multimodal Transformer-Encoder model to obtain the second vector representation of the target composite video segment, where the second vector representation comprises the fused vector representations of the w composite sub-segments.
In this embodiment, by respectively constructing the vector representations corresponding to the plurality of composite sub-segments and fusing these vector representations, the second vector representation of the target composite video segment, which captures global information, can be obtained by combining the vector representations corresponding to the plurality of composite sub-segments.
In one embodiment, performing playback completion evaluation on the target composite video segment based on the vector representation, the first vector representation, and the second vector representation of the selection bias information, and obtaining a corresponding evaluation result for the target composite video segment includes:
based on the vector representation and the second vector representation of the selection bias information, performing complete play probability evaluation on the target synthesized video segment to obtain the complete play probability of the target synthesized video segment;
based on the vector representation of the selection bias information, the first vector representation and the second vector representation, carrying out continuous play probability evaluation on the video to be played, and obtaining the continuous play probability of the video to be played;
and combining the complete playing probability and the continuous playing probability to obtain a corresponding evaluation result of the target synthesized video clip.
The full play probability assessment is used for assessing the probability that the target synthesized video clip is played completely. The continued play probability evaluation is used for evaluating the probability that the video to be played is continued to be played in the case of playing the target composite video clip.
Specifically, the server fuses the vector representation of the selection bias information and the second vector representation to obtain a fused vector representation of the target synthesized video segment, and performs complete play probability evaluation based on the fused vector representation of the target synthesized video segment to obtain the complete play probability of the target synthesized video segment. The server fuses the vector representation of the selection bias information, the first vector representation and the second vector representation to obtain a target fused vector representation of the video to be played, and performs continuous play probability evaluation based on the target fused vector representation of the video to be played to obtain the continuous play probability of the video to be played. The complete play probability and the continuous play probability are then combined to obtain the corresponding evaluation result of the target synthesized video segment.
In a specific application, the server may use the product of the complete playing probability and the continuous playing probability as the corresponding evaluation result of the target synthesized video segment, or may perform weighted summation on the complete playing probability and the continuous playing probability, and use the weighted summation result as the corresponding evaluation result of the target synthesized video segment. The weight coefficient in the weighted summation may be configured according to the actual application scenario, which is not specifically limited in this embodiment.
In this embodiment, by performing the complete play probability evaluation on the target composite video segment, the complete play probability of the target composite video segment is obtained, the continuous play probability evaluation is performed on the video to be played, the continuous play probability of the video to be played is obtained, and the play completion degree evaluation can be performed on the target composite video segment by considering both the complete play probability and the continuous play probability, so as to obtain the corresponding evaluation result of the target composite video segment.
In one embodiment, performing a full play probability evaluation on the target composite video clip based on the vector representation and the second vector representation of the selection bias information, the obtaining the full play probability of the target composite video clip comprises:
fusing the vector representation of the selection bias information and the second vector representation to obtain a fused vector representation of the target synthesized video segment;
and carrying out complete play probability evaluation based on the fusion vector representation of the target synthesized video segment to obtain the complete play probability of the target synthesized video segment.
Specifically, the server fuses the vector representation of the selection bias information and the second vector representation based on the attention mechanism to obtain a fused vector representation of the target composite video segment, and performs the complete play probability evaluation based on the fused vector representation of the target composite video segment to obtain the complete play probability of the target composite video segment. In a specific application, the server can fuse the vector representation of the selection bias information and the second vector representation through a pre-trained Cross-Attention layer to obtain the fused vector representation of the target composite video segment, and then input the fused vector representation of the target composite video segment into a pre-trained fully connected layer to obtain the complete play probability of the target composite video segment.
In one specific application, the complete play probability of the target composite video segment may be obtained by a complete play probability evaluation model as shown in FIG. 10, which includes a target composite video segment depth representation model for constructing the second vector representation of the target composite video segment, a text encoding model for constructing the vector representation of the selection bias information, a Cross-Attention layer, and a fully connected layer. The server inputs the target composite video segment into the target composite video segment depth representation model to construct the second vector representation, inputs the selection bias information into the text encoding model to construct the vector representation of the selection bias information, fuses the vector representation of the selection bias information and the second vector representation through the Cross-Attention layer to obtain the fused vector representation of the target composite video segment, and performs complete play probability evaluation based on the fused vector representation of the target composite video segment through the fully connected layer to obtain the complete play probability of the target composite video segment.
In this embodiment, by fusing the vector representation of the selection bias information and the second vector representation, a fused vector representation of the target synthesized video segment with prominent important features can be obtained, and further, the complete play probability evaluation can be performed based on the fused vector representation of the target synthesized video segment, so as to obtain the complete play probability of the target synthesized video segment.
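A minimal sketch of this complete play probability head follows, assuming the Cross-Attention layer can be modelled with torch.nn.MultiheadAttention (one plausible arrangement: the selection bias vector as query, the composite segment vectors as key/value) followed by a fully connected layer with a sigmoid; all shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class CompletePlayProbability(nn.Module):
    """Cross-attention fusion of the selection bias vector with the composite segment vectors, then scoring."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.fully_connected = nn.Linear(dim, 1)

    def forward(self, bias_vector, second_vector):
        # bias_vector:   (B, 1, dim)  vector representation of the selection bias information
        # second_vector: (B, w, dim)  fused vectors of the w composite sub-segments
        fused, _ = self.cross_attention(bias_vector, second_vector, second_vector)
        return torch.sigmoid(self.fully_connected(fused.squeeze(1)))   # complete play probability in [0, 1]


complete_prob = CompletePlayProbability()(torch.rand(1, 1, 768), torch.rand(1, 4, 768))
```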
In one embodiment, based on the vector representation, the first vector representation, and the second vector representation of the selection bias information, performing a continuing play probability evaluation on the video to be played, obtaining the continuing play probability of the video to be played includes:
fusing the vector representation of the selection bias information and the first vector representation to obtain a fused vector representation of the video to be played;
fusing the vector representation of the selection bias information and the second vector representation to obtain a fused vector representation of the target synthesized video segment;
fusing the fusion vector representation of the video to be played and the fusion vector representation of the target synthesized video segment to obtain the target fusion vector representation of the video to be played;
and carrying out continuous play probability evaluation based on the target fusion vector representation of the video to be played, and obtaining the continuous play probability of the video to be played.
Specifically, the server fuses the vector representation of the selection bias information and the first vector representation based on the attention mechanism to obtain a fusion vector representation of the video to be played, and fuses the vector representation of the selection bias information and the second vector representation based on the attention mechanism to obtain a fusion vector representation of the target synthesized video segment. The server splices the fusion vector representation of the video to be played and the fusion vector representation of the target synthesized video segment to obtain a target fusion vector representation of the video to be played, and carries out continuous play probability evaluation based on the target fusion vector representation of the video to be played to obtain the continuous play probability of the video to be played. In a specific application, the server can realize the attention-based fusion of vector representations through a pre-trained Cross-Attention layer, and after obtaining the target fusion vector representation, the target fusion vector representation can be input into a pre-trained fully connected layer to obtain the continuous play probability of the video to be played.
In one specific application, the continued play probability of the video to be played may be obtained by a continued play probability evaluation model as shown in FIG. 11, which includes a target composite video segment depth representation model for constructing the second vector representation of the target composite video segment, a text encoding model for constructing the vector representation of the selection bias information, a to-be-played video depth representation model for constructing the first vector representation of the video to be played, two Cross-Attention layers, and a fully connected layer.
The server inputs the target synthesized video segment into the target synthesized video segment depth representation model to construct the second vector representation of the target synthesized video segment, inputs the selection bias information into the text encoding model to construct the vector representation of the selection bias information, and inputs the video to be played into the to-be-played video depth representation model to construct the first vector representation of the video to be played. The vector representation of the selection bias information and the first vector representation are fused through one Cross-Attention layer to obtain the fusion vector representation of the video to be played, and the vector representation of the selection bias information and the second vector representation are fused through the other Cross-Attention layer to obtain the fusion vector representation of the target synthesized video segment. The fusion vector representation of the video to be played and the fusion vector representation of the target synthesized video segment are spliced to obtain the target fusion vector representation of the video to be played, and continuous play probability evaluation is carried out based on the target fusion vector representation of the video to be played through the fully connected layer, so that the continued play probability of the video to be played is obtained.
In this embodiment, the vector representation of the selection bias information and the first vector representation are fused, so that a fusion vector representation of the video to be played with prominent important features can be obtained, and the vector representation of the selection bias information and the second vector representation are fused, so that a fusion vector representation of the target synthesized video segment with prominent important features can be obtained. The fusion vector representation of the video to be played and the fusion vector representation of the target synthesized video segment can then be combined to obtain a feature-rich target fusion vector representation of the video to be played, so that continuous play probability evaluation can be performed based on the target fusion vector representation of the video to be played to obtain the continuous play probability of the video to be played.
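Continuing the same assumptions, the sketch below approximates the continued play probability head with two cross-attention fusions and a concatenation, and shows one possible way of combining the two probabilities into the evaluation result (the product form mentioned earlier); all shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class ContinuedPlayProbability(nn.Module):
    """Two cross-attention fusions (bias x video, bias x composite segment), concatenation, FC scoring."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attend_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attend_segment = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fully_connected = nn.Linear(2 * dim, 1)

    def forward(self, bias_vector, first_vector, second_vector):
        # bias_vector: (B, 1, dim); first_vector: (B, L, dim); second_vector: (B, w, dim)
        video_fused, _ = self.attend_video(bias_vector, first_vector, first_vector)
        segment_fused, _ = self.attend_segment(bias_vector, second_vector, second_vector)
        target_fused = torch.cat([video_fused, segment_fused], dim=-1)   # target fusion vector representation
        return torch.sigmoid(self.fully_connected(target_fused.squeeze(1)))


bias, first, second = torch.rand(1, 1, 768), torch.rand(1, 5, 768), torch.rand(1, 4, 768)
continued_prob = ContinuedPlayProbability()(bias, first, second)
# With complete_prob from the complete play probability head, one product-form evaluation result would be:
# evaluation_result = complete_prob * continued_prob
```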
In one embodiment, the summarized video clip comprises a plurality of summarized sub-clips; after determining the summary video segments of the target object for the video to be played from the at least two synthesized video segments based on the respective evaluation results of the at least two synthesized video segments, the method further comprises:
acquiring a vector representation of the summary video segment and a vector representation of the selection bias information; the vector representation of the summary video segment includes the respective vector representations of the plurality of summary sub-segments;
fusing the vector representations corresponding to the plurality of summary sub-segments with the vector representation of the selection bias information respectively to obtain the fused vector representations corresponding to the plurality of summary sub-segments;
segment sorting is carried out based on the fusion vector representations corresponding to the summary sub-segments respectively, and a segment sorting result is obtained;
and according to the segment sorting result, adjusting the order of the plurality of summary sub-segments in the summary video segment to obtain the target summary video segment.
Specifically, in the process of generating the summary video segment from the plurality of video segments of the video to be played, the playing order of the plurality of summary sub-segments is determined only based on their degree of matching with the selection bias information of the target object. This may cause the generated summary video segment to lack an overall understanding of the content of the summary sub-segments and affect the playing effect of the summary video segment. Therefore, after the summary video segment is generated, the server performs joint modeling based on the vector representation of the summary video segment and the vector representation of the selection bias information, re-predicts the degree of matching between the plurality of summary sub-segments in the summary video segment and the selection bias information of the target object, and re-ranks and adjusts the positions of the summary sub-segments in the summary video segment based on the matching degree.
Specifically, the server obtains the vector representation of the summary video segment and the vector representation of the selection bias information, and fuses the vector representations of the plurality of summary sub-segments included in the vector representation of the summary video segment with the vector representation of the selection bias information based on the attention mechanism to obtain the fused vector representations of the summary sub-segments. The server determines the matching degree between each summary sub-segment and the selection bias information of the target object based on the fused vector representations of the summary sub-segments, sorts the summary sub-segments based on these matching degrees to obtain the segment sorting result, and adjusts the order of the summary sub-segments in the summary video segment according to the segment sorting result to obtain the target summary video segment. In a specific application, the server can fuse the respective vector representations of the plurality of summary sub-segments with the vector representation of the selection bias information through a pre-trained Cross-Attention layer to obtain the respective fused vector representations of the plurality of summary sub-segments.
In a specific application, a schematic diagram for determining the matching degree between the selection bias information of the target object and the plurality of summary sub-segments may be as shown in FIG. 12. The server obtains the vector representation of the summary video segment and the vector representation of the selection bias information, where the vector representation of the summary video segment includes the vector representations corresponding to the plurality of summary sub-segments (shown in FIG. 12 as summary sub-segment 1 vector representation, ... summary sub-segment s vector representation, ... summary sub-segment w vector representation). Taking the determination of the matching degree between summary sub-segment s and the selection bias information of the target object as an example, the server fuses the vector representation corresponding to summary sub-segment s with the vector representation of the selection bias information through a Cross-Attention layer to obtain the fused vector representation corresponding to summary sub-segment s, and obtains the matching degree between summary sub-segment s and the selection bias information of the target object through a fully connected layer based on the fused vector representation corresponding to summary sub-segment s.
In this embodiment, the plurality of summary sub-segments in the summary video segment can be sorted based on the vector representation of the summary video segment and the vector representation of the selection bias information to obtain a segment sorting result, and then the sequence of the plurality of summary sub-segments in the summary video segment can be adjusted according to the segment sorting result to obtain the target summary video segment that is more matched with the selection bias information of the target object.
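A sketch of this re-ranking step is shown below under the same assumptions (a Cross-Attention layer modelled with torch.nn.MultiheadAttention and a fully connected scoring layer); the helper names, the per-segment loop, and the use of the bias vector as the attention query are illustrative choices.

```python
import torch
import torch.nn as nn

def rerank_summary_sub_segments(sub_segment_vectors, bias_vector, cross_attention, fully_connected):
    """Score each summary sub-segment against the selection bias information and return the new order.

    sub_segment_vectors: (w, dim) vector representations of the w summary sub-segments
    bias_vector:         (1, dim) vector representation of the selection bias information
    cross_attention, fully_connected: pre-trained layers, passed in here for illustration
    """
    scores = []
    for vector in sub_segment_vectors:
        fused, _ = cross_attention(bias_vector.unsqueeze(0),          # query: selection bias information
                                   vector.view(1, 1, -1),             # key/value: one sub-segment vector
                                   vector.view(1, 1, -1))
        scores.append(torch.sigmoid(fully_connected(fused)).item())   # matching degree with the target object
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


cross_attention = nn.MultiheadAttention(768, 8, batch_first=True)
fully_connected = nn.Linear(768, 1)
new_order = rerank_summary_sub_segments(torch.rand(4, 768), torch.rand(1, 768), cross_attention, fully_connected)
```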
In one embodiment, obtaining a plurality of video clips for a video to be played includes:
determining a video set to which the video to be played belongs and the position of the video to be played in the video set;
determining a video range for generating a summary video clip in the video set based on the position of the video to be played in the video set;
and selecting a plurality of video clips of the video range from a plurality of candidate video clips of the pre-generated video set.
The video set to which the video to be played belongs refers to a set that includes the video to be played. For example, when the video to be played is a certain episode of a television series, the video set to which the video to be played belongs may specifically refer to the complete episode collection of that television series. For another example, when the video to be played is a portion of a movie, the video set to which the video to be played belongs may specifically refer to the full movie containing that portion. The position of the video to be played in the video set is used to describe the order of the video to be played in the video set. For example, when the video set is the complete episode collection of a television series, the position of the video to be played in the video set may be used to describe which episode of the collection the video to be played is.
Specifically, the server determines a video set to which the video to be played belongs and a position of the video to be played in the video set based on the video identifier of the video to be played, determines a video range for generating the summary video segment in the video set based on the position of the video to be played in the video set and the type of the summary video segment to be generated, and selects a plurality of video segments of the video range from a plurality of candidate video segments of the pre-generated video set. The video identifier is used for uniquely identifying the video to be played and can be configured according to the actual application scene.
In particular applications, the determined video range may be different for different types of summary video segments. For example, when the summary video segment is a head segment, the video range may be determined as the videos in the video set that are ordered before the video to be played, or those videos together with the video to be played itself. For another example, when the summary video segment is a tail segment, the video range may be determined as the videos in the video set that are ordered after the video to be played, or those videos together with the video to be played itself.
In this embodiment, by determining the video set to which the video to be played belongs and the position of the video to be played in the video set, the video range for generating the summary video segment in the video set can be determined by using the position of the video to be played in the video set, so that a plurality of video segments in the video range can be selected from a plurality of candidate video segments in the pre-generated video set, and the acquisition of a plurality of video segments for the video to be played is realized.
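A minimal sketch of the range selection follows, assuming the video set is an ordered list of episodes and the summary type is either a head segment or a tail segment; whether the current episode itself is included is one of the configurable choices mentioned above, and all names are illustrative.

```python
def select_video_range(episodes, position, summary_type, include_current=True):
    """Return the episodes whose candidate segments may be used for the summary video segment.

    episodes:     ordered list of episode identifiers in the video set
    position:     index of the video to be played within that list
    summary_type: "head" (built from earlier content) or "tail" (built from later content)
    """
    if summary_type == "head":
        return episodes[: position + 1] if include_current else episodes[:position]
    if summary_type == "tail":
        return episodes[position:] if include_current else episodes[position + 1:]
    raise ValueError("unknown summary video segment type")


video_range = select_video_range(["ep1", "ep2", "ep3", "ep4"], position=2, summary_type="head")
```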
In one embodiment, the plurality of candidate video clips are obtained by video clip quality analysis of a video collection, the video clip quality analysis of the video collection comprising:
segment disassembly is carried out on the video set to obtain a plurality of segments to be processed;
respectively carrying out segment quality evaluation on a plurality of segments to be processed to obtain a segment quality evaluation result corresponding to each segment to be processed;
and determining a plurality of candidate video clips from the plurality of clips to be processed based on respective clip quality evaluation results of the plurality of clips to be processed.
Specifically, the plurality of candidate video segments are obtained by performing video segment quality analysis on the video set. When performing the video segment quality analysis on the video set, the server can perform segment disassembly on the video set to obtain a plurality of segments to be processed that need segment quality evaluation, respectively construct the vector representations corresponding to the plurality of segments to be processed, and respectively perform segment quality evaluation on the plurality of segments to be processed based on these vector representations to obtain the segment quality evaluation result corresponding to each segment to be processed. Based on the segment quality evaluation results corresponding to the plurality of segments to be processed, the server then selects a plurality of candidate video segments with better segment quality from the plurality of segments to be processed.
In a specific application, the segment quality evaluation result may be a segment quality evaluation score, and the server may select multiple candidate video segments with better segment quality based on the segment quality evaluation score. In a specific application, the server may directly use the to-be-processed segment with the segment quality evaluation score greater than the preconfigured score threshold as a candidate video segment with better segment quality. The pre-configured score threshold value can be configured according to an actual application scene.
In this embodiment, by performing video segment quality analysis on a video set, multiple candidate video segments with good segment quality can be selected, so that the generation of the abstract video segments can be realized by using the multiple candidate video segments with good segment quality.
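A trivial sketch of the threshold-based selection described above; the threshold value is a placeholder, since the document only states that it is configured per application scenario.

```python
def select_candidate_segments(segments_to_process, quality_scores, score_threshold=0.8):
    """Keep the segments whose quality evaluation score exceeds the preconfigured threshold."""
    return [segment for segment, score in zip(segments_to_process, quality_scores)
            if score > score_threshold]


candidates = select_candidate_segments(["seg_a", "seg_b", "seg_c"], [0.91, 0.42, 0.85])  # ["seg_a", "seg_c"]
```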
In one embodiment, performing segment disassembly on the video set to obtain the plurality of segments to be processed includes:
segment disassembly is carried out on the video set to obtain a plurality of disassembled segments;
extracting video frames from the multiple disassembled segments respectively to obtain multi-frame disassembled video frames corresponding to each disassembled segment;
for each target disassembled segment in the plurality of disassembled segments, performing segment interception on the target disassembled segment based on the multi-frame disassembled video frames corresponding to the target disassembled segment, to obtain an intercepted segment corresponding to the target disassembled segment;
and determining a plurality of fragments to be processed from the respective intercepted fragments of the plurality of disassembled fragments.
Specifically, the server may respectively disassemble the videos in the video set to obtain a plurality of disassembled segments, and respectively extract video frames from the plurality of disassembled segments to obtain the multi-frame disassembled video frames corresponding to each disassembled segment. For each target disassembled segment in the plurality of disassembled segments, the server determines a video frame for segment interception from the multi-frame disassembled video frames corresponding to the target disassembled segment, intercepts the target disassembled segment at the video frame for segment interception to obtain an intercepted segment corresponding to the target disassembled segment, and determines a plurality of segments to be processed from the intercepted segments corresponding to the respective disassembled segments according to a preconfigured intercepted segment screening mode.
In a specific application, the server may perform segment disassembly on the videos in the video set according to a preconfigured disassembled segment duration. The duration of the disassembled segments can be configured according to the actual application scenario. For example, 3 seconds is typically enough to capture a complete shot segment, so the disassembled segment duration can be configured to be 3 seconds. The video frame extraction may be performed by uniform frame extraction, i.e. frame extraction at equal time intervals, and the number of video frames extracted from each disassembled segment may be the same.
In a specific application, after obtaining respective cut-out fragments of the plurality of disassembled fragments, the server determines a plurality of fragments to be processed from respective cut-out fragments of the plurality of disassembled fragments according to a preconfigured cut-out fragment screening mode, because the cut-out fragments may be too short and have no effective content. The preconfigured intercepting segment screening mode can be configured according to an actual application scene. For example, the cut-out section screening method may be to use the cut-out section with the section time length greater than the time length threshold as the section to be processed. The duration threshold may be configured according to an actual application scenario.
It should be noted that, the video frame used for segment interception refers to a video frame with a large segment background variation in the target disassembled segment, if no video frame with a large segment background variation exists in the target disassembled segment, the segment interception of the target disassembled segment is not needed, and the target disassembled segment can be directly used as the segment to be processed.
In this embodiment, for each target disassembled segment in the multiple disassembled segments, analysis on the target disassembled segment can be implemented by using multiple frames of disassembled video frames corresponding to the target disassembled segment, and segment interception is performed on the target disassembled segment to obtain an intercepted segment corresponding to the target disassembled segment, so that multiple segments to be processed can be determined from the intercepted segments corresponding to the multiple disassembled segments.
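A sketch of the disassembly and uniform frame extraction steps; the 3-second segment duration and the 1-second minimum come from the examples in this document, while the 8-frame sampling count and the function names are assumptions.

```python
def disassemble(video_duration, segment_duration=3.0):
    """Split a video into consecutive fixed-duration (start, end) segments, in seconds."""
    segments, start = [], 0.0
    while start < video_duration:
        segments.append((start, min(start + segment_duration, video_duration)))
        start += segment_duration
    return segments


def uniform_frame_times(start, end, num_frames=8):
    """Pick evenly spaced timestamps inside a segment (uniform, equal-time-interval frame extraction)."""
    step = (end - start) / num_frames
    return [start + step * (i + 0.5) for i in range(num_frames)]


def keep_long_enough(segments, min_duration=1.0):
    """Example screening mode: discard intercepted segments shorter than a duration threshold."""
    return [(s, e) for s, e in segments if e - s >= min_duration]
```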
In one embodiment, based on a multi-frame disassembled video frame corresponding to a target disassembled segment, performing segment interception on the target disassembled segment to obtain an intercepted segment corresponding to the target disassembled segment includes:
constructing respective vector representations of multi-frame disassembled video frames corresponding to the target disassembled segments;
determining video frames for segment interception from the multiple frames of disassembled video frames based on respective vector representations of the multiple frames of disassembled video frames;
and taking the video frame for segment interception as a segment interception node, and intercepting the segment of the target disassembled segment to obtain the intercepted segment corresponding to the target disassembled segment.
Wherein the corresponding vector representation of the disassembled video frame refers to a vector used to characterize the disassembled video frame.
Specifically, the server constructs the respective vector representations of the multi-frame disassembled video frames corresponding to the target disassembled segment, selects multi-frame reference video frames from the multi-frame disassembled video frames, calculates a reference vector representation for segment interception based on the respective vector representations of the multi-frame reference video frames, determines the video frame for segment interception from the multi-frame disassembled video frames by using the reference vector representation, and, taking the video frame for segment interception as the segment interception node, intercepts the target disassembled segment to obtain the intercepted segment corresponding to the target disassembled segment. The reference vector representation refers to the vector representation, obtained by combining the vector representations corresponding to the multi-frame reference video frames, that is used as the reference for segment interception.
In a specific application, the server can construct respective vector representations of the multi-frame disassembled video frames corresponding to the target disassembled segment through a pre-trained video frame depth representation model. In this embodiment, the pre-trained video frame depth representation model may be configured according to the actual application scenario. For example, the pre-trained video frame depth representation model may specifically be a ViT model.
In a specific application, after determining the segment interception node, the server intercepts the target disassembled segment based on the position of the segment interception node in the target disassembled segment to obtain the intercepted segment corresponding to the target disassembled segment. In a specific application, the target disassembled segment can be divided into two parts in temporal order: if the segment interception node is located in the first half of the target disassembled segment, the video frame for segment interception and the disassembled video frames temporally preceding it are removed from the target disassembled segment; if the segment interception node is located in the second half of the target disassembled segment, the video frame for segment interception and the disassembled video frames temporally following it are removed from the target disassembled segment.
In this embodiment, by constructing vector representations corresponding to respective frames of a multi-frame disassembled video frame corresponding to a target disassembled segment, a video frame for segment interception can be determined from the multi-frame disassembled video frame based on the vector representations corresponding to respective frames of the multi-frame disassembled video frame, and then the video frame for segment interception can be used as a segment interception node to realize segment interception of the target disassembled segment, and an intercepted segment corresponding to the target disassembled segment is obtained.
In one embodiment, determining video frames for segment truncation from the multi-frame disassembled video frames based on respective vector representations of the multi-frame disassembled video frames includes:
selecting a multi-frame reference video frame from the multi-frame disassembled video frames, and carrying out weighted average on vector representations corresponding to the multi-frame reference video frames to obtain reference vector representations corresponding to the target disassembled fragments and used for fragment interception;
respectively determining the similarity between the reference vector representation and the vector representation corresponding to each of the multiple frames of disassembled video frames;
and taking the disassembled video frame corresponding to a target vector representation as the video frame for segment interception; the similarity between the target vector representation and the reference vector representation is less than a similarity threshold.
Specifically, the server selects multi-frame reference video frames from the multi-frame disassembled video frames, performs weighted averaging on the vector representations corresponding to the multi-frame reference video frames to obtain the reference vector representation for segment interception corresponding to the target disassembled segment, respectively determines the similarity between the reference vector representation and the vector representations corresponding to the multi-frame disassembled video frames, determines from these vector representations a target vector representation whose similarity to the reference vector representation is smaller than the similarity threshold, and takes the disassembled video frame corresponding to the target vector representation as the video frame for segment interception. The number of selected reference video frames can be configured according to the actual application scenario. The weighting coefficients of the weighted average can also be configured according to the actual application scenario.
In a specific application, the server may determine the similarity between the reference vector representation and the respective vector representations of the multiple frames of disassembled video frames based on cosine similarity calculation, or may determine the similarity between the reference vector representation and the respective vector representations of the multiple frames of disassembled video frames based on distance calculation. In a specific application, for each of the multiple frames of disassembled video frames, the server calculates cosine similarity between the reference vector representation and the corresponding vector representation of the disassembled video frame, and then uses a difference between the preconfigured reference value and the cosine similarity as the similarity between the reference vector representation and the disassembled video frame. The pre-configured reference value can be configured according to an actual application scene.
In this embodiment, by selecting a plurality of frame reference video frames from the plurality of frame disassembly video frames and performing weighted average on respective vector representations of the plurality of frame reference video frames, a reference vector representation for segment interception corresponding to a target disassembly segment can be obtained, so that a video frame for segment interception can be selected from the plurality of frame disassembly video frames on the basis of respectively determining the similarity between the reference vector representation and the respective vector representations of the plurality of frame disassembly video frames.
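A sketch of the cut-point detection and interception described in the last few paragraphs, using an equal-weight average of a few middle frames as the reference vector and plain cosine similarity; the reference-frame choice, the threshold value, and the equal weights are simplifying assumptions rather than the embodiment's exact configuration.

```python
import numpy as np

def find_interception_frame(frame_vectors, num_reference=3, similarity_threshold=0.5):
    """Return the index of the first frame whose cosine similarity to the reference vector falls below
    the threshold (a large background change), or None if the segment background is stable."""
    mid = len(frame_vectors) // 2
    half = num_reference // 2
    reference = frame_vectors[max(0, mid - half): mid + half + 1].mean(axis=0)
    for i, vector in enumerate(frame_vectors):
        cosine = float(np.dot(vector, reference) /
                       (np.linalg.norm(vector) * np.linalg.norm(reference) + 1e-8))
        if cosine < similarity_threshold:
            return i                                    # frame used as the segment interception node
    return None


def intercept(frames, cut_index):
    """Drop the cut frame and the half of the segment on its side of the midpoint."""
    if cut_index is None:
        return frames                                   # no interception needed
    if cut_index < len(frames) / 2:
        return frames[cut_index + 1:]                   # remove the cut frame and everything before it
    return frames[:cut_index]                           # remove the cut frame and everything after it
```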
In one embodiment, performing segment quality evaluation on a plurality of to-be-processed segments, respectively, to obtain a corresponding segment quality evaluation result of each to-be-processed segment includes:
for each target to-be-processed segment in the plurality of to-be-processed segments, extracting video frames of the target to-be-processed segment to obtain multi-frame to-be-processed video frames corresponding to the target to-be-processed segment, and obtaining description contents of the target to-be-processed segment;
constructing respective vector representations of multi-frame to-be-processed video frames, and constructing vector representations of description contents of target to-be-processed fragments;
carrying out multimodal fusion on the vector representations corresponding to the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment to obtain the vector representation corresponding to the target to-be-processed segment;
And carrying out segment quality evaluation based on the vector representation corresponding to the target segment to be processed, and obtaining a segment quality evaluation result corresponding to the target segment to be processed.
The description content of the target to-be-processed segment refers to content used to describe the target to-be-processed segment. For example, the description content of the target to-be-processed segment may specifically refer to content extracted from the target to-be-processed segment, such as subtitles extracted from the target to-be-processed segment. Multimodal fusion refers to integrating information of different modalities into a stable multimodal representation; in this embodiment, it refers to integrating the vector representations corresponding to the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment into the vector representation corresponding to the target to-be-processed segment.
Specifically, for each target to-be-processed segment in the plurality of to-be-processed segments, the server performs video frame extraction on the target to-be-processed segment to obtain the multi-frame to-be-processed video frames corresponding to the target to-be-processed segment and obtains the description content of the target to-be-processed segment. The server constructs the respective vector representations of the multi-frame to-be-processed video frames through a pre-trained video frame depth representation model, constructs the vector representation of the description content of the target to-be-processed segment through a pre-trained text encoding model, performs multimodal fusion on the respective vector representations of the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment to obtain the vector representation corresponding to the target to-be-processed segment, and performs segment quality evaluation based on the vector representation corresponding to the target to-be-processed segment to obtain the segment quality evaluation result corresponding to the target to-be-processed segment.
In a specific application, the video frame extraction may be performed by uniform frame extraction, that is, frame extraction at equal time intervals, and the number of video frames extracted from each target to-be-processed segment may be the same. The pre-trained video frame depth representation model can be configured according to the actual application scenario. For example, the pre-trained video frame depth representation model may specifically be a ViT model. The pre-trained text encoding model can also be configured according to the actual application scenario. For example, the pre-trained text encoding model in this embodiment may be a text Transformer-Encoder model, i.e. the Encoder layers of the Transformer model.
In a specific application, the server may perform multimodal fusion on the vector representations corresponding to the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment based on the attention mechanism, so as to obtain the vector representation corresponding to the target to-be-processed segment. In this embodiment, by performing multimodal fusion using the attention mechanism, it is possible to find the relevance between the multi-frame to-be-processed video frames and the description content of the target to-be-processed segment based on their vector representations, so as to highlight some important features in their vector representations through interactive fusion. In a specific application, the server may perform the multimodal fusion on the respective vector representations of the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment through a pre-trained multimodal Transformer-Encoder model to obtain the vector representation corresponding to the target to-be-processed segment.
In a specific application, for each target to-be-processed segment in the plurality of to-be-processed segments, the segment quality evaluation result corresponding to the target to-be-processed segment may be obtained by a video segment quality evaluation model as shown in FIG. 13, where the video segment quality evaluation model includes a ViT model, a text Transformer-Encoder model, a multimodal Transformer-Encoder model, an average pooling layer, and a fully connected layer. The server extracts video frames from the target to-be-processed segment to obtain the multi-frame to-be-processed video frames corresponding to the target to-be-processed segment, and obtains the description content of the target to-be-processed segment. The respective vector representations of the multi-frame to-be-processed video frames are constructed through the ViT model (as shown in FIG. 13, the number of extracted to-be-processed video frames is N, and the constructed vector representations include a frame 1 vector representation, a frame 2 vector representation, ... and a frame N vector representation), and the vector representation of the description content of the target to-be-processed segment is constructed through the text Transformer-Encoder model (as shown in FIG. 13, the constructed vector representations include a word 1 vector representation, a word 2 vector representation, ... and a word N vector representation). The respective vector representations of the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment are then multimodally fused through the multimodal Transformer-Encoder model to obtain the vector representation corresponding to the target to-be-processed segment, the vector representation corresponding to the target to-be-processed segment is pooled through the average pooling layer, and the segment quality evaluation result corresponding to the target to-be-processed segment is obtained through the fully connected layer based on the pooled vector representation. When the multimodal fusion is performed, [SEP] is used to separate the respective vector representations of the multi-frame to-be-processed video frames from the vector representation of the description content of the target to-be-processed segment.
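A compact sketch of such a fusion-and-scoring head follows, assuming the per-frame vectors (e.g. from a ViT) and per-word vectors (from a text encoder) are already available and share the same dimension, and that the [SEP] separator can be represented by a learned token; everything here is illustrative rather than the patent's actual model.

```python
import torch
import torch.nn as nn

class VideoSegmentQualityModel(nn.Module):
    """Multimodal fusion of frame vectors and caption word vectors, average pooling, FC quality score."""

    def __init__(self, dim=768, heads=8, layers=2):
        super().__init__()
        self.sep_token = nn.Parameter(torch.zeros(1, 1, dim))          # stand-in for the [SEP] separator
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=layers)  # multimodal fusion
        self.fully_connected = nn.Linear(dim, 1)

    def forward(self, frame_vectors, word_vectors):
        # frame_vectors: (B, N, dim) from the frame encoder; word_vectors: (B, M, dim) from the text encoder
        sep = self.sep_token.expand(frame_vectors.size(0), -1, -1)
        tokens = torch.cat([frame_vectors, sep, word_vectors], dim=1)
        pooled = self.fusion(tokens).mean(dim=1)                       # average pooling layer
        return torch.sigmoid(self.fully_connected(pooled))             # probability of being a positive sample


score = VideoSegmentQualityModel()(torch.rand(1, 8, 768), torch.rand(1, 12, 768))
```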
In one specific application, the video segment quality evaluation model may be trained on the playing behavior data of objects browsing videos on a video platform. When training the video segment quality evaluation model, video data in the playing behavior data whose average playing completion degree is higher than a completion degree threshold and whose interaction rate is higher than an interaction rate threshold can be used as positive samples, and the other video data can be used as negative samples. The completion degree threshold and the interaction rate threshold can be configured according to the actual application scenario. The average playing completion degree can be calculated as: average playing completion degree = total duration for which the video was played / total number of times played / video duration. The interaction rate can be calculated as: interaction rate = number of comments / number of times the video was played. After the video segment quality evaluation model converges in training, the output probability that the target to-be-processed segment is a positive sample is taken as the segment quality evaluation result corresponding to the target to-be-processed segment.
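As a worked example of the label construction above (the two threshold values are placeholders, since the document only states that they are configured per application scenario):

```python
def is_positive_sample(total_play_seconds, play_count, video_seconds, comment_count,
                       completion_threshold=0.8, interaction_threshold=0.01):
    """Label a clip as a positive training sample if both behavioural statistics clear their thresholds."""
    average_completion = total_play_seconds / play_count / video_seconds   # total played / plays / duration
    interaction_rate = comment_count / play_count                          # comments / plays
    return average_completion > completion_threshold and interaction_rate > interaction_threshold


# e.g. a 60 s clip played 200 times for 11,000 s in total with 5 comments:
label = is_positive_sample(11000, 200, 60, 5)   # average completion ~0.92, interaction rate 0.025 -> True
```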
In this embodiment, for each target to-be-processed segment in the plurality of to-be-processed segments, the respective vector representations of the multi-frame to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment are multimodally fused, so that the vector representation of the target to-be-processed segment, which captures global information, can be obtained by combining the vector representations of different modalities, and segment quality evaluation can then be performed based on the vector representation corresponding to the target to-be-processed segment to obtain the segment quality evaluation result corresponding to the target to-be-processed segment.
In one embodiment, the summary video segment generating method further includes:
when the video to be played is required to be played, playing the abstract video clips and the video to be played according to the playing sequence of the abstract video clips and the video to be played, which are specified by the types of the abstract video clips.
Specifically, when the video to be played needs to be played, the server determines the type of the abstract video clip, and plays the abstract video clip and the video to be played according to the play sequence of the abstract video clip and the video to be played specified by the type of the abstract video clip. In a specific application, after determining the playing sequence, the server outputs a playing instruction to the terminal used by the target object according to the playing sequence, so that the terminal used by the target object plays the summary video clip and the video to be played according to the playing sequence.
In a specific application, the playing order of the summary video segment and the video to be played is different for different types of summary video segments. For example, when the summary video segment is a head segment, the playing order may be to play the summary video segment first and then play the video to be played. For example, for a film or television series, when the summary video segment is a tail segment and the video to be played is the current episode, the playing order may be to play the video to be played first and then play the summary video segment. For another example, for a film or television series, when the summary video segment is a tail segment and the video to be played is the next episode, the playing order may be to play the summary video segment first and then play the video to be played.
In this embodiment, when the video to be played needs to be played, the digest video clips and the video to be played are played according to the playing sequence of the digest video clips and the video to be played, which are specified by the types of the digest video clips, so that sequential playing of the digest video clips and the video to be played can be realized.
In one embodiment, the summary video clip generation method of the present application is described by taking clip head generation and clip tail generation for long-form TV series content as an example, where the summary video clip includes the clip head and the clip tail of the TV series, the target object may refer to a user browsing the TV series, and the selection bias information of the target object refers to the interest tags of the target object. For long-form TV content, the clip head and the clip tail are the video segments at the beginning and the end of an episode, typically the few minutes before the feature starts and the few minutes after it ends.
The inventor has recognized that the clip heads and clip tails currently provided on video platforms are fixed and monotonous; after watching them once or twice, users lose interest and simply skip them when following a series, which in turn reduces the utilization of the clip head and clip tail resources on the video platform.
In one embodiment, as shown in fig. 14, the construction of the personalized clip head and clip tail of the target object may be performed by the terminal or the server alone, or by the terminal and the server in cooperation. In the embodiment of the present application, the method is described as being performed by the server, and specifically includes the following steps:
1. video clip resource construction:
1.1, video clip disassembly:
A video platform has millions of long videos; in this embodiment a video set V (assumed to be the complete set of episodes of TV series 1) is taken as an example. Each episode of the video set V is disassembled into segments to construct a plurality of to-be-processed segments, which are then used in the following steps to construct the personalized clip head and clip tail of the target object for each episode. In this embodiment, two main constraints are placed on the to-be-processed segments:
1) Longest and shortest segment duration: in this embodiment, the longest segment duration is defined as tmax, for example 3 seconds; 3 seconds is typically enough to capture a complete shot. The shortest segment duration is defined as tmin, for example 1 second; video segments shorter than 1 second rarely contain substantive content. Each episode of the video set V is first sliced according to the longest segment duration to obtain a plurality of disassembled segments; those segments are then trimmed according to constraint 2) (video segment background change), and any trimmed segment shorter than the shortest segment duration is discarded and not used for the subsequent construction of personalized clip heads and clip tails.
2) Video segment background change: for each target disassembled segment among the plurality of disassembled segments obtained by disassembling the video set, frames are extracted uniformly (at equal time intervals) from the target disassembled segment to obtain K disassembled video frames, vector representations of the K disassembled video frames are constructed, the middle m frames of the K frames are selected as reference video frames, and the similarity between each of the other extracted frames and these m frames is computed.
In this embodiment, a ViT model is used to construct the vector representations of the K disassembled video frames; the ViT model may be as shown in fig. 7. For each of the K disassembled video frames, the pixel values of the frame are normalized, the frame is split into patch region blocks, the values of each region block are flattened into a one-dimensional vector, and the vectors are fed through a mapping layer into a Transformer-Encoder to construct the vector representation of the disassembled video frame.
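A minimal sketch of this ViT front end (normalize, patchify, flatten, linear mapping) is given below. The random projection stands in for the learned mapping layer, and mean-pooling or a Transformer-Encoder over the resulting tokens is left out; patch size and dimensions are illustrative assumptions.

```python
import numpy as np

def frame_to_patch_tokens(frame: np.ndarray, patch: int = 16, dim: int = 256, rng=None) -> np.ndarray:
    """Normalize an (H, W, C) frame, split it into patch x patch region blocks,
    flatten each block to a one-dimensional vector and map it to `dim` dimensions.
    A real model would feed the returned token sequence into a Transformer-Encoder."""
    rng = rng or np.random.default_rng(0)
    frame = frame.astype(np.float32) / 255.0                      # pixel normalization
    h, w, c = frame.shape
    h, w = h - h % patch, w - w % patch                           # drop ragged border
    blocks = (frame[:h, :w]
              .reshape(h // patch, patch, w // patch, patch, c)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, patch * patch * c))                    # one row per region block
    projection = rng.standard_normal((patch * patch * c, dim)).astype(np.float32)
    return blocks @ projection                                    # (num_patches, dim) token sequence
```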
The vector representations of the selected m reference video frames are averaged (with weights) to obtain a reference vector representation used for segment interception of the target disassembled segment, i.e. reference vector representation = avg(vector representations of the selected m reference video frames), where avg() is an averaging function. For every other frame f in the target disassembled segment, its similarity to the middle-m-frame representation is computed, where similarity = 1 − cos(vector representation of frame f, reference vector representation). If the similarity between a disassembled video frame in the first half of the target disassembled segment and the reference vector representation is below the similarity threshold, that frame and all frames preceding it in time are removed from the target disassembled segment. Likewise, if the similarity between a disassembled video frame in the second half of the target disassembled segment and the reference vector representation is below the similarity threshold, that frame and all frames following it in time are removed. If, after the background-unrelated parts are removed, the resulting intercepted segment of the target disassembled segment is shorter than the shortest segment duration, it is not used for the subsequent construction of clip heads and clip tails.
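The following sketch covers both constraints above: slicing an episode into at-most-tmax windows and trimming one disassembled segment by background change. The similarity formula is interpreted here as cosine similarity (i.e. 1 minus the cosine distance), an unweighted mean is used for the reference representation, and the threshold values are illustrative assumptions.

```python
import numpy as np

def split_by_max_duration(episode_duration: float, t_max: float = 3.0,
                          t_min: float = 1.0) -> list[tuple[float, float]]:
    """Constraint 1): cut an episode into consecutive (start, end) windows of at most
    t_max seconds; a leftover window shorter than t_min is discarded."""
    windows, start = [], 0.0
    while start < episode_duration:
        end = min(start + t_max, episode_duration)
        if end - start >= t_min:
            windows.append((start, end))
        start = end
    return windows

def trim_by_background_change(frame_vecs: np.ndarray, m: int = 3,
                              sim_threshold: float = 0.9) -> tuple[int, int]:
    """Constraint 2): given the (K, dim) vectors of the K uniformly sampled frames of
    one disassembled segment, average the middle m frames into a reference vector and
    return the (start, end) frame index range to keep. A first-half frame below the
    threshold is cut away with everything before it; symmetrically for the second half."""
    k = len(frame_vecs)
    mid = k // 2
    ref = frame_vecs[mid - m // 2: mid - m // 2 + m].mean(axis=0)

    def similarity(v: np.ndarray) -> float:
        # cosine similarity, i.e. 1 minus the cosine distance
        return float(v @ ref / (np.linalg.norm(v) * np.linalg.norm(ref) + 1e-8))

    start, end = 0, k
    for i in range(mid):                       # first half
        if similarity(frame_vecs[i]) < sim_threshold:
            start = i + 1
    for i in range(k - 1, mid - 1, -1):        # second half
        if similarity(frame_vecs[i]) < sim_threshold:
            end = i
    return start, end
```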
1.2, preliminary selection of video segments by quality:
Segment quality evaluation is performed on each of the to-be-processed segments constructed in step 1.1 to predict a quality score. To-be-processed segments whose quality score is below the quality score threshold are removed, which reduces the selection cost when the dynamic clip head and clip tail are constructed later; to-be-processed segments whose quality score is greater than or equal to the quality score threshold are retained as candidate video segments. A to-be-processed segment with a low quality score is one whose quality score is below the quality score threshold, and the threshold can be configured according to the actual application scenario. For each target to-be-processed segment among the plurality of to-be-processed segments, its segment quality evaluation result may be obtained through the video segment quality evaluation model shown in fig. 13.
2. Constructing the personalized clip head of the target object:
When the target object watches a certain episode of the video set V, a plurality of video segments for that episode are selected from the candidate video segments of the video set V constructed in step 1. They are generally selected from the previous episode, that is, the video range used for generating the clip head is the previous episode.
After the plurality of video segments for the episode are obtained, video segments matching the interests of the target object are selected from them; the interest of the target object in a video segment is estimated in 2.1 by estimating the playing completion degree of the target object on the video segment when it serves as part of a clip head. From the top r×w video segments with the highest estimated interest, w segments are sampled r times to synthesize r personalized clip heads, i.e. at least two synthesized video segments. The r personalized clip heads are scored in 2.2 and 2.3 respectively: for each synthesized personalized clip head, the playing completion degree of the target object on the clip head itself, and the playing completion degree on the video of this episode after watching the clip head, are computed.
A personalized clip head is then selected for which the estimated playing completion degree of the target object on the clip head is high and, at the same time, the playing completeness of the video of this episode after watching the clip head is high, so that the personalized clip head can deliver its real value. That is, the overall score of the i-th personalized clip head ri combines the estimated playing completion degree of the target object on the clip head ri computed in 2.2 with the estimated playing completion degree of the target object on the video of this episode after the clip head ri computed in 2.3. The personalized clip head with the highest overall score among the r constructed personalized clip heads is selected as the target personalized clip head, which serves as the clip head shown to the target object when watching the video of this episode.
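A minimal sketch of this selection step follows. The multiplicative combination of the two estimated completion degrees is an assumption for illustration; the disclosure only states that the two scores are combined into an overall score.

```python
def select_best_clip_head(candidates, p_watch_head, p_continue_episode):
    """Pick the candidate personalized clip head with the highest overall score.

    candidates:             list of r candidate clip heads (each a list of w video segments)
    p_watch_head[i]:        estimated completion degree on clip head i (step 2.2)
    p_continue_episode[i]:  estimated completion degree on this episode after clip head i (step 2.3)
    """
    scores = [p_watch_head[i] * p_continue_episode[i] for i in range(len(candidates))]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```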
2.1, probability that the target object watches a video segment completely as part of the clip head:
When the target object watches a certain episode of the video set V, a plurality of video segments are selected from the candidate video segments of the video set V constructed in step 1 to synthesize the personalized clip head of the target object. After the video to be played is determined, a plurality of video segments for the video to be played are obtained from the candidate video segments of the video set V, and for these segments the playing interest degree of the target object in each video segment is estimated through a playing interest degree calculation model; this playing interest degree determines the playing completion degree of the target object on the video segment when it serves as part of the clip head. The playing interest degree calculation model may be as shown in fig. 15 and includes a text Transformer-Encoder model, a video segment representation model for constructing the vector representation of a target video segment (comprising a plurality of ViT models, a text Transformer-Encoder model and a multi-modal Transformer-Encoder model), a Cross-Attention layer and a fully connected layer. For each target video segment among the plurality of video segments, the server extracts video frames from the target video segment to obtain multiple target video frames (N frames in fig. 15), constructs their vector representations through the ViT model (frame 1 vector representation, frame 2 vector representation, … frame N vector representation), obtains the description content of the target video segment (shown in fig. 15 as word 1, word 2, … word N), constructs the vector representation of the description content through the text Transformer-Encoder model (word 1 vector representation, word 2 vector representation, … word N vector representation), and performs multi-modal fusion of the frame vector representations and the description content vector representations through the multi-modal Transformer-Encoder model to obtain the vector representation of the target video segment. The interest tags of the target object (shown in fig. 15 as interest tag 1, interest tag 2, … interest tag N) are input into the text Transformer-Encoder to construct their vector representations (interest 1 vector representation, interest 2 vector representation, … interest N vector representation). The vector representation of the target video segment and the vector representations of the interest tags are then fused through the Cross-Attention layer, and the fused representation is passed through the fully connected layer to obtain the playing interest degree of the target object in the target video segment. During multi-modal fusion, a [SEP] token is used to separate the vector representations of the N target video frames from the vector representations of the description content of the target video segment.
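The sketch below illustrates the general shape of such a model (multi-modal fusion of frame and text vectors, cross-attention against interest-tag vectors, a fully connected head). It assumes the frame, word and interest-tag vectors are already produced by upstream encoders; the class name, layer sizes, mean-pooled clip representation and sigmoid output are all assumptions and do not reproduce the exact architecture of fig. 15.

```python
import torch
import torch.nn as nn

class ClipInterestModel(nn.Module):
    """Illustrative playing-interest model: fuse frame and description-token vectors with a
    multi-modal Transformer-Encoder, attend the fused clip representation over the target
    object's interest-tag vectors, and output an interest degree through a fully connected head."""

    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.multimodal_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.sep = nn.Parameter(torch.randn(1, 1, dim))          # [SEP]-style separator token
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frame_vecs, text_vecs, interest_vecs):
        # frame_vecs: (B, N_frames, dim), text_vecs: (B, N_words, dim),
        # interest_vecs: (B, N_tags, dim); all assumed to come from upstream encoders.
        sep = self.sep.expand(frame_vecs.size(0), -1, -1)
        fused = self.multimodal_encoder(torch.cat([frame_vecs, sep, text_vecs], dim=1))
        clip_repr = fused.mean(dim=1, keepdim=True)              # (B, 1, dim) clip representation
        attended, _ = self.cross_attn(clip_repr, interest_vecs, interest_vecs)
        return torch.sigmoid(self.head(attended.squeeze(1)))     # (B, 1) playing interest degree
```

The willingness and continued-play models of figs. 16 to 19 described below follow a similar hierarchical pattern, with segment-level representations fused by a multi-modal Transformer-Encoder before the cross-attention step.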
The playing interest degree calculation model is trained on data of objects watching clip heads on the video platform. During training, a clip head on the video platform is divided into a plurality of segments; if the playing completion degree of an object on a certain segment of the clip head exceeds a threshold, that segment is taken as a positive playing-completeness sample for the object, and the others as negative samples. After the model is trained, the description content and image features (i.e. video frame features) of a video segment and the selection bias information of the target object can be input, and the model outputs the playing interest degree (which can also be understood as the playing completeness) of the target object in the video segment.
The plurality of video segments in the previous episode of the video to be played, which the target object is currently watching, are sorted in descending order of interest degree. From the top r×w video segments in this ranking, w segments are sampled r times to synthesize r personalized clip heads. The purpose is to construct r dynamic clip head alternatives, from which a dynamic clip head better matching the personalized needs of the target object is selected through the comprehensive ranking of 2.2 and 2.3.
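A small sketch of this sampling step; the random seed and names are illustrative assumptions.

```python
import random

def sample_candidate_heads(clips_ranked, r: int, w: int, seed: int = 0):
    """Sample r candidate personalized clip heads, each made of w clips drawn without
    replacement from the top r*w clips after sorting by estimated interest (descending)."""
    pool = clips_ranked[: r * w]
    rng = random.Random(seed)
    return [rng.sample(pool, w) for _ in range(r)]
```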
2.2, probability that the target object completely plays a personalized clip head synthesized from a plurality of video segments:
The playing interest degree is calculated for each of the r candidate personalized clip heads. Multi-objective joint modeling may be performed through the model of the target object's willingness to play a synthesized clip head shown in fig. 16, which also computes the order of the w video segments within each synthesized personalized clip head. The model in fig. 16 is trained on the original clip heads and object playing behavior data on the video platform; a sample is positive if the playing completion degree of the object on the video clip head meets a certain threshold.
For each of the r personalized clip heads, i.e. a target synthesized video clip head, the model shown in fig. 16 first constructs the vector representation of each long video segment (i.e. synthesized sub-segment) contained in the personalized clip head (fig. 16 shows w long video segments, with video segment 1 vector representation, … video segment s vector representation, … video segment w vector representation); the vector representation of a long video segment can be constructed through the video segment representation model shown in fig. 6. After the vector representations of the long video segments are constructed, they are passed through a multi-modal Transformer-Encoder model to build a time-sequence-aware representation of the personalized clip head, i.e. the vector representation of the personalized clip head. This hierarchical modeling improves the modeling capability for the clip head. The vector representation of the personalized clip head is then fused with the vector representations of the interest tags of the target object through the Cross-Attention layer, and the fused representation is passed through the fully connected layer to obtain the playing interest degree (playing completeness) of the target object in the personalized clip head, as shown in fig. 16.
In addition, when a personalized clip head is synthesized from a plurality of video segments, if the playing order of the segments is determined only from the estimated interest of the target object in each segment's content, the synthesized clip head may lack an overall understanding of the content of the segments, which affects its playing effect. Therefore, as shown in fig. 16, after the multi-modal Transformer-Encoder model, the per-segment representations that capture global information of the whole clip head are jointly modeled with the interests of the target object; the playing interest degree of the target object in each segment after synthesis is re-estimated through the Cross-Attention layer and the fully connected layer (fig. 16 shows the interest degree of the target object in long video segment s), and the positions of the synthesized sub-segments within the clip head are re-ordered based on these interest degrees. It should be noted that, in this embodiment, only the finally determined personalized clip head may be re-ordered directly.
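A minimal sketch of the re-ordering step; sorting in descending order of the re-estimated interest is an assumption, as the disclosure only states that positions are re-ordered based on the interest degree.

```python
def reorder_sub_segments(sub_segments, re_estimated_interest):
    """Re-order the w synthesized sub-segments of a clip head by the interest degree
    re-estimated after joint modeling (higher interest first). In this embodiment only
    the finally determined personalized clip head needs to be re-ordered this way."""
    order = sorted(range(len(sub_segments)),
                   key=lambda i: re_estimated_interest[i], reverse=True)
    return [sub_segments[i] for i in order]
```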
2.3, probability that the current episode continues to be played after the target object watches the personalized clip head:
Specifically, after the target object watches the personalized clip head, the playing completeness of the current episode can be calculated through the continued-play probability prediction model shown in fig. 17, which jointly models the content of the current episode, the personalized clip head and the interests of the target object to predict the playing completeness of the current episode after the personalized clip head is watched. The continued-play probability prediction model is trained on the playing behavior of objects on the current episode after watching the clip head on the video site; a sample is positive if the playing completeness of the object on the current episode after watching the clip head meets a certain threshold.
For each personalized clip head, as shown in fig. 17, the vector representation of each long video segment (i.e. synthesized sub-segment) contained in the clip head is constructed (fig. 17 shows the video segment 1 vector representation, … video segment s vector representation, … video segment w vector representation produced by the video segment representation model); the vector representation of a long video segment can be constructed through the video segment representation model shown in fig. 6. The vector representations of the long video segments are then passed through a multi-modal Transformer-Encoder model to build a time-sequence-aware representation of the personalized clip head, i.e. its vector representation (comprising the video segment 1 vector representation, … video segment s vector representation, … video segment w vector representation as shown in fig. 17). This hierarchical modeling improves the modeling capability for the clip head. The vector representation of the personalized clip head and the vector representations of the interest tags of the target object (fig. 17 shows interest 1 vector representation, interest 2 vector representation, … interest N vector representation) are fused through a Cross-Attention layer to obtain the fused vector representation of the personalized clip head. The vector representations of the interest tags are likewise fused with the vector representation of the current episode through a Cross-Attention layer (as shown in fig. 17, the vector representation of the current episode can be constructed from the L video segments contained in the current episode, whose vector representations — video segment 1, video segment 2, … video segment L — are passed through a multi-modal Transformer-Encoder model), yielding the fused vector representation of the current episode. The fused vector representation of the personalized clip head and the fused vector representation of the current episode are concatenated to obtain the target fused vector representation of the current episode, which is input into a fully connected layer to obtain the playing interest degree (playing completion degree) of the target object in the video of the current episode after the dynamic personalized clip head.
After steps 2.2 and 2.3, the personalized clip head with the highest overall score (combining the estimated playing completion degree of the target object on the personalized clip head ri computed in 2.2 with the estimated playing completion degree of the target object on the video of this episode after the clip head ri computed in 2.3) is taken as the clip head of the target object when watching the video of this episode.
3. Constructing the personalized clip tail of the target object:
Similar to the dynamic clip head construction, the construction of the personalized clip tail of the target object is divided into three sub-steps. After the target object watches the feature content of a certain episode of the video set V, a plurality of video segments are selected from the candidate video segments of the video set V constructed in step 1 to synthesize the personalized clip tail of the target object. The video segments for the personalized clip tail are generally selected from the episode currently played by the target object, the previous episode and the next episode, i.e. the video range is the current, previous and next episodes.
The selected video segments need to match the interests of the target object; the interest of the target object in a video segment can be estimated in 3.1 by estimating the playing completion degree of the target object on the video segment when it serves as part of a clip tail. From the top r×w video segments with the highest estimated interest, w segments are sampled r times to synthesize r personalized clip tails, i.e. at least two synthesized video segments. The r personalized clip tails are scored in 3.2 and 3.3 respectively: for each synthesized personalized clip tail, the playing completion degree of the target object on the clip tail, and the playing completion degree on the next episode after watching the clip tail, are computed.
A personalized clip tail is selected for which the estimated playing completion degree of the target object on the clip tail is high and, at the same time, the playing completeness of the next episode after watching the clip tail is high, so that the personalized clip tail can deliver its real value. That is, the overall score of the i-th personalized clip tail ri combines the estimated playing completion degree of the target object on the clip tail ri computed in 3.2 with the estimated playing completion degree of the target object on the next episode after the clip tail ri computed in 3.3. The personalized clip tail with the highest overall score among the r personalized clip tails is selected.
3.1, probability that the target object watches a video segment completely as part of the clip tail:
In the same way as step 2.1, when the target object is about to watch the feature content of a certain episode of the video set V, a plurality of video segments are selected from the candidate video segments of the video set V constructed in step 1 to synthesize the personalized clip tail of the target object: the playing interest degree of the target object in each video segment is estimated through the playing interest degree calculation model shown in fig. 15, and the video segments used to synthesize the personalized clip tail are selected based on this playing interest degree.
3.2, probability that the target object completely plays a personalized clip tail synthesized from a plurality of video segments:
Similar to step 2.2, the playing interest degree of the target object in the synthesized personalized clip tail can be calculated through the model of the target object's willingness to play a dynamic clip tail shown in fig. 18, which also computes the position of each video segment within the personalized clip tail. It should be noted that, in this embodiment, only the finally determined personalized clip tail may be re-ordered directly.
For each of the r personalized clip tails, i.e. a target synthesized video clip tail, the model shown in fig. 18 first constructs the vector representation of each long video segment (i.e. synthesized sub-segment) contained in the personalized clip tail (fig. 18 shows w long video segments, with video segment 1 vector representation, … video segment s vector representation, … video segment w vector representation); the vector representation of a long video segment can be constructed through the video segment representation model shown in fig. 6. After the vector representations of the long video segments are constructed, they are passed through a multi-modal Transformer-Encoder model to build a time-sequence-aware representation of the personalized clip tail, i.e. the vector representation of the personalized clip tail. This hierarchical modeling improves the modeling capability for the clip tail. The vector representation of the personalized clip tail is then fused with the vector representations of the interest tags of the target object through the Cross-Attention layer, and the fused representation is passed through the fully connected layer to obtain the playing interest degree (playing completeness) of the target object in the personalized clip tail, as shown in fig. 18.
In addition, when a personalized clip tail is synthesized from a plurality of video segments, if the playing order of the segments is determined only from the estimated interest of the target object in each segment's content, the synthesized clip tail may lack an overall understanding of the content of the segments, which affects its playing effect. Therefore, as shown in fig. 18, after the multi-modal Transformer-Encoder model, the per-segment representations that capture global information of the whole clip tail are jointly modeled with the interests of the target object; the playing interest degree of the target object in each segment after synthesis is re-estimated through the Cross-Attention layer and the fully connected layer (fig. 18 shows the interest degree of the target object in long video segment s), and the positions of the synthesized sub-segments within the clip tail are re-ordered based on these interest degrees. It should be noted that, in this embodiment, only the finally determined personalized clip tail may be re-ordered directly.
3.3, probability that the next episode continues to be played after the target object watches the personalized clip tail:
Similar to step 2.3, the playing interest degree of the target object in the next episode after watching the synthesized personalized clip tail can be calculated through the model of the target object's willingness to play the next episode after watching the clip tail, shown in fig. 19.
For each personalized clip tail, as shown in fig. 19, the vector representation of each long video segment (i.e. synthesized sub-segment) contained in the clip tail is constructed (fig. 19 shows the video segment 1 vector representation, … video segment s vector representation, … video segment w vector representation produced by the video segment representation model); the vector representation of a long video segment can be constructed through the video segment representation model shown in fig. 6. The vector representations of the long video segments are then passed through a multi-modal Transformer-Encoder model to build a time-sequence-aware representation of the personalized clip tail, i.e. its vector representation (comprising the video segment 1 vector representation, … video segment s vector representation, … video segment w vector representation as shown in fig. 19). This hierarchical modeling improves the modeling capability for the clip tail. The vector representation of the personalized clip tail and the vector representations of the interest tags of the target object (fig. 19 shows interest 1 vector representation, interest 2 vector representation, … interest N vector representation) are fused through a Cross-Attention layer to obtain the fused vector representation of the personalized clip tail. The vector representations of the interest tags are likewise fused with the vector representation of the next episode through a Cross-Attention layer (as shown in fig. 19, the vector representation of the next episode can be constructed from the L video segments contained in the next episode, whose vector representations — video segment 1, video segment 2, … video segment L — are passed through a multi-modal Transformer-Encoder model), yielding the fused vector representation of the next episode. The fused vector representation of the personalized clip tail and the fused vector representation of the next episode are concatenated to obtain the target fused vector representation of the next episode, which is input into a fully connected layer to obtain the playing interest degree (playing completion degree) of the target object in the next episode after the dynamic personalized clip tail.
After steps 3.2 and 3.3, the personalized clip tail with the highest overall score (combining the estimated playing completion degree of the target object on the personalized clip tail ri computed in 3.2 with the estimated playing completion degree of the target object on the next episode after the clip tail ri computed in 3.3) is taken as the clip tail of the target object when watching the video.
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with at least some of the other steps or their sub-steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a digest video clip generation device for realizing the digest video clip generation method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitation in the embodiments of the apparatus for generating a summarized video segment provided below may be referred to the limitation of the method for generating a summarized video segment hereinabove, which is not repeated herein.
In one embodiment, as shown in fig. 20, there is provided a summarized video clip generating apparatus, comprising: a video clip acquisition module 2002, a video clip synthesis module 2004, a video clip evaluation module 2006, and a video clip selection module 2008, wherein:
the video clip obtaining module 2002 is configured to obtain, when a video to be played of a target object exists, a plurality of video clips for the video to be played;
a video segment synthesis module 2004, configured to determine a video segment set that matches the selection bias information of the target object according to the plurality of video segments, and perform video segment synthesis based on the video segment set, to obtain at least two synthesized video segments;
The video clip evaluation module 2006 is configured to evaluate the playing completion degree of at least two synthesized video clips based on the selection bias information and the video to be played, so as to obtain a corresponding evaluation result of each synthesized video clip;
the video clip selection module 2008 is configured to determine a summary video clip of the target object for the video to be played from the at least two synthesized video clips based on respective evaluation results of the at least two synthesized video clips.
According to the summary video segment generation device, under the condition that the video to be played of the target object exists, a plurality of video segments of the video to be played are obtained, a video segment set matched with selection bias information of the target object is determined according to the plurality of video segments, selection of the video segment set can be achieved by combining the selection bias information, video segment synthesis is conducted on the basis of the video segment set, at least two synthesized video segments are obtained, and acquisition of candidate summary video segments can be achieved.
In one embodiment, the video segment synthesis module is further configured to construct respective vector representations of the plurality of video segments, construct a vector representation of selection bias information of the target object, fuse the respective vector representations of the plurality of video segments with the vector representation of the selection bias information, obtain respective fused vector representations of the plurality of video segments, and determine a video segment set matching the selection bias information of the target object based on the respective fused vector representations of the plurality of video segments.
In one embodiment, the video segment synthesis module is further configured to extract, for each target video segment of the plurality of video segments, a video frame of the target video segment, obtain a multi-frame target video frame corresponding to the target video segment, construct respective vector representations of the multi-frame target video frame, and construct a vector representation of a description content of the target video segment, and perform multi-mode fusion on the respective vector representations of the multi-frame target video frame and the vector representation of the description content of the target video segment, to obtain the respective vector representation of the target video segment.
In one embodiment, the video segment synthesis module is further configured to segment, for each target video frame in the multi-frame target video frames, the target video frame to obtain a plurality of video frame region blocks, respectively perform vector conversion on the plurality of video frame region blocks to obtain a region block vector corresponding to each video frame region block, and encode based on the region block vectors corresponding to the plurality of video frame region blocks to obtain a vector representation of the target video frame.
In one embodiment, the video segment evaluation module is further configured to obtain a vector representation of selection bias information, construct a first vector representation of the video to be played, construct a second vector representation of the target composite video segment for each of the at least two composite video segments, and evaluate a playing completion degree of the target composite video segment based on the vector representation of the selection bias information, the first vector representation and the second vector representation to obtain a corresponding evaluation result of the target composite video segment.
In one embodiment, the video segment evaluation module is further configured to select a plurality of video segments to be played from the video to be played, respectively construct respective vector representations of the plurality of video segments to be played, and fuse the respective vector representations of the plurality of video segments to be played to obtain a first vector representation of the video to be played.
In one embodiment, the video segment evaluation module is further configured to construct respective vector representations of the plurality of synthesized sub-segments, and fuse the respective vector representations of the plurality of synthesized sub-segments to obtain a second vector representation of the target synthesized video segment.
In one embodiment, the video segment evaluation module is further configured to perform a complete play probability evaluation on the target synthesized video segment based on the vector representation and the second vector representation of the selection bias information, obtain a complete play probability of the target synthesized video segment, perform a continuous play probability evaluation on the video to be played based on the vector representation, the first vector representation and the second vector representation of the selection bias information, obtain a continuous play probability of the video to be played, and combine the complete play probability and the continuous play probability to obtain a corresponding evaluation result of the target synthesized video segment.
In one embodiment, the video segment evaluation module is further configured to fuse the vector representation of the selection bias information with the second vector representation to obtain a fused vector representation of the target composite video segment, and perform a complete play probability evaluation based on the fused vector representation of the target composite video segment to obtain a complete play probability of the target composite video segment.
In one embodiment, the video segment evaluation module is further configured to fuse the vector representation of the selection bias information with the first vector representation to obtain a fused vector representation of the video to be played, fuse the vector representation of the selection bias information with the second vector representation to obtain a fused vector representation of the target composite video segment, fuse the fused vector representation of the video to be played with the fused vector representation of the target composite video segment to obtain a target fused vector representation of the video to be played, and evaluate the probability of continuing to play based on the target fused vector representation of the video to be played to obtain the probability of continuing to play the video to be played.
In one embodiment, the summarized video clip comprises a plurality of summarized sub-clips; the video segment selection module is further configured to obtain a vector representation of the summarized video segment and a vector representation of the selection bias information, the vector representation of the summarized video segment includes vector representations corresponding to each of the summarized sub-segments, fuse the vector representations corresponding to each of the summarized sub-segments with the vector representation of the selection bias information to obtain fused vector representations corresponding to each of the summarized sub-segments, sort the segments based on the fused vector representations corresponding to each of the summarized sub-segments to obtain a segment sorting result, and adjust the order of the summarized sub-segments in the summarized video segment according to the segment sorting result to obtain the target summarized video segment.
In one embodiment, the video clip obtaining module is further configured to determine a video set to which the video to be played belongs and a position of the video to be played in the video set, determine a video range for generating the summary video clip in the video set based on the position of the video to be played in the video set, and select a plurality of video clips of the video range from a plurality of candidate video clips of the pre-generated video set.
In one embodiment, the plurality of candidate video segments are obtained by performing video segment quality analysis on the video set, the video segment obtaining module is further configured to perform segment disassembly on the video set to obtain a plurality of to-be-processed segments, perform segment quality evaluation on the plurality of to-be-processed segments respectively, obtain a segment quality evaluation result corresponding to each to-be-processed segment, and determine a plurality of candidate video segments from the plurality of to-be-processed segments based on the segment quality evaluation results corresponding to the plurality of to-be-processed segments respectively.
In one embodiment, the video segment obtaining module is further configured to segment-disassemble the video set to obtain a plurality of disassembled segments, extract video frames of the disassembled segments to obtain a plurality of frame-disassembled video frames corresponding to each disassembled segment, segment-intercept the target disassembled segment based on the frame-disassembled video frames corresponding to the target disassembled segment for each target disassembled segment of the disassembled segments to obtain an intercepted segment corresponding to the target disassembled segment, and determine a plurality of segments to be processed from the intercepted segments corresponding to each of the disassembled segments.
In one embodiment, the video segment obtaining module is further configured to construct respective vector representations of multiple frames of disassembled video frames corresponding to the target disassembled segment, determine a video frame for segment interception from the multiple frames of disassembled video frames based on the respective vector representations of the multiple frames of disassembled video frames, take the video frame for segment interception as a segment interception node, and perform segment interception on the target disassembled segment to obtain an intercepted segment corresponding to the target disassembled segment.
In one embodiment, the video segment obtaining module is further configured to select multiple reference video frames from the multiple disassembled video frames, perform a weighted average on the vector representations of the reference video frames to obtain a reference vector representation used for segment interception of the target disassembled segment, determine the similarity between the reference vector representation and the vector representation of each disassembled video frame, and take the disassembled video frames corresponding to target vector representations as the video frames used for segment interception, where a target vector representation is one whose similarity to the reference vector representation is less than a similarity threshold.
In one embodiment, the video segment obtaining module is further configured to perform video frame extraction on each target to-be-processed segment in the multiple to-be-processed segments, obtain multiple frames of to-be-processed video frames corresponding to the target to-be-processed segments, obtain description contents of the target to-be-processed segments, construct respective vector representations of the multiple frames of to-be-processed video frames, construct vector representations of the description contents of the target to-be-processed segments, perform multi-mode fusion on the respective vector representations of the multiple frames of to-be-processed video frames and the vector representations of the description contents of the target to-be-processed segments, obtain the respective vector representations of the target to-be-processed segments, and perform segment quality evaluation based on the respective vector representations of the target to-be-processed segments, so as to obtain segment quality evaluation results corresponding to the target to-be-processed segments.
In one embodiment, the summary video segment generating device further includes a video segment playing module, where the video segment playing module is configured to play the summary video segment and the video to be played according to a play order of the summary video segment and the video to be played, which are specified by a type of the summary video segment, when the video to be played is required to be played.
The respective modules in the above-described digest video clip generation apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device, which may be a server or a terminal, is provided, and in this embodiment, an example in which the computer device is a server is described, and an internal structure thereof may be as shown in fig. 21. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing candidate video clips and the like. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of summary video clip generation.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 21 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the object information (including, but not limited to, selection bias information and the like) and the data (including, but not limited to, data for analysis, stored data, presented data and the like) related to the present application are information and data authorized by the object or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (20)

1. A method for generating a summarized video clip, the method comprising:
when a video to be played of a target object exists, acquiring a plurality of video clips aiming at the video to be played;
determining a video segment set matched with the selection bias information of the target object according to the plurality of video segments, and synthesizing video segments based on the video segment set to obtain at least two synthesized video segments;
Based on the selection bias information and the video to be played, respectively carrying out playing completion degree evaluation on the at least two synthesized video clips to obtain a corresponding evaluation result of each synthesized video clip;
and determining the summary video segment of the target object for the video to be played from the at least two synthesized video segments based on the respective evaluation results of the at least two synthesized video segments.
2. The method of claim 1, wherein determining a set of video segments that match selection bias information for the target object from the plurality of video segments comprises:
respectively constructing vector representations corresponding to the video clips, and constructing vector representations of selection bias information of the target object;
fusing the vector representations corresponding to the video clips with the vector representations of the selection bias information respectively to obtain fused vector representations corresponding to the video clips;
and determining a video segment set matched with the selection bias information of the target object based on the respective fusion vector representations of the video segments.
3. The method of claim 2, wherein constructing respective vector representations of each of the plurality of video segments comprises:
extracting video frames of each target video segment in the plurality of video segments to obtain multi-frame target video frames corresponding to the target video segments;
constructing respective vector representations of the multi-frame target video frames, and constructing vector representations of descriptive contents of the target video segments;
and carrying out multi-mode fusion on the vector representations corresponding to the multi-frame target video frames and the vector representations of the descriptive contents of the target video segments to obtain the vector representations corresponding to the target video segments.
4. The method of claim 1, wherein the performing playback completion evaluation on the at least two synthesized video clips based on the selection bias information and the video to be played respectively, and obtaining a corresponding evaluation result for each synthesized video clip comprises:
acquiring vector representation of the selection bias information, and constructing a first vector representation of the video to be played;
constructing a second vector representation of each of the at least two synthesized video segments for the target synthesized video segment;
And carrying out playing completion degree evaluation on the target synthesized video segment based on the vector representation of the selection bias information, the first vector representation and the second vector representation to obtain a corresponding evaluation result of the target synthesized video segment.
5. The method of claim 4, wherein constructing the first vector representation of the video to be played comprises:
selecting a plurality of video clips to be played from the video to be played, and respectively constructing respective vector representations of the video clips to be played;
and fusing the vector representations corresponding to the video clips to be played respectively to obtain a first vector representation of the video to be played.
6. The method of claim 4, wherein the target composite video clip comprises a plurality of composite sub-clips; the constructing a second vector representation of the target composite video segment includes:
respectively constructing respective vector representations of the plurality of synthesis subfragments;
and fusing the vector representations corresponding to the synthesis sub-segments respectively to obtain a second vector representation of the target synthesis video segment.
7. The method of claim 4, wherein performing playback completion evaluation on the target synthesized video segment based on the vector representation of the selection bias information, the first vector representation and the second vector representation, and obtaining the corresponding evaluation result of the target synthesized video segment, comprises:
performing complete play probability evaluation on the target synthesized video segment based on the vector representation of the selection bias information and the second vector representation to obtain a complete play probability of the target synthesized video segment;
performing continued play probability evaluation on the video to be played based on the vector representation of the selection bias information, the first vector representation and the second vector representation to obtain a continued play probability of the video to be played;
and combining the complete play probability and the continued play probability to obtain the corresponding evaluation result of the target synthesized video segment.
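Claim 7 only requires that the two probabilities be combined into an evaluation result; a weighted sum is one hedged reading (the linear form and the default weight are assumptions):

```python
def evaluation_result(p_complete: float, p_continue: float, weight: float = 0.5) -> float:
    # Weighted combination of the two probabilities; the linear form and the
    # default weight are assumptions, not specified by the claim.
    return weight * p_complete + (1.0 - weight) * p_continue
```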
8. The method of claim 7, wherein performing complete play probability evaluation on the target synthesized video segment based on the vector representation of the selection bias information and the second vector representation to obtain the complete play probability of the target synthesized video segment comprises:
fusing the vector representation of the selection bias information and the second vector representation to obtain a fused vector representation of the target synthesized video segment;
and performing complete play probability evaluation based on the fused vector representation of the target synthesized video segment to obtain the complete play probability of the target synthesized video segment.
9. The method of claim 7, wherein performing continued play probability evaluation on the video to be played based on the vector representation of the selection bias information, the first vector representation and the second vector representation to obtain the continued play probability of the video to be played comprises:
fusing the vector representation of the selection bias information and the first vector representation to obtain a fused vector representation of the video to be played;
fusing the vector representation of the selection bias information and the second vector representation to obtain a fused vector representation of the target synthesized video segment;
fusing the fused vector representation of the video to be played and the fused vector representation of the target synthesized video segment to obtain a target fused vector representation of the video to be played;
and performing continued play probability evaluation based on the target fused vector representation of the video to be played to obtain the continued play probability of the video to be played.
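A sketch of how the fusion chains in claims 8 and 9 could be wired together, with an element-wise product as the assumed fusion operator and a sigmoid over the mean activation as a placeholder probability head (a trained model would normally sit here):

```python
import numpy as np

def _fuse(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Element-wise product as the assumed fusion operator.
    return a * b

def _prob_head(vec: np.ndarray) -> float:
    # Placeholder head: squash the mean activation into (0, 1).
    return float(1.0 / (1.0 + np.exp(-vec.mean())))

def complete_play_probability(preference_vec, second_vec):
    # Claim 8: fuse the preference vector with the synthesized-segment vector, then score.
    return _prob_head(_fuse(preference_vec, second_vec))

def continued_play_probability(preference_vec, first_vec, second_vec):
    # Claim 9: fuse the preference vector with the video and with the synthesized
    # segment separately, fuse the two results, then score the target fusion.
    fused_video = _fuse(preference_vec, first_vec)
    fused_segment = _fuse(preference_vec, second_vec)
    target_fusion = _fuse(fused_video, fused_segment)
    return _prob_head(target_fusion)
```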
10. The method of claim 1, wherein the summary video segment comprises a plurality of summary sub-segments, and after determining, from the at least two synthesized video segments, the summary video segment of the target object for the video to be played based on the respective evaluation results of the at least two synthesized video segments, the method further comprises:
acquiring a vector representation of the summary video segment and the vector representation of the selection bias information, the vector representation of the summary video segment comprising a vector representation of each of the plurality of summary sub-segments;
fusing the vector representation of each summary sub-segment with the vector representation of the selection bias information to obtain a fused vector representation for each summary sub-segment;
performing segment sorting based on the fused vector representations of the summary sub-segments to obtain a segment sorting result;
and adjusting, according to the segment sorting result, the order of the plurality of summary sub-segments in the summary video segment to obtain a target summary video segment.
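The reordering in claim 10 can be pictured as scoring each summary sub-segment from its fused vector and sorting by that score; the element-wise fusion and the sum-based score below are assumptions:

```python
import numpy as np

def reorder_sub_segments(sub_segment_vecs, preference_vec):
    # Fuse each summary sub-segment with the preference vector, score the fused
    # vector, and return sub-segment indices sorted by descending score.
    fused = [v * preference_vec for v in sub_segment_vecs]
    scores = [float(f.sum()) for f in fused]
    return sorted(range(len(fused)), key=lambda i: scores[i], reverse=True)
```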
11. The method of claim 1, wherein obtaining the plurality of video segments for the video to be played comprises:
determining a video collection to which the video to be played belongs and a position of the video to be played in the video collection;
determining, based on the position of the video to be played in the video collection, a video range in the video collection for generating the summary video segment;
and selecting the plurality of video segments within the video range from a plurality of candidate video segments pre-generated for the video collection.
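A hypothetical reading of the range selection in claim 11: if the video collection is episodic, the range could be capped at the episode currently being played so the summary only draws on earlier material. This interpretation, the data layout and the function name are assumptions:

```python
def segments_in_range(candidates, current_episode):
    # candidates: iterable of (episode_index, segment_id) pairs pre-generated
    # for the collection. Keep only segments from episodes at or before the
    # one being played -- a hypothetical reading of the "video range".
    return [seg_id for episode, seg_id in candidates if episode <= current_episode]
```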
12. The method of claim 11, wherein the plurality of candidate video segments are obtained by performing video segment quality analysis on the video collection, the video segment quality analysis on the video collection comprising:
performing segment disassembly on the video collection to obtain a plurality of to-be-processed segments;
performing segment quality evaluation on each of the to-be-processed segments to obtain a corresponding segment quality evaluation result for each to-be-processed segment;
and determining the candidate video segments from the to-be-processed segments based on the respective segment quality evaluation results of the to-be-processed segments.
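The candidate selection in claim 12 might, for example, keep only the to-be-processed segments whose quality score clears a threshold; the threshold value and the thresholding rule are assumptions (a top-k rule would satisfy the claim equally well):

```python
def candidate_segments(to_be_processed, quality_scores, threshold=0.6):
    # Keep the to-be-processed segments whose quality evaluation clears the
    # (assumed) threshold; each score is expected in [0, 1].
    return [seg for seg, q in zip(to_be_processed, quality_scores) if q >= threshold]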
13. The method of claim 12, wherein performing segment disassembly on the video collection to obtain the plurality of to-be-processed segments comprises:
performing segment disassembly on the video collection to obtain a plurality of disassembled segments;
extracting video frames from each of the disassembled segments to obtain a plurality of disassembled video frames corresponding to each disassembled segment;
for each target disassembled segment among the plurality of disassembled segments, performing segment truncation on the target disassembled segment based on the disassembled video frames corresponding to the target disassembled segment to obtain a truncated segment corresponding to the target disassembled segment;
and determining the plurality of to-be-processed segments from the truncated segments corresponding to the plurality of disassembled segments.
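Claim 13 describes a disassemble → extract frames → truncate pipeline; the sketch below wires those steps together with caller-supplied callables, since the claim does not fix how each step is implemented:

```python
def build_to_be_processed_segments(video_collection, disassemble, extract_frames, truncate):
    # disassemble(collection)  -> iterable of disassembled segments
    # extract_frames(segment)  -> list of video frames for that segment
    # truncate(segment, frames) -> truncated segment
    to_be_processed = []
    for segment in disassemble(video_collection):
        frames = extract_frames(segment)
        to_be_processed.append(truncate(segment, frames))
    return to_be_processed
```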
14. The method of claim 13, wherein performing segment truncation on the target disassembled segment based on the disassembled video frames corresponding to the target disassembled segment to obtain the truncated segment corresponding to the target disassembled segment comprises:
constructing a vector representation for each of the disassembled video frames corresponding to the target disassembled segment;
determining, based on the vector representations of the disassembled video frames, a video frame for segment truncation from the disassembled video frames;
and performing segment truncation on the target disassembled segment using the video frame for segment truncation as a truncation node, to obtain the truncated segment corresponding to the target disassembled segment.
15. The method of claim 14, wherein determining the video frame for segment truncation from the disassembled video frames based on the vector representations of the disassembled video frames comprises:
selecting a plurality of reference video frames from the disassembled video frames, and performing a weighted average over the vector representations of the reference video frames to obtain a reference vector representation, corresponding to the target disassembled segment, for segment truncation;
determining the similarity between the reference vector representation and the vector representation of each of the disassembled video frames;
and taking the disassembled video frame corresponding to a target vector representation as the video frame for segment truncation, wherein the similarity between the target vector representation and the reference vector representation is less than a similarity threshold.
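Claim 15 is concrete enough to sketch directly: average the reference-frame vectors into a reference vector, then treat frames whose similarity to it drops below a threshold as truncation nodes. Cosine similarity, uniform default weights and the threshold value are assumptions:

```python
import numpy as np

def truncation_frames(frame_vecs, reference_indices, weights=None, threshold=0.8):
    # Weighted average of the reference-frame vectors gives the reference vector;
    # frames whose cosine similarity to it falls below the threshold are flagged
    # as truncation nodes. Uniform weights are used when none are supplied.
    frames = np.stack(frame_vecs)
    reference = np.average(frames[reference_indices], axis=0, weights=weights)
    sims = frames @ reference / (
        np.linalg.norm(frames, axis=1) * np.linalg.norm(reference) + 1e-8)
    return [i for i, s in enumerate(sims) if s < threshold]
```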
16. The method of claim 12, wherein performing segment quality evaluation on each of the to-be-processed segments to obtain the corresponding segment quality evaluation result for each to-be-processed segment comprises:
for each target to-be-processed segment among the plurality of to-be-processed segments, extracting video frames from the target to-be-processed segment to obtain a plurality of to-be-processed video frames corresponding to the target to-be-processed segment, and acquiring description content of the target to-be-processed segment;
constructing a vector representation for each of the to-be-processed video frames, and constructing a vector representation of the description content of the target to-be-processed segment;
performing multi-modal fusion on the vector representations of the to-be-processed video frames and the vector representation of the description content of the target to-be-processed segment to obtain a vector representation of the target to-be-processed segment;
and performing segment quality evaluation based on the vector representation of the target to-be-processed segment to obtain the segment quality evaluation result corresponding to the target to-be-processed segment.
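A sketch of the quality evaluation in claim 16, reusing the claim-3 style fusion and mapping the fused vector to a score with a linear head; the (here uniform) weights stand in for whatever trained evaluator the application envisions:

```python
import numpy as np

def quality_score(frame_vecs, description_vec, head_weights=None):
    # Fuse frames and description (mean pooling + concatenation), then map the
    # fused vector to a score in (0, 1) with a linear head followed by a sigmoid.
    fused = np.concatenate([np.mean(np.stack(frame_vecs), axis=0), description_vec])
    if head_weights is None:
        head_weights = np.ones_like(fused) / fused.size
    return float(1.0 / (1.0 + np.exp(-(head_weights @ fused))))
```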
17. A summary video segment generation apparatus, the apparatus comprising:
a video segment acquisition module, configured to acquire a plurality of video segments for a video to be played when the video to be played of a target object exists;
a video segment synthesis module, configured to determine, from the plurality of video segments, a video segment set that matches selection bias information of the target object, and to perform video segment synthesis based on the video segment set to obtain at least two synthesized video segments;
a video segment evaluation module, configured to perform playback completion evaluation on each of the at least two synthesized video segments based on the selection bias information and the video to be played, to obtain a corresponding evaluation result for each synthesized video segment;
and a video segment selection module, configured to determine, from the at least two synthesized video segments, a summary video segment of the target object for the video to be played based on the respective evaluation results of the at least two synthesized video segments.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 16.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.
CN202310077114.5A 2023-01-13 2023-01-13 Method, device, computer equipment and storage medium for generating abstract video clips Pending CN116980645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310077114.5A CN116980645A (en) 2023-01-13 2023-01-13 Method, device, computer equipment and storage medium for generating abstract video clips

Publications (1)

Publication Number Publication Date
CN116980645A true CN116980645A (en) 2023-10-31

Family

ID=88481998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310077114.5A Pending CN116980645A (en) 2023-01-13 2023-01-13 Method, device, computer equipment and storage medium for generating abstract video clips

Country Status (1)

Country Link
CN (1) CN116980645A (en)

Legal Events

Date Code Title Description
PB01 Publication