CN117835004A - Method, apparatus and computer readable medium for generating video viewpoints - Google Patents

Info

Publication number: CN117835004A
Application number: CN202410020695.3A
Authority: CN (China)
Prior art keywords: video, target, viewpoint, information, target video
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 李方亮, 程伟, 蔡春磊, 王媛媛, 汤然
Current Assignee: Shanghai Bilibili Technology Co., Ltd.
Original Assignee: Shanghai Bilibili Technology Co., Ltd.
Application filed by Shanghai Bilibili Technology Co., Ltd.
Priority to CN202410020695.3A; published as CN117835004A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, and computer-readable medium for generating video viewpoints are provided. The method according to the application comprises the following steps: obtaining viewpoint reference information of a target video, wherein the viewpoint reference information comprises a secondary-creation (derivative) video matched with the target video and/or subtitle information of the target video; analyzing the target video based on the viewpoint reference information to obtain a plurality of target viewpoint segments contained in the target video; and generating the viewpoint information of the target video based on the target viewpoint segments. By analyzing the target video based on a matched secondary-creation video or on the subtitle information of the target video, the highlight segments or high-energy viewpoints contained in the target video are obtained automatically, which greatly improves efficiency compared with manually identifying the high-energy viewpoints of a video.

Description

Method, apparatus and computer readable medium for generating video viewpoints
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method, an apparatus, and a computer readable medium for generating a video viewpoint.
Background
Currently, as video has become a main mode of leisure, entertainment and information transfer, more and more video apps support inline playback in a list. Inline playback refers to playing a video directly in a Feed stream (a continuously updated stream of content presented to the user): when a user browses videos in the corresponding Feed stream, the corresponding video is played automatically. When users browse long videos, the product hopes to present and play the highlight clips of those videos so as to attract users to click and watch them.
In prior art schemes, the common practice is to manually mark some high-energy viewpoints for a video, or to decide them by analyzing user interaction data. However, manually identifying high-energy viewpoints has a high labor cost and can only cover a small amount of important content, while the approach of analyzing user interaction data cannot decide high-energy viewpoints in time for unpopular content with little interaction or for new content without enough accumulated data.
Disclosure of Invention
Aspects of the present application provide a method, apparatus, and computer-readable medium for generating a video viewpoint.
In one aspect of the present application, a method for generating a video viewpoint is provided, wherein the method includes:
obtaining viewpoint reference information of a target video, wherein the viewpoint reference information comprises a secondary-creation video matched with the target video and/or subtitle information of the target video;
analyzing the target video based on the viewpoint reference information to obtain a plurality of target viewpoint segments contained in the target video;
and generating the viewpoint information of the target video based on the target viewpoint segments.
In one aspect of the present application, an apparatus for generating a video viewpoint is provided, where the apparatus includes:
means for acquiring viewpoint reference information of a target video, wherein the viewpoint reference information comprises a secondary-creation video matched with the target video and/or subtitle information of the target video;
means for analyzing the target video based on the viewpoint reference information to obtain a plurality of target viewpoint segments contained in the target video;
and means for generating the viewpoint information of the target video based on the target viewpoint segments.
In another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the embodiments of the present application.
In another aspect of the present application, a computer-readable storage medium having stored thereon computer program instructions executable by a processor to implement a method of an embodiment of the application is provided.
In the scheme provided by the embodiments of the application, the target video is analyzed based on a secondary-creation video matched with the target video or on the subtitle information of the target video, so that the highlight segments or high-energy viewpoints contained in the target video are obtained automatically, which greatly improves efficiency compared with manually identifying high-energy viewpoints. The method of generating video viewpoints based on matched secondary-creation videos makes full use of secondary-creation content data and can locate the high-energy or highlight segments of a video quickly and accurately, especially for long videos with large traffic. The method of generating video viewpoints based on subtitle information can extract viewpoints from various types of videos, and improves the accuracy of the finally generated video viewpoints through evaluations such as dynamic scene evaluation and aesthetic feature analysis of the video segments used to generate the viewpoints.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 shows a schematic flow chart for generating video viewpoints according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram of a method of generating video viewpoints based on a secondary-creation video matched with the target video, in accordance with an embodiment of the present application;
FIG. 3 illustrates a schematic diagram of an exemplary high-energy viewpoint extraction scheme based on secondary-creation videos;
FIG. 4 is a flow chart illustrating a method of generating video points of view based on caption information according to an embodiment of the present application;
FIG. 5 illustrates a schematic diagram of an exemplary multi-modal high-energy point of view extraction scheme;
FIG. 6 is a schematic structural diagram of an apparatus for generating a video viewpoint according to an embodiment of the present application;
FIG. 7 shows a schematic structural diagram of an apparatus for generating video viewpoints based on a secondary-creation video matched with the target video, according to an embodiment of the present application;
fig. 8 is a schematic diagram illustrating a structure of an apparatus for generating a video viewpoint based on subtitle information according to an embodiment of the present application;
fig. 9 shows a schematic structural diagram of an apparatus suitable for implementing the solution in the embodiments of the present application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In a typical configuration of the present application, the terminals and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer program instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
Fig. 1 shows a flowchart of a method for generating a video viewpoint according to an embodiment of the present application. The method at least comprises step S101, step S102 and step S103.
In an actual scenario, the execution body of the method may be a user device, a device formed by integrating a user device and a network device through a network, or an application running on such a device. The user device includes, but is not limited to, various terminal devices such as computers, mobile phones, tablet computers, smart watches and wristbands; the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing, and may be used to implement part of the processing functions. Here, the cloud is composed of a large number of hosts or web servers based on Cloud Computing, a kind of distributed computing in which one virtual computer is composed of a group of loosely coupled computers.
In the context of a video website, the execution body of the method may be a server for generating video viewpoints. A video viewpoint corresponds to a highlight or high-energy segment of a video.
Referring to fig. 1, in step S101, viewpoint reference information of a target video is acquired.
The target video is a video of a video viewpoint to be generated. Optionally, the target video is a long video.
The viewpoint reference information comprises a secondary-creation video matched with the target video and/or subtitle information of the target video.
The process of acquiring a secondary-creation video matched with the target video and the process of acquiring the subtitle information of the target video will be described below with reference to the method flowcharts shown in fig. 2 and fig. 4, respectively.
In step S102, a plurality of target viewpoint segments included in the target video are obtained by performing an analysis process on the target video based on the viewpoint reference information.
A target viewpoint segment is a video segment used for determining the video viewpoints of the target video.
Optionally, the target viewpoint segment is a high energy segment or highlight segment of the target video.
According to one embodiment, the viewpoint reference information is a secondary-creation video matched with the target video, and the method obtains the target viewpoint segments of the target video by analyzing the video segments in which the secondary-creation video overlaps the target video. Step S101 of this embodiment includes step S1011, and step S102 includes step S1021 and step S1022. The processes of step S1011, step S1021 and step S1022 will be described in detail later with reference to the method flowchart shown in fig. 2.
According to one embodiment, the viewpoint reference information is the subtitle information of the target video, and the method analyzes the subtitle information with a language model to obtain the target viewpoint segments of the target video. Step S101 of this embodiment includes step S1012, and step S102 includes step S1023, step S1024 and step S1025. The processes of step S1012, step S1023, step S1024 and step S1025 will be described in detail later with reference to the method flowchart shown in fig. 4.
Next, in step S103, video viewpoint information of the target video is generated based on the target viewpoint segment.
The viewpoint information identifies the high-energy segments or highlight segments contained in the target video.
Optionally, the viewpoint information includes play time point information, such as timestamp information, corresponding to one or more high-energy clips or highlight clips.
Optionally, the viewpoint information further includes text description information corresponding to the viewpoint.
The text description information of the viewpoint can be generated by language processing of the corresponding video clips. For example, a predetermined language model is used to generate a content summary of a video clip based on text information corresponding to the video clip, and so on.
Alternatively, the textual description information may be generated based on content entered by the user.
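For concreteness, the following is a minimal Python sketch of one way the generated viewpoint information could be represented; the dataclass layout and field names are illustrative assumptions, not structures defined by the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Viewpoint:
    """One high-energy/highlight point of a video (illustrative structure)."""
    start_sec: float                    # play time point where the segment begins
    end_sec: float                      # play time point where the segment ends
    description: Optional[str] = None   # optional text summary (model- or user-generated)

@dataclass
class VideoViewpointInfo:
    video_id: str
    viewpoints: List[Viewpoint] = field(default_factory=list)

# Example: a long video with two identified highlight segments
info = VideoViewpointInfo(
    video_id="example_video",
    viewpoints=[
        Viewpoint(83.0, 143.0, "Key plot twist"),
        Viewpoint(1570.0, 1630.0),
    ],
)
```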
In summary, the method analyzes the target video based on a secondary-creation video matched with the target video or on the subtitle information of the target video to automatically obtain the highlight segments or high-energy viewpoints contained in the target video, which greatly improves efficiency compared with manually identifying high-energy viewpoints.
Fig. 2 shows a flow diagram of a method of generating video viewpoints based on a secondary-creation video matched with the target video according to an embodiment of the present application.
The method shown in fig. 2 includes step S1011, step S1021, step S1022, and step S103.
Referring to fig. 2, in step S1011, a plurality of target secondary-creation videos matching the target video are obtained by searching based on fingerprint information of the target video.
A target secondary-creation video contains one or more video clips that overlap the target video. A secondary-creation video is a video obtained based on user generated content (User Generated Content, UGC). Optionally, the target secondary-creation video is a medium-length or short video.
Wherein the fingerprint information includes various information generated based on the video content that can uniquely identify the video, such as a string of fingerprint characters that can uniquely identify the current video.
Optionally, in step S1011, a pre-established video fingerprint library is searched based on the fingerprint information of the target video to obtain the plurality of target secondary-creation videos matching the target video, and the method according to this embodiment further includes step S104 and step S105.
In step S104, fingerprint information corresponding to a plurality of in-library videos is obtained based on a predetermined fingerprint extraction algorithm, where the in-library videos include original videos and secondary-creation videos.
Those skilled in the art are familiar with extracting video fingerprints based on various fingerprint extraction algorithms, for example algorithms using Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM), and can select an appropriate fingerprint extraction algorithm to obtain the fingerprint information of a video according to actual requirements.
In step S105, a video fingerprint library is established based on the video fingerprint information of the in-library video.
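To illustrate the fingerprint-library idea, here is a minimal Python sketch that uses per-frame average hashes as a stand-in for the CNN/LSTM-based fingerprint algorithms mentioned above; the hashing scheme, sampling rate and Hamming-distance threshold are assumptions for illustration, and the sketch assumes the `opencv-python` and `numpy` packages.

```python
import cv2
import numpy as np

def frame_ahash(frame, size=8):
    """Average hash of one frame: a 64-bit perceptual signature (illustrative)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (size, size))
    return (small > small.mean()).flatten()

def video_fingerprint(path, sample_fps=1.0):
    """Sample frames at roughly sample_fps and hash each one -> matrix of bit vectors."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps / sample_fps)), 1)
    hashes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            hashes.append(frame_ahash(frame))
        idx += 1
    cap.release()
    return np.array(hashes, dtype=bool)

library = {}  # fingerprint library: {video_id: fingerprint matrix}

def add_to_library(video_id, path):
    library[video_id] = video_fingerprint(path)

def search_library(query_fp, max_bit_dist=10):
    """Return ids of in-library videos sharing at least one near-identical frame hash."""
    matches = []
    for vid, fp in library.items():
        # Hamming distance between every query hash and every library hash
        dists = (query_fp[:, None, :] != fp[None, :, :]).sum(axis=2)
        if (dists <= max_bit_dist).any():
            matches.append(vid)
    return matches
```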
Optionally, the method according to this embodiment selects among the plurality of matched secondary-creation videos, and step S1011 further includes step S10111 and step S10112.
In step S10111, a search is performed based on the fingerprint information of the target video to obtain a plurality of candidate secondary-creation videos matching the target video.
In step S10112, the candidate secondary-creation videos are screened based on predetermined selection conditions, and the candidates satisfying the selection conditions are taken as target secondary-creation videos.
Optionally, the selection condition includes at least any one of the following:
1) The video play count exceeds a predetermined play count threshold;
2) The video playing duration exceeds a predetermined duration threshold;
3) The video interaction data meets predetermined requirements. For example, the number of likes, favorites, or comments of the video is greater than a predetermined threshold.
Optionally, the obtained candidate secondary-creation videos are ranked according to a predetermined ranking rule, and a predetermined number of the top-ranked candidates are selected as target secondary-creation videos.
The ranking rule may be formulated based on a variety of data that can be used to evaluate video quality.
For example, each candidate secondary-creation video is scored based on its play count, number of likes, number of comments, and so on, and the candidates are then ranked by score (a minimal sketch of this selection and ranking follows).
Since the matched secondary-creation videos may be numerous, selecting the higher-quality ones through this screening and ranking for determining the video viewpoints improves the accuracy of the generated video viewpoints.
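A minimal sketch of the selection-and-ranking step described above, assuming each candidate carries play, like and comment counts plus its duration; the thresholds and score weights are illustrative, not values from the application.

```python
def select_target_videos(candidates, min_plays=10_000, min_duration_sec=30, top_k=20):
    """Filter candidates by the selection conditions, then rank them by a simple quality score."""
    def score(v):
        # Illustrative weighting of interaction data; a real system would tune these weights.
        return v["plays"] + 20.0 * v["likes"] + 50.0 * v["comments"]

    eligible = [
        v for v in candidates
        if v["plays"] >= min_plays and v["duration_sec"] >= min_duration_sec
    ]
    return sorted(eligible, key=score, reverse=True)[:top_k]

# Example candidate record (illustrative fields):
# {"id": "c1", "plays": 52_000, "likes": 1_800, "comments": 240, "duration_sec": 95}
```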
Continuing with fig. 2, in step S1021, a plurality of overlapping video segments in which the plurality of target secondary-creation videos match the target video are acquired.
Specifically, for each target secondary-creation video, a set of video frames is obtained by frame-cutting the secondary-creation video. Next, similarity information between each video frame in the set and the target video is calculated based on a predetermined similarity calculation rule, where the similarity information includes various measures of the degree of similarity between video frames, such as mean square error, cosine similarity, or Euclidean distance. The video frames in the set are then matched against the target video based on the similarity information, and one or more overlapping video segments between the target secondary-creation video and the target video are obtained based on the matching result.
If the matching video frames in the similarity matching result correspond to one or more playing segments, those segments are directly taken as the overlapping video segments. If the matching video frames correspond to one or more playing time points, then for each playing time point a playing segment of predetermined duration containing that time point is taken as an overlapping video segment, according to a predetermined segment determination rule.
For example, assume the segment determination rule takes the 60 seconds of video after a playing time point as the corresponding overlapping segment, and that if adding 60 seconds to the playing time point exceeds the total duration of the target secondary-creation video, the segment from the playing time point to the end of that video is taken instead. For a target secondary-creation video with a duration of 5 minutes whose similarity matching result corresponds to the playing time point 1 minute 23 seconds, adding 60 seconds does not exceed the total duration, so the 60 seconds of video after 1 minute 23 seconds are taken as the overlapping video segment (see the sketch below).
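The matching-and-windowing logic described above can be sketched as follows, assuming both videos have already been reduced to lists of per-frame feature vectors sampled at a fixed interval; cosine similarity and the 60-second window merely mirror the example, and adjacent matches are not merged for brevity.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def overlapping_segments(derivative_frames, target_frames, frame_interval_sec,
                         derivative_duration_sec, sim_threshold=0.95, window_sec=60.0):
    """
    derivative_frames / target_frames: lists of 1-D feature vectors sampled every
    frame_interval_sec seconds. Returns (start_sec, end_sec) segments on the
    derivative video's timeline whose frames match some frame of the target video.
    """
    segments = []
    for i, df in enumerate(derivative_frames):
        if any(cosine_sim(df, tf) >= sim_threshold for tf in target_frames):
            start = i * frame_interval_sec
            # Clamp the window at the derivative video's end, as in the example above.
            end = min(start + window_sec, derivative_duration_sec)
            segments.append((start, end))
    return segments
```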
Next, in step S1022, the plurality of overlapping video segments are screened, and the overlapping video segments remaining after screening are taken as the target viewpoint segments.
Optionally, the method screens the overlapping video segments in ways including, but not limited to, at least any one of the following (a minimal filtering sketch follows this list):
1) Screening based on the playing region of the overlapping segment within the target video.
Specifically, the method obtains the playing region information of the overlapping video segments within the target video and then, based on this information, excludes overlapping segments whose playing region falls within a predetermined exclusion region.
The exclusion region includes the opening, the ending and/or advertisement segments of the target video.
2) Screening based on the playing duration of the overlapping segments.
Specifically, the method obtains the playing duration of each overlapping video segment and excludes overlapping segments whose duration is smaller than a predetermined threshold.
3) Screening based on the content of the overlapping segments.
Optionally, the method determines, based on predetermined content auditing rules, whether an overlapping segment contains violating content such as soft pornography or bloody violence, and excludes overlapping segments containing violating content.
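A minimal sketch of the first two screening rules (playing region and duration), with a pluggable hook standing in for the content audit; all thresholds are illustrative assumptions.

```python
def screen_segments(segments, video_duration_sec,
                    intro_sec=90.0, outro_sec=90.0, min_len_sec=10.0,
                    is_compliant=lambda seg: True):
    """
    segments: list of (start_sec, end_sec) tuples on the target video's timeline.
    Drops segments that fall in the opening/ending regions, are too short,
    or fail an external content-compliance check.
    """
    kept = []
    for start, end in segments:
        if end <= intro_sec:                          # inside the opening region
            continue
        if start >= video_duration_sec - outro_sec:   # inside the ending region
            continue
        if end - start < min_len_sec:                 # too short to be a viewpoint
            continue
        if not is_compliant((start, end)):            # placeholder for content auditing
            continue
        kept.append((start, end))
    return kept
```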
Next, in step S103, one or more high-energy viewpoint information of the target video is generated based on the target viewpoint segment.
The process of step S103 is described in the foregoing, and is not described in detail herein.
The method of the present embodiment will be described with reference to fig. 3.
Fig. 3 shows a schematic diagram of an exemplary high-energy viewpoint extraction scheme based on secondary-creation videos. The method shown in fig. 3 is performed by a server of a video website to extract high-energy viewpoints of long videos.
Referring to fig. 3, in the preparation phase, for the long videos in the video website's long-video database and the UGC secondary-creation videos of uploaders (UP masters) in the UGC database, corresponding video fingerprints are extracted based on a predetermined fingerprint extraction algorithm to establish a video fingerprint library.
For a long video to be processed, fingerprint retrieval is performed in the video fingerprint library based on the long video's fingerprint to obtain matched video data, which comprises a plurality of UGC secondary-creation medium/short videos matched with the long video after filtering. The filtering process ranks the retrieved videos based on their play count, playing duration and other interaction data, so that higher-quality secondary-creation videos are selected as the matched video data.
Then, the secondary-creation medium/short videos contained in the matched video data are frame-cut and the matching segment times between them and the long video are calculated, yielding a plurality of overlapping segments of the secondary-creation videos and the long video.
Then, the overlapping segments are screened based on predetermined screening rules, so as to exclude overlapping segments that are located at the opening or ending of the long video, whose playing duration is too short, or whose content violates the rules.
Next, a high-energy viewpoint of the long video is generated based on the overlapping segments obtained after the screening.
According to the method of this embodiment, the secondary-creation content data is fully utilized, and the high-energy or highlight segments of a video can be located quickly and accurately, especially for long videos with large traffic; moreover, compared with manually identifying high-energy viewpoints, efficiency is greatly improved.
Fig. 4 shows a flow diagram of a method of generating video points of view based on subtitle information according to an embodiment of the present application.
The method shown in fig. 4 includes step S1012, step S1023, step S1024, step S1025, and step S103.
Referring to fig. 4, in step S1012, subtitle information of a target video is acquired.
The ways of acquiring the subtitle information of the target video include, but are not limited to, at least any one of the following (a sketch of both approaches follows this list):
1) Acquiring external subtitle information of the target video;
2) Acquiring the subtitle information by performing subtitle recognition on the target video. Optionally, the method performs Optical Character Recognition (OCR) on video frames of the target video to recognize the characters in the subtitle region of the picture as subtitle information. Alternatively, the method recognizes the text content of the target video's speech as subtitle information through Automatic Speech Recognition (ASR).
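A rough sketch of the two acquisition paths, assuming the third-party `openai-whisper`, `pytesseract` and `opencv-python` packages (with Chinese language data installed for Tesseract) and assuming subtitles sit in the bottom strip of the frame; this is only one plausible way to obtain ASR/OCR subtitles with timestamps, not the application's implementation.

```python
import cv2
import pytesseract
import whisper  # openai-whisper package

def subtitles_via_asr(video_path, model_size="base"):
    """Transcribe the audio track; returns [(start_sec, end_sec, text), ...]."""
    model = whisper.load_model(model_size)
    result = model.transcribe(video_path)
    return [(seg["start"], seg["end"], seg["text"]) for seg in result["segments"]]

def subtitles_via_ocr(video_path, sample_every_sec=1.0, lang="chi_sim"):
    """OCR the bottom strip of sampled frames; a crude fallback when no external subtitles exist."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps * sample_every_sec)), 1)
    lines, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h = frame.shape[0]
            # Assume the subtitle region is the bottom 20% of the frame.
            strip = cv2.cvtColor(frame[int(h * 0.8):, :], cv2.COLOR_BGR2GRAY)
            text = pytesseract.image_to_string(strip, lang=lang).strip()
            if text:
                lines.append((idx / fps, text))  # timestamp of the sampled frame
        idx += 1
    cap.release()
    return lines
```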
In step S1023, language analysis is performed on the subtitle information by using a predetermined language model to obtain corresponding viewpoint indication information.
Wherein the viewpoint indication information includes various information that can be used to indicate a high-energy clip or highlight contained in the target video. For example, a play time point or time stamp information corresponding to the high-energy clip or highlight.
Wherein the language model includes various models that can be used to perform natural language processing tasks.
Optionally, the language model is a large language model (Large Language Model, LLM) that refers to a deep learning model trained using large amounts of text data, which can generate natural language text or understand the meaning of language text.
Optionally, the method according to the present embodiment further comprises step S106 and step S107.
In step S106, fine-tuning data for fine-tuning the language model is acquired.
The fine-tuning data includes, but is not limited to, the work name, genre, synopsis and other information associated with the target video.
In step S107, the language model is fine-tuned based on the fine-tuning data.
Optionally, in step S107 the method performs fine-tuning by means of user-prompt fine-tuning (P-tuning). User-prompt fine-tuning assumes that the pre-trained language model already has sufficient prior knowledge, so only the prompt encoder (embedding) needs to be adjusted for the model to better understand the user's instructions and perform the related task.
Optionally, the method segments the subtitle information into multiple chunks and processes each chunk with the language model separately.
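A minimal sketch of the chunked language-model analysis; `call_llm` is a placeholder for whatever model client is actually used, and the prompt wording and JSON reply format are assumptions made for illustration.

```python
import json

def build_prompt(subtitle_chunk, work_info, k=3):
    """Ask the model to pick k candidate high-energy segments from one subtitle chunk."""
    return (
        f"Work info: {work_info}\n"
        f"Subtitles with timestamps:\n{subtitle_chunk}\n\n"
        f"Select the {k} most dramatic, high-energy moments. "
        'Reply only with JSON: [{"start": seconds, "end": seconds, "reason": "..."}]'
    )

def extract_candidates(subtitle_chunks, work_info, call_llm, k=3):
    """call_llm(prompt) -> str stands in for the fine-tuned language model."""
    candidates = []
    for chunk in subtitle_chunks:
        reply = call_llm(build_prompt(chunk, work_info, k))
        try:
            candidates.extend(json.loads(reply))
        except json.JSONDecodeError:
            continue  # skip chunks where the model did not return valid JSON
    return candidates
```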
Next, in step S1024, a plurality of candidate viewpoint segments included in the target video are determined based on the viewpoint indication information.
In step S1025, the plurality of candidate viewpoint segments are screened by performing an evaluation process on the plurality of candidate viewpoint segments, so that the candidate viewpoint segment obtained after the screening is used as a target viewpoint segment.
The ways of screening the candidate viewpoint segments through evaluation include, but are not limited to, at least any one of the following (a combined sketch of these evaluations follows this list):
1) Evaluating the dynamism of the candidate viewpoint segments through dynamic scene evaluation, and screening the candidates based on the evaluation result.
The dynamic scene evaluation uses the scene-switching frequency to measure how dynamic a video segment is, and candidate viewpoint segments with a high scene-switching frequency are selected based on the evaluation result.
Specifically, the method may calculate the pixel difference between successive frames of a candidate segment using a threshold-based scene-cut method, and determine whether a scene switch occurs based on the pixel difference and a predetermined difference threshold, thereby obtaining the scene-switching frequency of each candidate segment.
2) Performing aesthetic feature evaluation on the candidate viewpoint segments and screening them based on the evaluation result.
The aesthetic feature evaluation measures, from an aesthetic point of view, how suitable a video segment is for viewing; for example, if a segment or frame is cluttered or too dark and therefore unsuitable for recommendation to the user, its aesthetic score is low. The method selects candidate viewpoint segments suitable for viewing through this evaluation.
Optionally, the aesthetic evaluation result is an aesthetic score: the method extracts several key frames from a candidate segment at equal intervals, scores them with a trained aesthetic evaluation model, and removes candidate segments whose score is below a threshold.
3) Evaluating the degree of highlight of the candidate viewpoint segments by analyzing the bullet-screen (danmaku) data of the target video.
For example, the degree of highlight of a candidate segment may be evaluated based on the number of bullet-screen comments or their text content.
Optionally, the method selects the evaluation mode according to the type of the target video; for example, dynamic scene evaluation is used for animation-type videos, while aesthetic feature evaluation is used for documentary-type videos.
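A combined sketch of the three evaluations above, assuming each candidate segment is available as a list of decoded frames (NumPy arrays) together with its bullet-screen count; the scene-cut measure, the placeholder aesthetic model and every threshold are illustrative assumptions.

```python
import numpy as np

def scene_cut_frequency(frames, diff_threshold=30.0):
    """Fraction of consecutive-frame pairs whose mean pixel difference exceeds the threshold."""
    cuts = 0
    for prev, cur in zip(frames, frames[1:]):
        diff = np.mean(np.abs(cur.astype(np.int16) - prev.astype(np.int16)))
        if diff > diff_threshold:
            cuts += 1
    return cuts / max(len(frames) - 1, 1)

def aesthetic_score(frames, score_frame, n_keyframes=5):
    """Average score of n equally spaced key frames; score_frame stands in for a trained model."""
    idx = np.linspace(0, len(frames) - 1, n_keyframes).astype(int)
    return float(np.mean([score_frame(frames[i]) for i in idx]))

def keep_segment(frames, danmaku_count, score_frame,
                 min_cut_freq=0.02, min_aesthetic=0.5, min_danmaku=20):
    """Keep a segment only if it is dynamic enough, visually acceptable, and draws enough danmaku."""
    return (scene_cut_frequency(frames) >= min_cut_freq
            and aesthetic_score(frames, score_frame) >= min_aesthetic
            and danmaku_count >= min_danmaku)
```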
Next, in step S103, one or more high-energy viewpoint information of the target video is generated based on the target viewpoint segment.
The process of step S103 is described in the foregoing, and is not described in detail herein.
The method of the present embodiment will be described with reference to fig. 5.
Fig. 5 shows a schematic diagram of an exemplary multi-modal high-energy viewpoint extraction scheme. Multi-modal refers to collaborative reasoning over heterogeneous modalities, for example jointly processing text, speech, image and video data, to provide a richer and more comprehensive representation and analysis. The method shown in fig. 5 is likewise performed by a server of a video website to extract high-energy viewpoints of long videos.
Referring to fig. 5, in the preparation phase, subtitles of one or more target videos to be processed are acquired. If the external subtitle of the target video can be directly acquired, the acquired external subtitle is used as subtitle information, and if the external subtitle cannot be directly acquired, the subtitle identification is performed through an ASR technology/OCR technology, so that the corresponding subtitle information is extracted.
Then, the associated media information of the target video, including the work name, genre, synopsis, episode titles and the like, is provided to the large language model after user-prompt fine-tuning (P-tuning), so that the model extracts a specified number of high-energy viewpoint segments from the subtitle information. Since the subtitle information of a long video is large, it is split into chunks before being processed by the large language model.
The large language model then performs its analysis based on its understanding of the subtitles and the basic storyline of the target video, and a preliminary extraction yields n segments as candidate high-energy viewpoint segments.
Next, in the multi-modal analysis step, the n candidate high-energy segments are screened through dynamic scene evaluation and aesthetic feature evaluation. Their degree of highlight is also evaluated based on the bullet-screen data, further screening the n segments; the segments remaining after screening are used as the final high-energy segments for generating the high-energy viewpoints.
According to this method, viewpoints can be extracted from various types of videos, and the accuracy of the finally generated video viewpoints is improved by evaluating the candidate segments through dynamic scene analysis, aesthetic feature analysis and the like; moreover, compared with manually identifying high-energy viewpoints, efficiency is greatly improved.
Fig. 6 shows a schematic structural diagram of an apparatus for generating a video viewpoint according to an embodiment of the present application.
The device comprises: a means for acquiring viewpoint reference information of a target video (hereinafter referred to as "information acquisition means 101"), a means for obtaining a plurality of target viewpoint segments included in the target video by analyzing and processing the target video based on the viewpoint reference information (hereinafter referred to as "segment acquisition means 102"), and a means for generating viewpoint information of the target video based on the target viewpoint segments (hereinafter referred to as "viewpoint generation means 103").
The information acquisition device 101 acquires viewpoint reference information of a target video.
The target video is a video of a video viewpoint to be generated. Optionally, the target video is a long video.
The viewpoint reference information comprises a secondary-creation video matched with the target video and/or subtitle information of the target video.
The process of acquiring a secondary-creation video matched with the target video and the process of acquiring the subtitle information of the target video will be described below with reference to the structural diagrams of the apparatus shown in fig. 7 and fig. 8, respectively.
The segment obtaining device 102 obtains a plurality of target viewpoint segments contained in the target video by analyzing and processing the target video based on the viewpoint reference information.
The target viewpoint fragment is a fragment of a video viewpoint for determining a target video.
Optionally, the target viewpoint segment is a high energy segment or highlight segment of the target video.
According to one embodiment, the viewpoint reference information is a secondary-creation video matched with the target video, and the apparatus obtains the target viewpoint segments of the target video by analyzing the video segments in which the secondary-creation video overlaps the target video. The information acquisition device 101 of this embodiment further includes a secondary-creation acquisition device 1011, and the segment acquisition device 102 further includes an overlapping segment acquisition device 1021 and an overlapping segment screening device 1022. The operations of the secondary-creation acquisition device 1011, the overlapping segment acquisition device 1021 and the overlapping segment screening device 1022 will be described in detail later with reference to the structural diagram shown in fig. 7.
According to one embodiment, the viewpoint reference information is the subtitle information of the target video, and the apparatus analyzes the subtitle information with a language model to obtain the target viewpoint segments of the target video. The information acquisition device 101 of this embodiment further includes a subtitle acquisition device 1012, and the segment acquisition device 102 further includes a language analysis device 1023, a candidate segment acquisition device 1024 and a candidate segment screening device 1025. The operations of the subtitle acquisition device 1012, the language analysis device 1023, the candidate segment acquisition device 1024 and the candidate segment screening device 1025 will be described in detail later with reference to the structural diagram shown in fig. 8.
The viewpoint generating means 103 generates video viewpoint information of the target video based on the target viewpoint segment.
The viewpoint information identifies the high-energy segments or highlight segments contained in the target video.
Optionally, the viewpoint information includes play time point information, such as timestamp information, corresponding to one or more high-energy clips or highlight clips.
Optionally, the viewpoint information further includes text description information corresponding to the viewpoint.
The text description information of the viewpoint can be generated by language processing of the corresponding video clips. For example, a predetermined language model is used to generate a content summary of a video clip based on text information corresponding to the video clip, and so on.
Alternatively, the textual description information may be generated based on content entered by the user.
According to the apparatus provided by this embodiment, the target video is analyzed based on a matched secondary-creation video or on the subtitle information of the target video, so that the highlight segments or high-energy viewpoints contained in the target video are obtained automatically, which greatly improves efficiency compared with manually identifying high-energy viewpoints.
Fig. 7 shows a schematic structural diagram of an apparatus for generating video viewpoints based on a secondary-creation video matched with the target video according to an embodiment of the present application.
The apparatus comprises: a means for searching based on fingerprint information of the target video to obtain a plurality of target secondary-creation videos matching the target video (hereinafter "secondary-creation acquisition device 1011"), a means for acquiring a plurality of overlapping video segments in which the target secondary-creation videos match the target video (hereinafter "overlapping segment acquisition device 1021"), a means for screening the overlapping video segments so that the segments remaining after screening are taken as target viewpoint segments (hereinafter "overlapping segment screening device 1022"), and the viewpoint generation device 103.
The secondary-creation acquisition device 1011 searches based on the fingerprint information of the target video to obtain a plurality of target secondary-creation videos matching the target video.
A target secondary-creation video contains one or more video clips that overlap the target video. A secondary-creation video is a video obtained based on user generated content (User Generated Content, UGC). Optionally, the target secondary-creation video is a medium-length or short video.
Wherein the fingerprint information includes various information generated based on the video content that can uniquely identify the video, such as a string of fingerprint characters that can uniquely identify the current video.
Optionally, the secondary-creation acquisition device 1011 searches a pre-established video fingerprint library based on the fingerprint information of the target video to obtain the target secondary-creation videos, and the apparatus according to this embodiment further comprises a fingerprint extraction device and a fingerprint library establishing device.
The fingerprint extraction device obtains fingerprint information corresponding to a plurality of in-library videos based on a predetermined fingerprint extraction algorithm, where the in-library videos include original videos and secondary-creation videos.
Those skilled in the art are familiar with extracting video fingerprints based on various fingerprint extraction algorithms, for example algorithms using Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM), and can select an appropriate fingerprint extraction algorithm to obtain the fingerprint information of a video according to actual requirements.
The fingerprint library establishing device establishes a video fingerprint library based on the video fingerprint information of the in-library video.
Optionally, the apparatus according to this embodiment selects among the plurality of matched secondary-creation videos: the secondary-creation acquisition device 1011 first searches based on the fingerprint information of the target video to obtain a plurality of candidate secondary-creation videos matching the target video, then screens the candidates based on predetermined selection conditions and takes the candidates satisfying the conditions as target secondary-creation videos.
Optionally, the selection condition includes at least any one of the following:
1) The video play count exceeds a predetermined play count threshold;
2) The video playing duration exceeds a predetermined duration threshold;
3) The video interaction data meets predetermined requirements. For example, the number of likes, favorites, or comments of the video is greater than a predetermined threshold.
Optionally, the secondary-creation acquisition device 1011 ranks the obtained candidate secondary-creation videos according to a predetermined ranking rule and selects a predetermined number of the top-ranked candidates as target secondary-creation videos.
The ranking rule may be formulated based on a variety of data that can be used to evaluate video quality.
For example, each candidate secondary-creation video is scored based on its play count, number of likes, number of comments, and so on, and the candidates are then ranked by score.
Since the matched secondary-creation videos may be numerous, selecting the higher-quality ones through this screening and ranking for determining the video viewpoints improves the accuracy of the generated video viewpoints.
Continuing with fig. 7, the overlapping segment acquisition device 1021 acquires a plurality of overlapping video segments in which the target secondary-creation videos match the target video.
Specifically, for each target secondary-creation video, the overlapping segment acquisition device 1021 obtains a set of video frames by frame-cutting the secondary-creation video. Next, similarity information between each video frame in the set and the target video is calculated based on a predetermined similarity calculation rule, where the similarity information includes various measures of the degree of similarity between video frames, such as mean square error, cosine similarity, or Euclidean distance. The video frames in the set are then matched against the target video based on the similarity information, and one or more overlapping video segments between the target secondary-creation video and the target video are obtained based on the matching result.
If the matching video frames in the similarity matching result correspond to one or more playing segments, the overlapping segment acquisition device 1021 directly takes those segments as the overlapping video segments. If the matching video frames correspond to one or more playing time points, then for each playing time point the device takes a playing segment of predetermined duration containing that time point as an overlapping video segment, according to a predetermined segment determination rule.
For example, assume the segment determination rule takes the 60 seconds of video after a playing time point as the corresponding overlapping segment, and that if adding 60 seconds to the playing time point exceeds the total duration of the target secondary-creation video, the overlapping segment acquisition device 1021 takes the segment from the playing time point to the end of that video instead. For a target secondary-creation video with a duration of 5 minutes whose similarity matching result corresponds to the playing time point 1 minute 23 seconds, adding 60 seconds does not exceed the total duration, so the device takes the 60 seconds of video after 1 minute 23 seconds as the overlapping video segment.
Then, the overlapping-segment screening device 1022 performs a screening process on the plurality of overlapping video segments, so that the overlapping video segments obtained after the screening are used as target viewpoint segments.
Optionally, the ways in which the overlapping segment screening device 1022 screens the overlapping video segments include, but are not limited to, at least any one of the following:
1) Screening based on the playing region of the overlapping segment within the target video.
Specifically, the overlapping segment screening device 1022 obtains the playing region information of the overlapping video segments within the target video and then, based on this information, excludes overlapping segments whose playing region falls within a predetermined exclusion region.
The exclusion region includes the opening, the ending and/or advertisement segments of the target video.
2) Screening based on the playing duration of the overlapping segments.
Specifically, the overlapping segment screening device 1022 obtains the playing duration of each overlapping video segment and excludes overlapping segments whose duration is smaller than a predetermined threshold.
3) Screening based on the content of the overlapping segments.
Optionally, the overlapping segment screening device 1022 determines, based on predetermined content auditing rules, whether an overlapping video segment contains violating content such as soft pornography or bloody violence, and excludes overlapping segments containing violating content.
Next, the viewpoint generating means 103 generates one or more pieces of high-energy viewpoint information of the target video based on the target viewpoint segments.
The operation of the point of view generating means 103 has been described in the foregoing and is not described in detail here.
According to the apparatus of this embodiment, the secondary-creation content data is fully utilized, and the high-energy or highlight segments of a video can be located quickly and accurately, especially for long videos with large traffic.
Fig. 8 shows a schematic structural diagram of an apparatus for generating a video viewpoint based on subtitle information according to an embodiment of the present application.
The apparatus comprises: a means for acquiring subtitle information of the target video (hereinafter "subtitle acquisition device 1012"), a means for obtaining corresponding viewpoint indication information by performing language analysis on the subtitle information with a predetermined language model (hereinafter "language analysis device 1023"), a means for determining a plurality of candidate viewpoint segments contained in the target video based on the viewpoint indication information (hereinafter "candidate segment acquisition device 1024"), a means for screening the candidate viewpoint segments through evaluation so that the candidates remaining after screening are taken as target viewpoint segments (hereinafter "candidate segment screening device 1025"), and the viewpoint generation device 103.
Referring to fig. 8, the subtitle acquisition apparatus 1012 acquires subtitle information of a target video.
The ways in which the subtitle acquisition device 1012 acquires the subtitle information of the target video include, but are not limited to, at least any one of the following:
1) Acquiring external subtitle information of the target video;
2) Acquiring the subtitle information by performing subtitle recognition on the target video. Optionally, the device performs Optical Character Recognition (OCR) on video frames of the target video to recognize the characters in the subtitle region of the picture as subtitle information. Alternatively, the device recognizes the text content of the target video's speech as subtitle information through Automatic Speech Recognition (ASR).
The language analysis device 1023 performs language analysis on the subtitle information by using a predetermined language model to obtain corresponding viewpoint indication information.
Wherein the viewpoint indication information includes various information that can be used to indicate a high-energy clip or highlight contained in the target video. For example, a play time point or time stamp information corresponding to the high-energy clip or highlight.
Wherein the language model includes various models that can be used to perform natural language processing tasks.
Optionally, the language model is a large language model (Large Language Model, LLM) that refers to a deep learning model trained using large amounts of text data, which can generate natural language text or understand the meaning of language text.
Optionally, the apparatus according to this embodiment further includes a data acquisition device and a fine-tuning device.
The data acquisition device acquires fine-tuning data for fine-tuning the language model.
The fine-tuning data includes, but is not limited to, the work name, genre, synopsis and other information associated with the target video.
The fine-tuning device fine-tunes the language model based on the fine-tuning data.
Optionally, the fine-tuning device performs fine-tuning by means of user-prompt fine-tuning (P-tuning). User-prompt fine-tuning assumes that the pre-trained language model already has sufficient prior knowledge, so only the prompt encoder (embedding) needs to be adjusted for the model to better understand the user's instructions and perform the related task.
Optionally, the language analysis device 1023 segments the subtitle information into multiple chunks and processes each chunk with the language model.
Next, the candidate segment acquisition device 1024 determines a plurality of candidate viewpoint segments contained in the target video based on the viewpoint indication information.
The candidate segment screening means 1025 screens the plurality of candidate viewpoint segments by performing evaluation processing on the plurality of candidate viewpoint segments, thereby taking the candidate viewpoint segment obtained after the screening as a target viewpoint segment.
Wherein, the candidate segment screening device 1025 performs the evaluation processing on the candidate viewpoint segments to screen the candidate viewpoint segments, which includes but is not limited to at least any one of the following:
1) Evaluating the dynamicity of the candidate viewpoint fragments by performing dynamic scene evaluation, thereby screening the candidate viewpoint fragments based on a dynamic scene evaluation result;
the dynamic scene evaluation mode adopts scene switching frequency to evaluate the dynamic property of the video clips, and the clip screening device 1025 screens candidate viewpoint clips with high scene switching frequency through the dynamic scene evaluation result.
Specifically, the candidate segment screening device 1025 may calculate pixel differences between successive frames of the candidate video segments by a method based on a Scene cut (Scene cut) of a threshold, and determine whether Scene switching occurs based on the pixel differences and a predetermined difference threshold, thereby obtaining a Scene switching frequency corresponding to each candidate video segment.
2) Screening the plurality of candidate viewpoint segments based on an aesthetic feature evaluation result by performing aesthetic feature evaluation on the plurality of candidate viewpoint segments;
the aesthetic feature evaluation assesses, from an aesthetic point of view, how suitable a video segment is for viewing; for example, if a video segment or video frame is cluttered or too dark and therefore unsuitable for recommendation to the user, its score under the aesthetic feature analysis is low. The candidate segment screening device 1025 uses the aesthetic feature evaluation to retain candidate viewpoint segments that are suitable for viewing.
Optionally, the aesthetic feature evaluation result is an aesthetic score: the candidate segment screening device 1025 extracts a plurality of key frames from each candidate viewpoint segment at equal intervals, scores these key frames with a trained aesthetic evaluation model, and removes candidate viewpoint segments whose scores fall below a threshold (see the sketch after this list).
3) Evaluating the highlight level of the plurality of candidate viewpoint segments by analyzing the bullet screen data of the target video.
For example, the candidate segment screening device 1025 may evaluate the highlight level of a candidate viewpoint segment based on the number of bullet screen comments it receives or on their text content (see the sketch after this list).
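The following sketch illustrates the threshold-based scene-cut statistic described in item 1). It assumes OpenCV is available; the difference threshold and fallback frame rate are illustrative values, not values taken from the embodiment.

import cv2

def scene_switch_frequency(video_path, start_sec, end_sec, diff_threshold=30.0):
    """Count scene cuts in a segment, as switches per second, using the mean
    absolute pixel difference between consecutive grayscale frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    cap.set(cv2.CAP_PROP_POS_MSEC, start_sec * 1000)
    prev_gray, switches, frames = None, 0, 0
    while cap.get(cv2.CAP_PROP_POS_MSEC) <= end_sec * 1000:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            diff = cv2.absdiff(gray, prev_gray).mean()
            if diff > diff_threshold:        # large jump between frames => likely scene switch
                switches += 1
        prev_gray = gray
        frames += 1
    cap.release()
    duration = max(frames / fps, 1e-6)
    return switches / duration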
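For item 2), a minimal sketch of equal-interval key-frame sampling and score-based filtering is shown below; score_frame is a placeholder for the trained aesthetic evaluation model, which the embodiment does not specify, and the threshold is an assumed value.

def filter_by_aesthetics(candidate_segments, score_frame, num_keyframes=5, min_score=0.5):
    """candidate_segments: list of dicts, each with 'frames' holding the segment's decoded frames.
    score_frame stands in for a trained aesthetic scoring model returning a float per frame."""
    kept = []
    for seg in candidate_segments:
        frames = seg["frames"]
        if not frames:
            continue
        step = max(len(frames) // num_keyframes, 1)
        keyframes = frames[::step][:num_keyframes]      # equal-interval sampling
        score = sum(score_frame(f) for f in keyframes) / len(keyframes)
        if score >= min_score:                          # drop segments scoring below threshold
            kept.append(seg)
    return kept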
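For item 3), the sketch below scores a segment by combining bullet screen density with simple keyword matching; the keyword list and the weighting are illustrative assumptions only.

HIGHLIGHT_WORDS = ("高能", "名场面", "哈哈哈", "awsl")   # illustrative keywords only

def danmaku_highlight_score(segment_start, segment_end, danmaku):
    """danmaku: list of (timestamp_sec, text). Combine comment density with keyword hits."""
    in_segment = [text for t, text in danmaku if segment_start <= t <= segment_end]
    duration = max(segment_end - segment_start, 1e-6)
    density = len(in_segment) / duration
    keyword_hits = sum(any(w in text for w in HIGHLIGHT_WORDS) for text in in_segment)
    return density + 0.5 * keyword_hits    # weighting is an arbitrary choice for this sketch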
Optionally, the candidate segment screening device 1025 selects the evaluation mode applied to the candidate viewpoint segments based on the type of the target video. For example, videos of the two-dimensional animation type may be evaluated with the dynamic scene evaluation mode, while videos of the documentary type may be evaluated with the aesthetic feature evaluation mode. A possible dispatch is sketched below.
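A possible way to map video types to evaluation modes is sketched here; the type names and the mapping itself are assumptions for illustration, and the evaluator callables would be scoring functions such as those sketched above.

EVALUATORS_BY_TYPE = {
    "animation":   ["dynamic_scene"],
    "documentary": ["aesthetic"],
    "variety":     ["danmaku", "dynamic_scene"],   # illustrative mapping only
}

def evaluate_candidates(video_type, candidate_segments, evaluators):
    """evaluators: dict mapping evaluator name -> callable(segment) -> score."""
    names = EVALUATORS_BY_TYPE.get(video_type, ["aesthetic"])
    scored = []
    for seg in candidate_segments:
        score = sum(evaluators[name](seg) for name in names) / len(names)
        scored.append((score, seg))
    # Return segments ordered from most to least promising.
    return [seg for score, seg in sorted(scored, key=lambda x: x[0], reverse=True)]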
Next, the viewpoint generating means 103 generates one or more pieces of high-energy viewpoint information of the target video based on the target viewpoint segments.
The operation of the point of view generating means 103 has been described in the foregoing and is not described in detail here.
The apparatus according to this embodiment can perform viewpoint extraction on various types of videos, and improves the accuracy of the finally generated video viewpoints by applying dynamic scene evaluation, aesthetic feature analysis, and the like to the video segments used to generate the video viewpoints.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device. The method corresponding to the electronic device is the method for generating a video viewpoint in the foregoing embodiments, and the principle by which it solves the problem is similar. The electronic device provided by the embodiment of the application comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods and/or aspects of the various embodiments of the present application described above.
The electronic device may be a user device, a device formed by integrating a user device and a network device through a network, or an application running on such a device. The user device includes, but is not limited to, computers, mobile phones, tablet computers, smart watches, bracelets, and other terminal devices; the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a cloud-computing-based computer set, and may be used to implement part of the processing functions of the present application. Here, the cloud is composed of a large number of hosts or network servers based on Cloud Computing, a form of distributed computing in which a group of loosely coupled computers acts as one virtual computer.
Fig. 9 shows a structure of an apparatus suitable for implementing the method and/or technical solution in the embodiments of the present application. The apparatus 1200 includes a central processing unit (CPU) 1201, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for system operation. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, mouse, touch screen, microphone, infrared sensor, etc.; an output section 1207 including a display such as a cathode ray tube (CRT), liquid crystal display (LCD), LED display, or OLED display, and a speaker; a storage section 1208 comprising one or more computer-readable media such as a hard disk, optical disk, magnetic disk, or semiconductor memory; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 1209 performs communication processing via a network such as the Internet.
In particular, the methods and/or embodiments of the present application may be implemented as a computer software program. For example, embodiments disclosed herein include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 1201.
Another embodiment of the present application also provides a computer readable storage medium having stored thereon computer program instructions executable by a processor to implement the method and/or the technical solution of any one or more embodiments of the present application described above.
In particular, the present embodiments may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; e.g., the division of the units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (15)

1. A method for generating a video viewpoint, wherein the method comprises:
obtaining point of view reference information of a target video, wherein the point of view reference information comprises a two-creation video and/or subtitle information of the target video matched with the target video;
analyzing and processing the target video based on the viewpoint reference information to obtain a plurality of target viewpoint fragments contained in the target video;
and generating the viewpoint information of the target video based on the target viewpoint fragment.
2. The method of claim 1, wherein the point of view reference information is a plurality of two-creation videos matched with the target video, and the obtaining the point of view reference information of the target video comprises:
searching based on fingerprint information of a target video to obtain a plurality of target two-creation videos matched with the target video;
the analyzing the target video based on the viewpoint reference information to obtain a plurality of target viewpoint fragments contained in the target video includes:
acquiring a plurality of coincident video clips of the plurality of target two-creation videos matched with the target video;
and screening the plurality of coincident video clips, so that the coincident video clips retained after screening are used as target viewpoint fragments.
3. The method of claim 2, wherein the acquiring the plurality of coincident video clips of the plurality of target two-creation videos matched with the target video comprises:
for each target two-creation video, obtaining a video frame set of the target two-creation video by carrying out frame cutting processing on the target two-creation video;
calculating similarity information of each video frame in the video frame set and a target video based on a preset similarity calculation rule;
and carrying out similarity matching on the video frames of the video frame set and the target video based on the similarity information, so as to obtain one or more coincident video clips of the target two-creation video and the target video based on a similarity matching result.
4. A method according to claim 2 or 3, wherein the method further comprises:
based on a preset fingerprint extraction algorithm, obtaining fingerprint information corresponding to a plurality of in-library videos, wherein the in-library videos comprise original videos and two-creation videos;
and establishing a video fingerprint library based on the fingerprint information of the in-library videos.
5. A method according to claim 2 or 3, wherein the screening the plurality of coincident video clips comprises:
acquiring corresponding playing area information of the plurality of coincident video clips in the target video;
and eliminating the coincident video clips of which the playing areas belong to the preset target elimination areas based on the playing area information.
6. A method according to claim 2 or 3, wherein the screening the plurality of coincident video clips comprises:
acquiring playing time length information corresponding to the multiple coincident video clips;
and eliminating the coincident video clips with the playing time length smaller than a preset threshold value based on the playing time length information.
7. A method according to claim 2 or 3, wherein the retrieving based on fingerprint information of the target video to obtain a plurality of target two-creation videos matching the target video comprises:
searching based on fingerprint information of a target video to obtain a plurality of candidate two-creation videos matched with the target video;
and screening the candidate two-creation videos based on a preset selection condition, and taking the two-creation videos meeting the selection condition as target two-creation videos.
8. The method of claim 7, wherein the selection condition comprises at least any one of:
the play amount exceeds a predetermined play amount threshold;
the video playing time exceeds a preset time threshold;
the video interaction data meets predetermined requirements.
9. The method of claim 1, wherein the point of view reference information is subtitle related information of a target video, and the obtaining the point of view reference information of the target video includes:
acquiring subtitle information of a target video;
the analyzing the target video based on the viewpoint reference information to obtain a plurality of target viewpoint fragments contained in the target video includes:
performing language analysis on the caption information by using a preset language model to obtain corresponding viewpoint indication information;
determining a plurality of candidate viewpoint fragments contained in the target video based on the viewpoint indication information;
and screening the candidate viewpoint fragments by carrying out evaluation processing on the candidate viewpoint fragments, so that the candidate viewpoint fragments obtained after screening are used as target viewpoint fragments.
10. The method of claim 9, wherein the method further comprises:
acquiring fine-tuning processing data for performing fine-tuning processing;
and performing fine-tuning processing on the language model based on the fine-tuning processing data.
11. The method of claim 9 or 10, wherein the screening the plurality of candidate viewpoint fragments by performing evaluation processing on the plurality of candidate viewpoint fragments comprises:
evaluating the dynamism of the candidate viewpoint fragments by performing dynamic scene evaluation, and screening the candidate viewpoint fragments based on a dynamic scene evaluation result.
12. The method of claim 9 or 10, wherein the screening the plurality of candidate viewpoint fragments by performing evaluation processing on the plurality of candidate viewpoint fragments comprises:
performing aesthetic feature evaluation on the plurality of candidate viewpoint fragments, and screening the plurality of candidate viewpoint fragments based on an aesthetic feature evaluation result.
13. An apparatus for generating a video viewpoint, wherein the apparatus comprises:
means for acquiring viewpoint reference information of a target video, wherein the viewpoint reference information comprises a two-creation video matched with the target video and/or subtitle information of the target video;
means for obtaining a plurality of target viewpoint fragments contained in the target video by analyzing and processing the target video based on the viewpoint reference information;
and means for generating viewpoint information of the target video based on the target viewpoint fragments.
14. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 12.
15. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 12.
CN202410020695.3A 2024-01-05 2024-01-05 Method, apparatus and computer readable medium for generating video viewpoints Pending CN117835004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410020695.3A CN117835004A (en) 2024-01-05 2024-01-05 Method, apparatus and computer readable medium for generating video viewpoints

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410020695.3A CN117835004A (en) 2024-01-05 2024-01-05 Method, apparatus and computer readable medium for generating video viewpoints

Publications (1)

Publication Number Publication Date
CN117835004A true CN117835004A (en) 2024-04-05

Family

ID=90518857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410020695.3A Pending CN117835004A (en) 2024-01-05 2024-01-05 Method, apparatus and computer readable medium for generating video viewpoints

Country Status (1)

Country Link
CN (1) CN117835004A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination