CN112182297A - Training an information fusion model, and method and device for generating a highlight video


Info

Publication number
CN112182297A
CN112182297A (application CN202011057544.3A)
Authority
CN
China
Prior art keywords
information
attention
stop time
highlight
event
Prior art date
Legal status: Pending
Application number
CN202011057544.3A
Other languages
Chinese (zh)
Inventor
马彩虹
叶芷
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011057544.3A
Publication of CN112182297A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 — Information retrieval; Database structures therefor; File system structures therefor, of video data
    • G06F 16/78 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867 — Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 — Fusion techniques


Abstract

The application discloses a method and a device for training an information fusion model and for generating a highlight video, and relates to artificial intelligence technologies such as intelligent recognition, computer vision, deep learning and cloud computing. The specific implementation scheme is as follows: acquiring keywords of the title of an input video; acquiring highlight attention information from the images and/or audio of each frame of the input video; determining information to be fused based on the highlight attention information and the keywords of the title; inputting the information to be fused into an information fusion model, which outputs a plurality of pieces of alternative material information each including a character, an event and a start-stop time; and determining target material information from the alternative material information, cutting the video within the time period corresponding to the target material information out of the input video as target segments, and splicing the target segments into a highlight video. Because the information to be fused is derived from multiple dimensions of the input video, the generated alternative material information is accurate and comprehensive, different types of highlight videos can be generated from it, and the diversified requirements of users are met.

Description

Training an information fusion model, and method and device for generating a highlight video
Technical Field
The application relates to the field of computer technology, in particular to artificial intelligence technologies such as intelligent recognition, computer vision and deep learning, and to cloud computing, and specifically to a method and a device for training an information fusion model and generating a highlight video.
Background
In video applications, after a live event ends, a highlight video is often pushed to the user.
In the prior art, there are two schemes for synthesizing a highlight video: (1) manually editing, reviewing and synthesizing the live video; (2) locating specific highlight actions in time, such as red/yellow cards in a football match, training a machine-learning model on these actions, detecting them with the trained model, and then automatically synthesizing highlights of the events of interest.
Disclosure of Invention
The application provides a method and a device for training an information fusion model and generating a highlight video.
According to a first aspect, a method for training an information fusion model is provided, which includes: acquiring keywords of the title of an input video; acquiring highlight attention information from the images and/or audio of each frame of the input video; determining information to be fused based on the highlight attention information and the keywords of the title; and taking the information to be fused as the input of an information fusion model, taking alternative material information acquired from the input video and including characters, events and start-stop times as the expected output of the information fusion model corresponding to the information to be fused, and training an initial model of the information fusion model to obtain the trained information fusion model.
According to a second aspect, a method for generating a highlight video is provided, comprising: acquiring keywords of the title of an input video; acquiring highlight attention information from the images and/or audio of each frame of the input video; determining information to be fused based on the highlight attention information and the keywords of the title; inputting the information to be fused into an information fusion model trained by the method for training an information fusion model according to any one of claims 1-8, the information fusion model outputting a plurality of pieces of alternative material information including characters, events and start-stop times; and determining target material information from the alternative material information, cutting the video within the time period corresponding to the target material information out of the input video as target segments, and splicing the target segments into a highlight video.
According to a third aspect, an apparatus for training an information fusion model is provided, comprising: a title keyword acquisition module configured to acquire keywords of the title of an input video; a highlight attention information acquisition module configured to acquire highlight attention information from the images and/or audio of each frame of the input video; an information-to-be-fused determining module configured to determine information to be fused based on the highlight attention information and the keywords of the title; and a model training module configured to take the information to be fused as the input of an information fusion model, take alternative material information acquired from the input video and including characters, events and start-stop times as the expected output of the information fusion model corresponding to the information to be fused, and train an initial model of the information fusion model to obtain the trained information fusion model.
According to a fourth aspect, an apparatus for generating a highlight video is provided, comprising: a title keyword acquisition module configured to acquire keywords of the title of an input video; a highlight attention information acquisition module configured to acquire highlight attention information from the images and/or audio of each frame of the input video; an information-to-be-fused determining module configured to determine information to be fused based on the highlight attention information and the keywords of the title; an information fusion module configured to input the information to be fused into an information fusion model trained by the above method for training an information fusion model, the information fusion model outputting a plurality of pieces of alternative material information including characters, events and start-stop times; and a highlight video generation module configured to determine target material information from the alternative material information, cut the video within the time period corresponding to the target material information out of the input video as target segments, and splice the target segments into a highlight video.
Compared with the prior art, in which manually synthesizing a highlight video is time-consuming and labor-intensive, and automatically synthesizing a highlight video of an event of interest through motion detection is low-dimensional and poorly targeted, the technical scheme of the application first obtains keywords from the title of the input video and highlight attention information from the images and/or audio of each frame, then determines information to be fused from the highlight attention information and the keywords of the title, and finally obtains alternative material information from the information to be fused and generates a highlight video from the alternative material information. Because the information to be fused is derived from multiple dimensions of the input video, such as its title, images and/or audio, the generated alternative material information is high-dimensional, accurate and comprehensive, different types of highlight videos can be generated from it, and the diversified requirements of users are met.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram to which some embodiments of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of training an information fusion model according to the present application;
FIG. 3 is a flow diagram of another embodiment of a method of training an information fusion model according to the present application;
FIG. 4 is a flow diagram of one embodiment of a method of generating a highlight video according to the present application;
FIG. 5 is a flow diagram of another embodiment of a method of generating a highlight video according to the present application;
FIG. 6 is a scene diagram of a method of generating a highlight video according to the present application;
FIG. 7 is a block diagram illustrating an embodiment of an apparatus for training an information fusion model according to the present application;
FIG. 8 is a schematic block diagram of one embodiment of an apparatus for generating a highlight video according to the present application;
FIG. 9 is a block diagram of a computer system suitable for use in implementing a server or terminal according to some embodiments of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method and apparatus for training an information fusion model, or generating a highlight video, of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between any two of the terminal devices 101, 102, 103, and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various applications, such as various client applications, multi-party interactive applications, artificial intelligence applications, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102 and 103 may be hardware or software. When they are hardware, they may be various electronic devices that support document processing applications, including but not limited to smartphones, tablets, laptop and desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above, and may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server can analyze and process the received data such as the request and feed back the processing result to the terminal equipment.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules, for example, to provide distributed services, or as a single piece of software or software module. And is not particularly limited herein.
In practice, the method for training the information fusion model or generating the highlight video provided in the embodiment of the present application may be performed by the terminal device 101, 102, 103 or the server 105, and the apparatus for training the information fusion model or generating the highlight video may also be disposed in the terminal device 101, 102, 103 or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the prior art, there are two schemes for synthesizing a highlight video: (1) manually editing, reviewing and synthesizing the live video, which requires a large manual investment and is slow; (2) locating specific highlight actions in time, such as red/yellow cards in a football match, training a machine-learning model on these actions, detecting them with the trained model, and then automatically synthesizing highlights of the events of interest, in which case the synthesized video is usually a collection of many events, the analysis dimension is low, and the highlight information is poorly targeted.
Referring to FIG. 2, FIG. 2 illustrates a flow 200 of one embodiment of a method of training an information fusion model according to the present application. The method for training the information fusion model comprises the following steps:
S201, keywords of the title of an input video are acquired.
In this embodiment, an executing entity (for example, a terminal or a server in fig. 1) of the method for training the information fusion model may obtain a keyword of a title of an input video.
The keywords of the title of the input video may be obtained by any method known in the prior art or developed in the future, which is not limited in this application. For example, the keywords of the title may be obtained through a text keyword extraction model (e.g., various neural network models). As another example, the title of the video may be processed with preset rules (for example, templates) through operations such as text extraction, deletion or filtering to obtain the keywords of the title.
For example, if the title of the input video is "2014 World Cup semi-final Germany VS Brazil", keywords of the title can be extracted, including: match time: 2014; match type: World Cup semi-final; the two competing teams: Germany, Brazil.
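Purely as an illustration of the rule-based (template) route mentioned above, and not of the patent's actual implementation, the following Python sketch parses a title of this form with a hypothetical regular-expression template; the pattern and field names are assumptions.

    import re

    # Hypothetical template for ball-game titles of the form
    # "<year> <competition/stage> <team A> VS <team B>".
    TITLE_PATTERN = re.compile(
        r"(?P<year>\d{4})\s+(?P<competition>.+?)\s+(?P<team_a>\S+)\s+VS\s+(?P<team_b>\S+)",
        re.IGNORECASE,
    )

    def extract_title_keywords(title: str) -> dict:
        """Extract match time, match type and team names from a video title."""
        match = TITLE_PATTERN.search(title)
        if match is None:
            return {}
        return {
            "time": match.group("year"),
            "type": match.group("competition"),
            "teams": [match.group("team_a"), match.group("team_b")],
        }

    print(extract_title_keywords("2014 World Cup semi-final Germany VS Brazil"))
    # {'time': '2014', 'type': 'World Cup semi-final', 'teams': ['Germany', 'Brazil']}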
S202, acquiring highlight attention information from the images and/or audio of each frame of the input video.
In this embodiment, the execution subject of the method for training the information fusion model may acquire highlight attention information from the images and/or audio of each frame of the input video.
The highlight attention information may be a target image extracted from an image of each frame and/or a target audio extracted from an audio of each frame.
Alternatively, the highlight attention information may include text information, such as information related to the attention object and information related to the attention event, from which the alternative material information including the person, the event and the start-stop time may be further generated. The highlight attention information may also include text information related to hotspots, used to corroborate the attention object and/or the attention event.
The execution body may input the images and/or audio of each frame of the input video into attention information extraction models (for example, various neural network models), which output the attention information corresponding to the images and/or audio of each frame. The input of each attention information extraction model may be either the images of each frame or the audio of each frame, or may include both.
S203, determining information to be fused based on the highlight attention information and the keywords of the title.
In this embodiment, the execution body of the method for training the information fusion model may determine the information to be fused based on the highlight attention information and the keywords of the title.
The execution body may directly take the highlight attention information and the keywords of the title as the information to be fused.
Alternatively, the execution body may perform operations on the highlight attention information and the keywords of the title to obtain the information to be fused. The following briefly illustrates several possible operations:
As an example, the highlight attention information and the keywords of the title may be information of multiple modalities; the information of each modality is separately characterized to obtain a corresponding initial feature vector, and the initial feature vectors are used as the information to be fused. Alternatively, the vectors may be concatenated, or their sum or product may be calculated and used as the information to be fused.
As another example, the highlight attention information and the keywords of the title may be treated as a set, the similarity between the elements of the set may be computed, and the set may be divided into subsets according to the similarity. Alternatively, the elements of the set may be merged, deleted and so on according to the similarity.
It can be understood that the methods for determining the information to be fused based on the highlight attention information and the keywords of the title include, but are not limited to, the cases described above; those skilled in the art may determine the information to be fused according to actual requirements, as long as the information to be fused can indicate the features of the highlight attention information and the keywords of the title.
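As a minimal sketch of the concatenation route only (an illustrative assumption, not the patent's implementation), the per-modality initial feature vectors can be flattened and concatenated into a single vector of information to be fused:

    import numpy as np

    def concat_to_be_fused(feature_vectors) -> np.ndarray:
        """Concatenate per-modality feature vectors into one fused input vector."""
        return np.concatenate(
            [np.asarray(v, dtype=np.float32).ravel() for v in feature_vectors]
        )

    # Toy vectors standing in for the characterized title keywords and the
    # image-derived and audio-derived highlight attention information.
    title_vec = np.random.rand(8)
    image_vec = np.random.rand(16)
    audio_vec = np.random.rand(16)
    to_be_fused = concat_to_be_fused([title_vec, image_vec, audio_vec])
    print(to_be_fused.shape)  # (40,)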
S204, taking the information to be fused as the input of the information fusion model, taking the alternative material information including characters, events and start-stop times acquired from the input video as the expected output of the information fusion model corresponding to the information to be fused, and training an initial model of the information fusion model to obtain the trained information fusion model.
In this embodiment, after obtaining the information to be fused and the candidate material information including the person, the event, and the start-stop time acquired from the input video, the execution subject of the method for training the information fusion model may train the initial information fusion model by using the information to be fused and the candidate material information. During training, the execution main body can take the information to be fused as the input of the information fusion model, take the alternative material information as the expected output of the information fusion model, train the initial model of the information fusion model, and obtain the trained information fusion model.
The initial model of the information fusion model may be a multi-modal information fusion model, a text information fusion model, or the like.
For example, the highlight attention information may include a target image obtained from the images and a target audio obtained from the audio; correspondingly, the information to be fused may include information of three modalities, namely image, audio and text (the keywords of the title), in which case the initial model of the information fusion model may be a multi-modal information fusion model.
When the initial model of the information fusion model is a multi-modal information fusion model, the information of each modality may be separately characterized to obtain an initial feature vector corresponding to that modality as the information to be fused; the initial feature vectors are then taken as input, and a preset multi-modal fusion structure consisting of a plurality of fusion layers performs a fusion operation on them to obtain a target feature vector, thereby completing the fusion of the information of all modalities.
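A minimal, illustrative sketch of such a multi-modal fusion structure is given below in PyTorch; the layer sizes, modality names and use of simple stacked linear fusion layers are assumptions, not the patent's architecture.

    import torch
    import torch.nn as nn

    class MultiModalFusion(nn.Module):
        """Each modality is first characterized into an initial feature vector;
        a stack of fusion layers then maps the concatenation to a target vector."""

        def __init__(self, dims: dict, hidden: int = 128, out_dim: int = 64, n_layers: int = 2):
            super().__init__()
            # One projection ("characterization") head per modality.
            self.proj = nn.ModuleDict({name: nn.Linear(d, hidden) for name, d in dims.items()})
            layers, in_dim = [], hidden * len(dims)
            for _ in range(n_layers):
                layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
                in_dim = hidden
            layers.append(nn.Linear(in_dim, out_dim))
            self.fusion = nn.Sequential(*layers)

        def forward(self, inputs: dict) -> torch.Tensor:
            # Iterate over self.proj so the concatenation order is fixed.
            parts = [self.proj[name](inputs[name]) for name in self.proj]
            return self.fusion(torch.cat(parts, dim=-1))

    model = MultiModalFusion({"image": 512, "audio": 128, "text": 64})
    out = model({"image": torch.randn(2, 512),
                 "audio": torch.randn(2, 128),
                 "text": torch.randn(2, 64)})
    print(out.shape)  # torch.Size([2, 64])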
As another example, the highlight attention information obtained from the images and/or the audio may be text information, for example written in text form as "person Y at time X", "event Z at time X", and so on. Correspondingly, the information to be fused may include text information obtained from multiple dimensions, in which case the initial model of the information fusion model may be a text information fusion model.
When the initial model of the information fusion model is a text information fusion model, it may fuse at least two texts that include the same element into one text. For example, "person Y at time X" and "event Z at time X" may be fused into "person Y, event Z at time X".
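A toy sketch of this kind of text fusion, assuming each piece of text carries an explicit time key, might look as follows; the record format is an assumption made only for illustration.

    from collections import defaultdict

    def fuse_texts(records):
        """Fuse pieces of text information that share the same time element,
        e.g. (X, 'person Y') + (X, 'event Z') -> 'X person Y event Z'."""
        grouped = defaultdict(list)
        for time_key, payload in records:
            grouped[time_key].append(payload)
        return [f"{time_key} " + " ".join(payloads) for time_key, payloads in grouped.items()]

    print(fuse_texts([("12:03-12:10", "player Y"), ("12:03-12:10", "goal")]))
    # ['12:03-12:10 player Y goal']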
The method for training the information fusion model provided by the embodiment includes the steps of firstly obtaining keywords of a title according to the title of an input video, obtaining highlight attention information according to images and/or audios of frames, then determining information to be fused based on the highlight attention information and the keywords of the title, and finally training the information fusion model according to the information to be fused and alternative material information. The information to be fused is derived from multiple dimensions of the input video, and the information of the alternative materials is relatively comprehensive, so that the trained information fusion model can generate accurate and comprehensive information of the alternative materials according to the information to be fused of the multiple dimensions.
Referring to FIG. 3, FIG. 3 illustrates a flow 300 of another embodiment of a method of training an information fusion model according to the present application. The method for training the information fusion model comprises the following steps:
S301, keywords of the title of an input video are acquired.
In this embodiment, an executing entity (for example, a terminal or a server in fig. 1) of the method for training the information fusion model may obtain a keyword of a title of an input video.
The operation in this step is substantially the same as the operation in step S201 in the embodiment shown in fig. 2, and is not described again here.
S302, acquiring the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event, from the images of each frame; and/or acquiring the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot, from the audio of each frame.
In this embodiment, the execution subject of the method for training the information fusion model may acquire the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event, from the images of each frame; and/or acquire the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot, from the audio of each frame.
The execution subject may input an image of each frame of the input video into the image analysis model, and the image analysis model outputs the attention object and the start-stop time of the attention object, the attention event, and the start-stop time of the attention event. The image analysis model may include a face detection model, a human body detection model, an action detection model, a caption detection model, or the like.
Alternatively, the execution subject may input the audio of each frame of the input video into the audio analysis model, and the audio analysis model outputs the attention event and the start-stop time of the attention event, and the start-stop time of the hotspot. The audio analysis model may include a semantic recognition model, an audio boiling point detection model, and the like.
Alternatively, the execution subject may input the images and the audio of each frame of the input video into a multi-modal analysis model at the same time, and the multi-modal analysis model outputs the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot. The multi-modal analysis model may be a model that extracts information from both the images and the audio and cross-checks it. For example, the images and the audio of each frame may be input into a preset multi-modal analysis model simultaneously; the model performs motion detection on the images to obtain information related to the attention event, performs semantic recognition on the audio to obtain information related to the attention event, and outputs the attention event and the start-stop time of the attention event after fusing the two.
The start-stop time of the attention object may refer to a time corresponding to a start frame and an end frame of the attention object appearing in the input video. The start-stop time of the attention event may refer to a time corresponding to a start frame and an end frame of the attention event in the input video, and the start-stop time of the hotspot may refer to a time corresponding to a start frame and an end frame of the hotspot in the input video.
The attention object and the attention event may be specified according to the type of the input video. For example, for a ball game video, the attention objects may include players, referees, coaches and the like, and the attention events may include: goal, replayed goal, shot, replayed shot, red/yellow card, offside flag, aerial duel, penalty kick, free kick, corner kick, throw-in, score change and the like. The hotspot may be an audio hotspot or the like.
In a specific example, the attention object and the start-stop time of the attention object may take the form: "time T1-T2, player of a certain number in a certain team". The attention event and the start-stop time of the attention event may take the form: "time T1-T2, goal"; "time T3-T4, aerial duel". The hotspot and the start-stop time of the hotspot may take the form: "time T1-T2, hotspot".
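For illustration only, such records could be held in code with a small data structure like the following; the field names are assumptions, not a format defined by the patent.

    from dataclasses import dataclass

    @dataclass
    class AttentionRecord:
        """One piece of highlight attention information with its start-stop time."""
        kind: str    # "object", "event" or "hotspot"
        label: str   # e.g. "team A, player No. 10", "goal", "audio hotspot"
        start: float # start time in seconds within the input video
        end: float   # end time in seconds within the input video

    records = [
        AttentionRecord("object", "team A, player No. 10", 720.0, 735.0),
        AttentionRecord("event", "goal", 725.0, 733.0),
        AttentionRecord("hotspot", "audio hotspot", 724.0, 731.0),
    ]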
Through this step, attention events can be obtained from the images, from the audio, or from both, and attention events from different sources can verify one another, so that the trained information fusion model is more accurate. Moreover, acquiring the corresponding highlight attention information from the audio and/or images of each frame allows the input video to be analyzed in parallel from multiple dimensions, shortening the video processing time.
And S303, determining information to be fused based on the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot and the keyword of the title.
In this embodiment, the executing body of the method for training the information fusion model may determine the information to be fused based on the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot, and the keyword of the title.
The execution subject may directly use the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot as the information to be fused.
Alternatively, the information to be fused may be obtained by operating, based on a preset rule, on the attention object and its start-stop time, the attention event and its start-stop time, the hotspot and its start-stop time, and the keywords of the title. Several possible operations are shown below:
as an example, the start and end times of the attention object and the attention object, the start and end times of the attention event and the attention event, the start and end times of the hotspot and the hotspot, and the keyword of the title may be vectorized, and then the obtained vectors are spliced, or the sum and product of the vectors are obtained, as the information to be fused.
As another example, the start and end times of the attention object and the attention object, the start and end times of the attention event and the attention event, the start and end times of the hotspot and the hotspot, and the keyword of the title may be used as a set, the similarity between the elements in the set is obtained, and the subset is divided according to the similarity. Or, the elements in the set may be merged, deleted, and the like according to the similarity.
It can be understood that the methods for determining the information to be fused based on the attention object and its start-stop time, the attention event and its start-stop time, the hotspot and its start-stop time, and the keywords of the title include, but are not limited to, the cases described above; those skilled in the art may determine the information to be fused according to actual needs, as long as the information to be fused can indicate the features of the highlight attention information and the keywords of the title.
S304, taking the information to be fused as the input of the information fusion model, taking the alternative material information including characters, events and start-stop times acquired from the input video as the expected output of the information fusion model corresponding to the information to be fused, and training an initial model of the information fusion model to obtain the trained information fusion model.
In this embodiment, after obtaining the information to be fused and the candidate material information including the person, the event, and the start-stop time acquired from the input video, the execution subject of the method for training the information fusion model may train the initial information fusion model by using the information to be fused and the candidate material information. During training, the execution main body can take the information to be fused as the input of the information fusion model, take the alternative material information as the expected output of the information fusion model, train the initial model of the information fusion model, and obtain the trained information fusion model.
For example, the information to be fused may be text information determined based on the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot, and the keywords of the title. The alternative material information may be text information including characters, events and start-stop times. The initial model of the information fusion model may be a text information fusion model.
The initial model of the information fusion model may be a model that fuses at least two texts including the same element into one text. For example, the information to be fused may be "person Y at time X" and "event Z at time X", and the alternative material information may be "person Y, event Z at time X".
According to the method for training the information fusion model, information to be fused is determined based on the acquired keywords of the title, the start and stop times of the attention object and the attention object, the start and stop times of the attention event and the attention event, and the start and stop times of the hotspot and the hotspot, and then the information fusion model is trained according to the information to be fused and the alternative material information. The information to be fused is derived from multiple dimensions of the input video, and the information of the alternative materials is relatively comprehensive, so that the trained information fusion model can generate accurate and comprehensive information of the alternative materials according to the information to be fused of the multiple dimensions. Corresponding collection attention information is obtained through the audio and/or images of each frame, input videos can be analyzed in parallel from multiple dimensions, and the video processing time length is shortened.
In some optional implementations of step S302 of the above embodiment shown in fig. 3, the acquiring the attention object and the start-stop time of the attention object according to the image of each frame includes:
inputting the image of each frame into a preset face detection model to obtain the face frames output by the face detection model and their start-stop times; inputting the image of each frame into a preset human body detection model to obtain the human body detection frames output by the human body detection model and their start-stop times; and determining the attention object and the start-stop time of the attention object based on the face frames and their start-stop times and the human body detection frames and their start-stop times.
The input of the face detection model is the image of each frame, and the output is the face frame and the start-stop time of the face frame. The attention object is different in different videos, and the output is also different. For example, in a football game, the face frames output may be the face frames of players, referees and coaches.
The input of the human body detection model is the image of each frame, and the output is the human body detection frames and their start-stop times. The human body detection model may identify the corresponding human body frame through body-shape recognition, jersey-style recognition, recognition of the player name on the jersey, recognition of the jersey number, and the like.
The human face detection model and the human body detection model can be any deep learning model, and can be various neural network models, for example.
Determining the attention object and the start-stop time of the attention object based on the face frames and their start-stop times and the human body detection frames and their start-stop times may include: fusing the start-stop times of the face frames and of the human body detection frames, and obtaining the attention object and its start-stop time through mutual correction and complementation of the face frames and the human body detection frames.
Further, the face frames and their start-stop times and the human body detection frames and their start-stop times may be further confirmed using information related to the attention object that appears in the video (such as a player introduction picture shown when the match starts, or a substitution picture shown when players are changed), which improves the accuracy of the attention object and its start-stop time.
In this optional implementation, the face frames and their start-stop times and the human body detection frames and their start-stop times are obtained separately and then fused to obtain the attention object and its start-stop time, so that the attention object information can be obtained accurately and comprehensively, avoiding errors and omissions.
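The sketch below illustrates one simple, assumed way to fuse face tracks and body tracks by temporal overlap; the patent does not prescribe this particular rule, and the identities and spans are illustrative.

    def overlap(a, b):
        """Length of the temporal overlap between two (start, end) intervals."""
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    def fuse_object_tracks(face_tracks, body_tracks, min_overlap=1.0):
        """Merge face and body detections whose time spans overlap, letting the two
        sources correct and complement each other: the merged span is the union of
        the two, and the identity comes from whichever track carries it."""
        objects = []
        for face_id, f_span in face_tracks:
            for body_id, b_span in body_tracks:
                if overlap(f_span, b_span) >= min_overlap:
                    span = (min(f_span[0], b_span[0]), max(f_span[1], b_span[1]))
                    objects.append((face_id or body_id, span))
        return objects

    face_tracks = [("player No. 10", (720.0, 730.0))]
    body_tracks = [("player No. 10", (722.0, 735.0))]
    print(fuse_object_tracks(face_tracks, body_tracks))
    # [('player No. 10', (720.0, 735.0))]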
In some optional implementations of step S302 of the above embodiment shown in fig. 3, the above acquiring the event of interest and the start-stop time of the event of interest according to the image of each frame includes at least one of: inputting the image of each frame into a preset caption detection model to obtain caption information output by the caption detection model, filtering the caption information and extracting key information to obtain an attention event and the starting and ending time of the attention event; and inputting each frame image into a preset action detection model to obtain an attention event output by the action detection model and the start-stop time of the attention event.
For the input video, the subtitles in each frame of image include rich information, such as the real-time score information in the upper right corner of a football match, the substitution information shown at the bottom of the image, red and yellow card information, goal information, and so on. Therefore, the attention event and its start-stop time can be acquired from the information related to the attention event contained in the subtitles.
After the subtitle information is acquired through the subtitle detection model, it contains a large amount of information unrelated to the attention event in addition to the information related to it. The subtitle information therefore needs to be filtered and its key information extracted to obtain the attention event and the start-stop time of the attention event.
The subtitle detection model may be an Optical Character Recognition (OCR) model or another neural network model. The subtitle information may be filtered and its key information extracted through algorithms such as TextRank and LexRank.
By detecting the motion corresponding to the event of interest, for example, detecting the motion of a goal, a shot, an air fight, etc., when the corresponding motion is detected, the event of interest and the start-stop time of the event of interest can be specified.
In the optional implementation manner, through subtitle detection, filtering of subtitle information and key information extraction, the event of interest and the start-stop time of the event of interest can be obtained from the content related to the event of interest in the subtitle, and/or the event of interest and the start-stop time of the event of interest are obtained through an action detection model, so that the dimensionality of the information to be fused is improved.
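The following rough sketch illustrates the subtitle route with off-the-shelf tools: OpenCV reads frames, pytesseract stands in for the preset caption detection model, and a simple keyword list stands in for TextRank/LexRank-style filtering; the caption-strip location and the term list are assumptions, not the patent's method.

    import cv2
    import pytesseract

    # Terms whose presence marks a caption as related to an event of interest
    # (illustrative list for a football broadcast).
    EVENT_TERMS = ("goal", "red card", "yellow card", "substitution", "penalty")

    def captions_of_interest(video_path: str, sample_every: int = 25):
        """OCR the caption strip of sampled frames and keep only captions that
        mention an event of interest, together with the frame timestamp."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        results, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % sample_every == 0:
                strip = frame[int(frame.shape[0] * 0.85):]  # bottom caption area
                text = pytesseract.image_to_string(strip).strip()
                if any(term in text.lower() for term in EVENT_TERMS):
                    results.append((idx / fps, text))
            idx += 1
        cap.release()
        return results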
In some optional implementations of step S302 of the above embodiment shown in fig. 3, the acquiring, according to the audio of each frame, the event of interest and the start-stop time of the event of interest includes: inputting the audio frequency of each frame into a preset voice recognition model to obtain character information which is output by the voice recognition model and corresponds to the audio frequency; and extracting keywords from the text information to obtain the attention event and the start-stop time of the attention event.
The audio of each frame of the input video will typically include information related to the attention event. For example, an announcer's commentary in a ball game covers the various events on the field. Therefore, the text information corresponding to the audio includes information related to the attention event.
The text information corresponding to the audio includes a large amount of irrelevant information in addition to the information related to the event of interest. Therefore, keyword extraction is required for the text information to acquire the event of interest and the start-stop time of the event of interest from the text information.
The speech recognition model may be an acoustic model (AM) and/or a language model (LM). Keyword extraction may be performed on the text information through algorithms such as TextRank and LexRank.
In the optional implementation mode, through voice recognition and keyword extraction of text information, the attention event and the start-stop time of the attention event can be obtained from the content related to the attention event in the audio of each frame, and further the dimensionality of the information to be fused is improved.
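Assuming an upstream speech recognition model has already produced timestamped transcript segments, a simple illustrative keyword lookup can map them to attention events with start-stop times; a real system would use TextRank-style keyword extraction rather than this hand-written table.

    # Illustrative trigger words -> attention events for a football broadcast.
    EVENT_KEYWORDS = {"goal": "goal", "penalty": "penalty kick", "corner": "corner kick"}

    def events_from_transcript(segments):
        """Map timestamped ASR segments (start, end, text) to attention events."""
        events = []
        for start, end, text in segments:
            lowered = text.lower()
            for trigger, event in EVENT_KEYWORDS.items():
                if trigger in lowered:
                    events.append({"event": event, "start": start, "end": end})
        return events

    segments = [(725.0, 733.0, "What a strike, that is a brilliant goal!")]
    print(events_from_transcript(segments))
    # [{'event': 'goal', 'start': 725.0, 'end': 733.0}]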
In some optional implementation manners of step S302 in the embodiment shown in fig. 3, the acquiring the hot spot and the start-stop time of the hot spot according to the audio of each frame includes: and inputting the audio of each frame into a preset boiling point detection model to obtain a hot point output by the boiling point detection model and the starting and stopping time of the hot point.
The boiling point detection model can be any deep learning model, such as various neural network models.
In this optional implementation, the hotspot and its start-stop time can be conveniently obtained through the boiling point detection model, and the attention event can be verified against the hotspot, which further improves the dimension of the information to be fused and the accuracy of the trained information fusion model.
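The boiling point detection model in the patent is a learned model; purely as a stand-in sketch, loudness spikes can be flagged as audio hotspots with librosa, merging consecutive hot seconds into start-stop spans. The frame length and threshold are assumptions.

    import librosa

    def audio_hotspots(audio_path: str, frame_sec: float = 1.0, z_thresh: float = 2.0):
        """Flag seconds whose RMS loudness is far above the mean as hotspots,
        and merge consecutive flagged seconds into (start, end) spans."""
        y, sr = librosa.load(audio_path, sr=None, mono=True)
        hop = int(sr * frame_sec)
        rms = librosa.feature.rms(y=y, frame_length=hop, hop_length=hop)[0]
        z = (rms - rms.mean()) / (rms.std() + 1e-8)
        spans, start = [], None
        for i, hot in enumerate(z > z_thresh):
            if hot and start is None:
                start = i * frame_sec
            elif not hot and start is not None:
                spans.append((start, i * frame_sec))
                start = None
        if start is not None:
            spans.append((start, len(z) * frame_sec))
        return spans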
In some optional implementations of step S303 of the above-described embodiment shown in fig. 3, the information to be fused is determined based on the highlight attention information, the confidence of the highlight attention information, and the keyword of the title. The alternative material information in step S304 further includes at least one of: the confidence level of the person, the confidence level of the event, the confidence level of the start-stop time, and the overall confidence level of the alternative story information.
The confidence of the highlight attention information may be an annotated confidence, or the confidence output by the model that produces the highlight attention information. The confidences in the alternative material information may be derived from the confidences of the corresponding highlight attention information.
In this optional implementation, confidences are attached to both the input and the output of the information fusion model, so that the trained information fusion model can output alternative material information that includes confidences from input that includes confidences, which makes it convenient to select the alternative material information with higher confidence when generating the highlight video, improving accuracy.
Referring to fig. 4, fig. 4 shows a flow 400 of one embodiment of a method of generating a highlight video according to the present application. The method for generating the highlight video comprises the following steps:
S401, keywords of the title of the input video are obtained.
In this embodiment, an execution subject (e.g., a terminal or a server in fig. 1) of the method for generating the highlight video may obtain a keyword of a title of the input video.
S402, acquiring highlight attention information from the images and/or audio of each frame of the input video.
In this embodiment, the execution main body of the method for generating the highlight video may obtain the highlight attention information according to the image and/or the audio of each frame of the input video.
S403, determining information to be fused based on the highlight attention information and the keywords of the title.
In this embodiment, the execution subject of the method for generating the highlight video may determine the information to be fused based on the highlight attention information and the keywords of the title.
The operations of the steps S401 to S403 are substantially the same as the operations of the steps S201 to S203 in the embodiment shown in fig. 2, and are not repeated herein.
S404, inputting the information to be fused into the information fusion model trained by the method for training the information fusion model in the embodiment, wherein the information fusion model outputs a plurality of pieces of alternative material information including characters, events and start-stop time.
In this embodiment, the executing agent of the method for generating the highlight video may input the information to be fused into the information fusion model trained by the method for training the information fusion model according to the above embodiment, and the information fusion model outputs a plurality of pieces of alternative material information including characters, events, and start-stop times.
For example, the information to be fused may be information of multiple modalities, or may be text information obtained from multiple modalities. Correspondingly, the information fusion model can be a multi-modal information fusion model or a text information fusion model.
When the information fusion model is a multi-modal information fusion model, the information of each modality may be separately characterized to obtain an initial feature vector corresponding to that modality as the information to be fused; the initial feature vectors are then taken as input, and a preset multi-modal fusion structure consisting of a plurality of fusion layers performs a fusion operation on them to obtain a target feature vector, thereby completing the fusion of the information of all modalities and outputting alternative material information including the character, the event and the start-stop time.
When the information fusion model is a text information fusion model, it may fuse at least two texts including the same element into one text and output alternative material information including the character, the event and the start-stop time. For example, "person Y at time X" and "event Z at time X" may be fused into "person Y, event Z at time X".
S405, determining target material information from the alternative material information, cutting the videos within the time periods corresponding to the target material information out of the input video as target segments, and splicing the target segments into a highlight video.
In this embodiment, the execution body of the method for generating the highlight video may determine the target material information from the alternative material information, cut the video within the time period corresponding to the target material information out of the input video as a target segment, and splice the target segments into the highlight video.
When determining the target material information from the alternative material information, all the alternative material information may be used as the target material information; alternatively, the target material information may be determined from the alternative material information based on a preset rule according to the type of highlight video to be generated, as long as the generated highlight video meets the requirement.
Because the alternative material information includes the start-stop time, after the target material information is determined, the target segments corresponding to the target material information can be cut out of the input video according to the start-stop times in the target material information, and the target segments can then be spliced to obtain the highlight video.
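A minimal sketch of the cutting-and-splicing step using moviepy (1.x API) is shown below; the material-information dictionary format and the file names are assumptions made for illustration.

    from moviepy.editor import VideoFileClip, concatenate_videoclips

    def build_highlight(input_path: str, target_materials, output_path: str):
        """Cut the spans given by the target material information's start-stop
        times out of the input video and splice them into one highlight video."""
        source = VideoFileClip(input_path)
        clips = [source.subclip(m["start"], m["end"]) for m in target_materials]
        highlight = concatenate_videoclips(clips)
        highlight.write_videofile(output_path)
        source.close()

    # Hypothetical target material information with start-stop times in seconds.
    materials = [
        {"person": "player No. 10", "event": "goal", "start": 720.0, "end": 740.0},
        {"person": "player No. 10", "event": "shot", "start": 1510.0, "end": 1525.0},
    ]
    # build_highlight("match.mp4", materials, "highlight.mp4")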
In the method for generating the highlight video provided by this embodiment, keywords of the title are first obtained from the title of the input video and highlight attention information is obtained from the images and/or audio of each frame; information to be fused is then determined based on the highlight attention information and the keywords of the title; finally, alternative material information is obtained from the information to be fused and the highlight video is generated from the alternative material information. Because the information to be fused is derived from multiple dimensions of the input video, the generated alternative material information is high-dimensional, accurate and comprehensive, and different types of highlight videos can be generated from it, meeting the diversified requirements of users.
In some optional implementations of step S405 of the embodiment shown in fig. 4, the target material information is determined from the alternative material information based on the following rule: alternative material information including the target person and/or the target event is determined as the target material information.
The execution subject may determine the alternative material information including the target person and/or the target event as the target material information according to the type of highlight video to be generated. For example, for a ball game video, when a highlight video of a certain player is to be generated, the alternative material information including that player (the target person) is used as the target material information; when a goal highlight is to be generated, the alternative material information including a goal (the target event) is used as the target material information; and when a highlight of a certain player's goals is to be generated, the alternative material information including both that player (the target person) and a goal (the target event) is used as the target material information.
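The selection rule can be sketched as a simple filter over the alternative material information; the dictionary keys used here are assumptions.

    def select_targets(candidates, target_person=None, target_event=None):
        """Keep alternative material information that mentions the target person
        and/or the target event (both conditions apply when both are given)."""
        selected = []
        for item in candidates:
            if target_person is not None and item.get("person") != target_person:
                continue
            if target_event is not None and item.get("event") != target_event:
                continue
            selected.append(item)
        return selected

    candidates = [
        {"person": "player No. 10", "event": "goal", "start": 720.0, "end": 740.0},
        {"person": "player No. 7", "event": "goal", "start": 2300.0, "end": 2315.0},
    ]
    # Goal highlights of one player: require both the person and the event.
    print(select_targets(candidates, target_person="player No. 10", target_event="goal"))
    # [{'person': 'player No. 10', 'event': 'goal', 'start': 720.0, 'end': 740.0}]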
Through this optional implementation, the corresponding highlight video can be generated conveniently and quickly according to the user's requirements.
In some optional implementations of step S405 of the above-described embodiment shown in fig. 4, the target material information is determined from the alternative material information based on the following rules:
determining alternative material information meeting at least one of the following conditions as target material information: the alternative material information comprises a target person, and the confidence coefficient of the target person is greater than a first preset threshold value and/or the overall confidence coefficient is greater than a second preset threshold value; and the candidate material information comprises target events, and the confidence coefficient of the target events is greater than a third preset threshold value and/or the overall confidence coefficient is greater than a fourth preset threshold value.
The confidence of the target material information determined by the optional implementation mode is high, and therefore the accuracy of the generated highlight video is high.
The first, second, third and fourth preset thresholds may be set by those skilled in the art according to the actual situation, as long as the determined target material information is as accurate as possible.
Or the first preset threshold, the second preset threshold, the third preset threshold and the fourth preset threshold are determined based on the duration of the highlight video to be generated. That is, the longer the duration of the highlight video to be generated is, the more the required target segments and target material information are, and the lower the value of the preset threshold is correspondingly. The shorter the duration of the highlight video to be generated is, the higher the value of the preset threshold value is correspondingly.
With this arrangement, the duration of the generated highlight video is controllable, and whenever the target material information is determined, it is the part with the higher confidence, so the generated highlight video is more accurate.
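One simple way (an assumption, not something specified by the patent) to derive such a threshold from the desired highlight duration is to rank the alternative material by confidence and lower the threshold just until the duration budget is covered:

    def threshold_for_duration(candidates, target_seconds):
        """Lower the confidence threshold just enough that the selected clips
        add up to at least the desired highlight duration."""
        ranked = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
        total, threshold = 0.0, 1.0
        for item in ranked:
            total += item["end"] - item["start"]
            threshold = item["confidence"]
            if total >= target_seconds:
                break
        return threshold

    candidates = [
        {"event": "goal", "confidence": 0.95, "start": 720.0, "end": 740.0},
        {"event": "shot", "confidence": 0.80, "start": 1510.0, "end": 1525.0},
        {"event": "corner kick", "confidence": 0.60, "start": 1900.0, "end": 1910.0},
    ]
    print(threshold_for_duration(candidates, target_seconds=30))  # 0.8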
Referring to fig. 5, fig. 5 shows a flow 500 of another embodiment of a method of generating a highlight video according to the present application. The method for generating the highlight video comprises the following steps:
S501, acquiring a keyword of the title of the input video.
In this embodiment, an execution subject (e.g., a terminal or a server in fig. 1) of the method for generating the highlight video may obtain a keyword of a title of the input video.
S502, acquiring the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event according to the images of the frames; and/or acquiring the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot according to the audio of each frame.
In this embodiment, the execution subject of the method for generating the highlight video may acquire the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event according to the images of the frames; and/or acquire the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot according to the audio of each frame.
Through this step, the attention events can be obtained from the images, the audio, or both, so the information sources are wide and the information to be fused is more comprehensive; attention events from different sources can also verify each other, which makes the generated alternative material information more comprehensive. In addition, the corresponding highlight attention information is acquired from the audio and/or images of each frame, so it can be acquired synchronously from multiple dimensions, which shortens the time for acquiring the highlight attention information and improves efficiency.
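Since the image track and the audio track of the input video can be analyzed independently, the multi-dimensional extraction described above can run concurrently. The sketch below is an assumption for illustration using Python's standard concurrent.futures; analyze_frames and analyze_audio are hypothetical placeholders, not models defined by this application.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_frames(frames):
    """Hypothetical image-side analysis: attention objects and attention events
    with their start-stop times (stands in for the detection models)."""
    return {"objects": [], "events": []}

def analyze_audio(audio):
    """Hypothetical audio-side analysis: attention events and hotspots with
    their start-stop times."""
    return {"events": [], "hotspots": []}

def collect_highlight_attention_info(frames, audio):
    # Analyze both dimensions at the same time, then merge the results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        image_future = pool.submit(analyze_frames, frames)
        audio_future = pool.submit(analyze_audio, audio)
        return {"image": image_future.result(), "audio": audio_future.result()}
```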
S503, determining information to be fused based on the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot, and the keyword of the title.
In this embodiment, the execution subject of the method for generating the highlight video may determine the information to be fused based on the attention object and the start-stop time of the attention object, the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot, and the keyword of the title.
The operations of the steps S501 to S503 are substantially the same as the operations of the steps S301 to S303 in the embodiment shown in fig. 3, and are not described again here.
S504, inputting the information to be fused into the trained information fusion model, and outputting a plurality of pieces of alternative material information including characters, events and start-stop time by the information fusion model.
In this embodiment, the execution subject of the method for generating the highlight video may input the information to be fused into the information fusion model trained by the method for training the information fusion model according to the above embodiment, and the information fusion model outputs a plurality of pieces of alternative material information including characters, events, and start-stop times.
Illustratively, the information fusion model may be a text information fusion model that fuses at least two texts including the same element into one text and outputs alternative material information including a person, an event, and a start-stop time. For example, "time X, person Y" and "time X, event Z" may be fused into "time X, person Y, event Z", which is output as one piece of alternative material information.
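For readability, the kind of text fusion described in this example can be sketched as a rule-based merge on identical start-stop times; note that the information fusion model of this application is a trained model, so the snippet below is only an illustrative assumption.

```python
from collections import defaultdict

def fuse_by_time(items):
    """items: dicts such as {"start": 10.0, "end": 25.0, "person": "Y"} or
    {"start": 10.0, "end": 25.0, "event": "Z"}.  Entries that share the same
    start-stop time are merged into one piece of alternative material information."""
    merged = defaultdict(dict)
    for item in items:
        merged[(item["start"], item["end"])].update(item)
    return list(merged.values())

# "time X, person Y" + "time X, event Z"  ->  "time X, person Y, event Z"
print(fuse_by_time([
    {"start": 10.0, "end": 25.0, "person": "Y"},
    {"start": 10.0, "end": 25.0, "event": "Z"},
]))
# -> [{'start': 10.0, 'end': 25.0, 'person': 'Y', 'event': 'Z'}]
```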
And S505, determining target material information from the alternative material information, intercepting videos in the time period corresponding to the target material information from the input video as target segments, and splicing the target segments into a highlight video.
In this embodiment, the execution subject of the method for generating the highlight video may determine the target material information from the alternative material information, intercept the video within the time period corresponding to the target material information from the input video as a target segment, and splice the target segments into the highlight video.
Because the alternative material information includes the start-stop time, after the target material information is determined, the target segment corresponding to the target material information can be intercepted from the input video according to the start-stop time in the target material information, and the target segments can then be spliced to obtain the highlight video. Through these steps, the corresponding highlight video can be conveniently and rapidly generated according to user requirements.
The operation of this step is substantially the same as the operation of step S405 in the embodiment shown in fig. 4, and is not described again here.
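The clipping and splicing in step S505 can be sketched with an off-the-shelf tool. The application does not prescribe any particular tool; the example below assumes ffmpeg is available on the system and uses its standard trim and concat-demuxer invocations, with hypothetical file names.

```python
import os
import subprocess
import tempfile

def cut_and_splice(input_video: str, segments, output_video: str) -> None:
    """segments: list of (start_s, end_s) pairs taken from the start-stop times
    in the target material information.  Each segment is cut out of the input
    video and the clips are concatenated into one highlight video."""
    clip_paths = []
    for i, (start, end) in enumerate(segments):
        clip = f"clip_{i}.mp4"
        subprocess.run(["ffmpeg", "-y", "-i", input_video,
                        "-ss", str(start), "-to", str(end),
                        "-c", "copy", clip], check=True)
        clip_paths.append(os.path.abspath(clip))

    # The concat demuxer reads a small text file listing the clips in order.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{p}'\n" for p in clip_paths)
        list_file = f.name

    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_file, "-c", "copy", output_video], check=True)

# cut_and_splice("match.mp4", [(120.0, 135.0), (980.0, 1010.0)], "highlights.mp4")
```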
And S506, generating a title and/or an abstract of the highlight video according to the target material information and the keywords of the title.
In this embodiment, the execution subject of the method for generating the highlight video may generate the title and/or the abstract of the highlight video according to the target material information and the keywords of the title.
The generated title and/or abstract may include the keywords of the title, and the target person and/or target event in the target material information. For example, it may be "Country A vs. Country B match: highlight moments of a certain star player".
Through this step, the title and/or abstract of the generated highlight video does not need to be labeled manually, which saves labor. Moreover, the user can conveniently search for content of interest through the title of the highlight video, which gives better pertinence and a better user experience.
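As an illustration of this template-style generation of titles and abstracts, a minimal sketch is given below; the field names (teams, person, event) and the phrasing templates are assumptions, not the method actually claimed here.

```python
def make_title_and_abstract(title_keywords: dict, target_items: list):
    """Combine the keywords of the original title with the target person and/or
    target event found in the target material information."""
    teams = " VS ".join(title_keywords.get("teams", []))
    persons = sorted({it["person"] for it in target_items if it.get("person")})
    events = sorted({it["event"] for it in target_items if it.get("event")})
    title = f"{teams}: {', '.join(persons)} {', '.join(events)} highlights".strip()
    abstract = (f"Highlights of {', '.join(events) or 'key moments'} "
                f"by {', '.join(persons) or 'featured players'} in {teams}.")
    return title, abstract

print(make_title_and_abstract(
    {"teams": ["Germany", "Brazil"]},
    [{"person": "Player A", "event": "goal", "start": 120.0, "end": 135.0}],
))
```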
According to the method for generating the highlight video provided by this embodiment, information to be fused is determined based on the acquired keywords of the title, the attention object and its start-stop time, the attention event and its start-stop time, and the hotspot and its start-stop time; alternative material information is then obtained from the information to be fused, and the highlight video is generated from the alternative material information. Because the information to be fused is derived from multiple dimensions of the input video, the generated alternative material information covers many dimensions and is accurate and comprehensive, so that different types of highlight videos, as well as titles and abstracts, can be generated to meet diversified user requirements. Moreover, the input video can be analyzed in parallel from multiple dimensions at the same time, which shortens the video processing time.
Referring to fig. 6, fig. 6 shows a schematic diagram 600 of one scene of a method of generating a highlight video according to the present application. In this scenario, the method for generating the highlight video may be applied to a ball game video being played in a video player, and the input video is that ball game video. The video player 601 establishes a connection with the server 602, a system for generating the highlight video runs in the server 602, the system runs a program that executes the method for generating the highlight video, and the server 602 may execute the following steps:
S6021, acquiring the keywords of the title of the ball game video played in the video player 601.
And S6022, acquiring the highlight attention information according to the image and/or audio of each frame of the ball game video.
And S6023, determining information to be fused based on the highlight attention information and the keywords of the title.
And S6024, inputting the information to be fused into the trained information fusion model, wherein the information fusion model outputs a plurality of pieces of alternative material information including characters, events and start-stop time.
And S6025, determining target material information from the alternative material information, capturing videos in a time period corresponding to the target material information from the ball game videos as target segments, and splicing the target segments into a collection video.
In this scenario, the title of the ball game may be "2014 World Cup semi-final Germany VS Brazil", and the keywords of the title may be extracted as: match time: 2014; match type: semi-final; team names: Germany, Brazil. The attention object and the start-stop time of the attention object may be "time T1 to T2, a certain player of a certain team, with a certain jersey number". The attention event and the start-stop time of the attention event may be "time T1 to T2, goal". The hotspot and the start-stop time of the hotspot may be "time T1 to T2, hotspot". The alternative material information generated accordingly is "time T1 to T2, a certain player of a certain team, with a certain jersey number, scores a goal".
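For readability, a hypothetical sketch of extracting the keywords of such a title is given below; the regular expressions and field names are illustrative assumptions only, not the extraction method of this application.

```python
import re

def extract_title_keywords(title: str) -> dict:
    """Very small, assumption-laden parser for titles such as
    '2014 World Cup semi-final Germany VS Brazil'."""
    year = re.search(r"\b(19|20)\d{2}\b", title)
    teams = re.search(r"(\S+)\s+VS\s+(\S+)", title, flags=re.IGNORECASE)
    stage = re.search(r"(semi-final|final|quarter-final|group stage)",
                      title, flags=re.IGNORECASE)
    return {
        "match_time": year.group(0) if year else None,
        "match_type": stage.group(0) if stage else None,
        "teams": list(teams.groups()) if teams else [],
    }

print(extract_title_keywords("2014 World Cup semi-final Germany VS Brazil"))
# -> {'match_time': '2014', 'match_type': 'semi-final', 'teams': ['Germany', 'Brazil']}
```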
Target material information is then determined from the multiple pieces of alternative material information according to a preset rule, and the highlight video is generated accordingly.
In the method for generating the highlight video provided in this scenario, keywords of the title are first obtained from the title of the input video, and highlight attention information is obtained from the images and/or audio of each frame; information to be fused is then determined based on the highlight attention information and the keywords of the title; finally, alternative material information is obtained from the information to be fused, and the highlight video is generated from the alternative material information. Because the information to be fused is derived from multiple dimensions of the input video, the generated alternative material information covers many dimensions and is accurate and comprehensive, so that different types of highlight videos can be generated to meet diversified user requirements.
Referring to fig. 7, fig. 7 illustrates a structure 700 of an embodiment of an apparatus for training an information fusion model according to the present application.
The device for training the information fusion model comprises:
a title keyword acquisition module 701 configured to acquire a keyword of a title of an input video;
a highlight attention information obtaining module 702 configured to obtain highlight attention information according to an image and/or audio of each frame of an input video;
the information to be fused determining module 703 is configured to determine information to be fused based on the highlight attention information and the keyword of the title;
and the model training module 704 is configured to use the information to be fused as the input of the information fusion model, use the alternative material information including the characters, the events and the start-stop time, which is acquired from the input video, as the expected output of the information fusion model corresponding to the information to be fused, train the initial model of the information fusion model, and obtain the trained information fusion model.
In this embodiment, in the apparatus 700 for training an information fusion model, the specific processes of the title keyword acquisition module 701, the highlight attention information obtaining module 702, the information to be fused determining module 703, and the model training module 704, and their technical effects, may respectively refer to the related descriptions of steps S201 to S204 in the embodiment shown in fig. 2, and are not repeated here.
In some optional implementations of this embodiment, the highlight attention information obtaining module 702 includes at least one of the following:
a first acquisition sub-module (not shown in the figure) configured to acquire a start-stop time of an attention object and an attention object, and a start-stop time of an attention event and an attention event from images of respective frames;
and a second obtaining sub-module (not shown in the figure) configured to obtain the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot according to the audio of each frame.
In some optional implementations of this embodiment, the first acquisition sub-module is further configured to: input the image of each frame into a preset face detection model, the face detection model outputting a face frame and the start-stop time of the face frame; input the image of each frame into a preset human body detection model, the human body detection model outputting a human body detection frame and the start-stop time of the human body detection frame; and determine the attention object and the start-stop time of the attention object based on the face frame and the start-stop time of the face frame, and the human body detection frame and the start-stop time of the human body detection frame.
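The description does not fix how the face frame and the human body detection frame are associated, so the sketch below assumes a simple rule for illustration: a face whose centre falls inside a body detection frame belongs to the same attention object, and the start-stop time of the attention object spans both detections.

```python
def center_inside(face_box, body_box):
    """True if the centre of the face box lies inside the body detection box.
    Boxes are (x1, y1, x2, y2)."""
    cx = (face_box[0] + face_box[2]) / 2.0
    cy = (face_box[1] + face_box[3]) / 2.0
    return body_box[0] <= cx <= body_box[2] and body_box[1] <= cy <= body_box[3]

def attention_objects(face_tracks, body_tracks):
    """face_tracks / body_tracks: lists of dicts with 'box', 'start' and 'end'.
    A face whose centre falls inside a body box is treated as one attention
    object whose start-stop time spans both detections."""
    objects = []
    for face in face_tracks:
        for body in body_tracks:
            if center_inside(face["box"], body["box"]):
                objects.append({
                    "box": body["box"],
                    "start": min(face["start"], body["start"]),
                    "end": max(face["end"], body["end"]),
                })
    return objects
```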
The device for training the information fusion model provided by this embodiment first obtains keywords of a title according to a title of an input video, obtains highlight attention information according to an image and/or audio of each frame, then determines information to be fused based on the highlight attention information and the keywords of the title, and finally trains the information fusion model according to the information to be fused and alternative material information. The information to be fused is derived from multiple dimensions of the input video, and the information of the alternative materials is relatively comprehensive, so that the trained information fusion model can generate accurate and comprehensive information of the alternative materials according to the information to be fused of the multiple dimensions.
Referring to fig. 8, fig. 8 shows a structure 800 of an embodiment of an apparatus for generating a highlight video according to the present application.
The device for generating the collection video comprises:
a title keyword acquisition module 801 configured to acquire keywords of a title of an input video;
a highlight attention information acquisition module 802 configured to acquire highlight attention information from images and/or audio of each frame of an input video;
a to-be-fused information determining module 803 configured to determine to-be-fused information based on the highlight attention information and the keyword of the title;
the alternative material information generating module 804 is configured to input information to be fused into the trained information fusion model, and the information fusion model outputs a plurality of pieces of alternative material information including characters, events and start-stop time;
and the highlight video generation module 805 is configured to determine target material information from the alternative material information, intercept videos in a time period corresponding to the target material information from the input video as target segments, and splice the target segments into a highlight video.
In this embodiment, in the apparatus 800 for generating a highlight video, the specific processes of the title keyword acquisition module 801, the highlight attention information acquisition module 802, the information to be fused determining module 803, the alternative material information generating module 804, and the highlight video generation module 805, and their technical effects, may respectively refer to the relevant descriptions of steps S401 to S405 in the embodiment shown in fig. 4, and are not described here again.
In some optional implementations of this embodiment, the highlight attention information obtaining module 802 includes at least one of the following:
a first acquisition sub-module (not shown in the figure) configured to acquire a start-stop time of an attention object and an attention object, and a start-stop time of an attention event and an attention event from images of respective frames;
and a second obtaining sub-module (not shown in the figure) configured to obtain the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot according to the audio of each frame.
In some optional implementations of this embodiment, the apparatus for generating a highlight video further includes:
and a title generation module (not shown in the figure) configured to generate the title and/or the abstract of the highlight video according to the target material information and the keywords of the title.
According to the apparatus for generating the highlight video provided by this embodiment, information to be fused is determined based on the acquired keywords of the title, the attention object and its start-stop time, the attention event and its start-stop time, and the hotspot and its start-stop time; alternative material information is then obtained from the information to be fused, and the highlight video is generated from the alternative material information. Because the information to be fused is derived from multiple dimensions of the input video, the generated alternative material information covers many dimensions and is accurate and comprehensive, so that different types of highlight videos can be generated to meet diversified user requirements.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 9 is a block diagram of an electronic device for the method of training an information fusion model or generating a highlight video according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 901, a memory 902, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 9, one processor 901 is taken as an example.
Memory 902 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method of training an information fusion model or generating a highlight video provided herein. A non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform a method of training an information fusion model, or generating a highlight video, as provided herein.
The memory 902, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the method for training the information fusion model or generating a highlight video in the embodiments of the present application (for example, the title keyword acquisition module 701, the highlight attention information obtaining module 702, the information to be fused determining module 703, and the model training module 704 shown in fig. 7, or the title keyword acquisition module 801, the highlight attention information acquisition module 802, the information to be fused determining module 803, the alternative material information generating module 804, and the highlight video generation module 805 shown in fig. 8). The processor 901 executes various functional applications of the server and performs data processing by running the non-transitory software programs, instructions and modules stored in the memory 902, thereby implementing the method for training the information fusion model or generating the highlight video in the above method embodiments.
The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the electronic device for training the information fusion model or generating the highlight video, and the like. Further, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid state storage device. In some embodiments, the memory 902 may optionally include memory remotely located from the processor 901, which may be connected over a network to the electronic device for training the information fusion model or generating the highlight video. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the method of training the information fusion model or generating the highlight video may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903 and the output device 904 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 9.
The input device 903 may receive input numeric or character information and generate key signal inputs related to training the information fusion model, or user settings and function control of the electronic device that generates the highlight video, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick, and the like. The output devices 904 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability in conventional physical hosts and Virtual Private Server (VPS) services.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
It should be understood that the various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, which is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A method of training an information fusion model, comprising:
acquiring keywords of a title of an input video;
acquiring highlight attention information according to images and/or audio of each frame of the input video;
determining information to be fused based on the highlight attention information and the keywords of the title;
and taking the information to be fused as the input of an information fusion model, taking the alternative material information which is acquired from an input video and comprises characters, events and start-stop time as the expected output of the information fusion model corresponding to the information to be fused, and training an initial model of the information fusion model to obtain the trained information fusion model.
2. The method of claim 1, wherein the obtaining of highlight attention information from images and/or audio of frames of the input video comprises at least one of:
acquiring the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event according to the images of the frames;
and acquiring the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot according to the audio of each frame.
3. The method of claim 2, wherein the acquiring of the object of interest and the start-stop time of the object of interest from the images of the frames comprises:
inputting the image of each frame into a preset face detection model to obtain a face frame output by the face detection model and start-stop time of the face frame;
inputting the image of each frame into a preset human body detection model to obtain a human body detection frame output by the human body detection model and start-stop time of the human body detection frame;
and determining the attention object and the start-stop time of the attention object based on the face frame and the start-stop time of the face frame, and the human body detection frame and the start-stop time of the human body detection frame.
4. The method of claim 2, wherein the obtaining of the event of interest and the start-stop time of the event of interest from the images of the frames comprises at least one of:
inputting the image of each frame into a preset subtitle detection model to obtain subtitle information output by the subtitle detection model, and filtering and extracting key information of the subtitle information to obtain an attention event and the starting and ending time of the attention event;
inputting each frame image into a preset action detection model to obtain an attention event output by the action detection model and the starting and ending time of the attention event.
5. The method of claim 2, wherein the obtaining of the event of interest and the start-stop time of the event of interest from the audio of each frame comprises:
inputting the audio frequency of each frame into a preset voice recognition model to obtain character information which is output by the voice recognition model and corresponds to the audio frequency;
and extracting keywords from the text information to obtain the attention event and the start-stop time of the attention event.
6. The method of claim 2, wherein the obtaining of the hot spot and the start-stop time of the hot spot according to the audio of each frame comprises:
and inputting the audio of each frame into a preset boiling point detection model to obtain a hot point output by the boiling point detection model and the starting and stopping time of the hot point.
7. The method according to any one of claims 1-6, wherein the determining information to be fused based on the highlight attention information and the keywords of the title comprises:
determining the information to be fused based on the highlight attention information, the confidence of the highlight attention information, and the keywords of the title;
and the alternative material information further includes at least one of: the confidence of the person, the confidence of the event, the confidence of the start-stop time, and the overall confidence of the alternative material information.
8. A method of generating a highlight video, comprising:
acquiring keywords of a title of an input video;
acquiring highlight attention information according to images and/or audio of each frame of the input video;
determining information to be fused based on the highlight attention information and the keywords of the title;
inputting the information to be fused into an information fusion model trained by the method for training the information fusion model according to any one of claims 1-7, wherein the information fusion model outputs a plurality of pieces of alternative material information including characters, events and start-stop time;
and determining target material information from the alternative material information, intercepting videos in a time period corresponding to the target material information from the input videos as target segments, and splicing the target segments into a highlight video.
9. The method according to claim 8, wherein the obtaining of highlight attention information from images and/or audio of frames of the input video comprises at least one of:
acquiring the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event according to the images of the frames;
and acquiring the attention event and the start-stop time of the attention event, and the hotspot and the start-stop time of the hotspot according to the audio of each frame.
10. The method of claim 8, further comprising:
and generating a title and/or an abstract of the highlight video according to the target material information and the keywords of the title.
11. The method of claim 8, wherein the target material information is determined from the alternative material information based on the following rules:
and determining the alternative material information comprising the target person and/or the target event as the target material information.
12. The method of claim 8, wherein the target material information is determined from the alternative material information based on the following rules:
determining the alternative material information meeting at least one of the following conditions as target material information:
the alternative material information comprises a target person, and the confidence of the target person is greater than a first preset threshold value and/or the overall confidence is greater than a second preset threshold value;
and the alternative material information comprises a target event, and the confidence of the target event is greater than a third preset threshold value and/or the overall confidence is greater than a fourth preset threshold value.
13. The method according to claim 12, wherein the first preset threshold, the second preset threshold, the third preset threshold and the fourth preset threshold are determined based on a duration of a highlight video to be generated.
14. An apparatus for training an information fusion model, comprising:
a title keyword acquisition module configured to acquire a keyword of a title of an input video;
the system comprises a highlight attention information acquisition module, a highlight attention information acquisition module and a highlight attention information processing module, wherein the highlight attention information acquisition module is configured to acquire highlight attention information according to images and/or audios of frames of an input video;
the information to be fused determining module is configured to determine information to be fused based on the highlight attention information and the keywords of the title;
and the model training module is configured to take the information to be fused as the input of an information fusion model, take the alternative material information which is acquired from an input video and comprises the characters, the events and the start-stop time as the expected output of the information fusion model corresponding to the information to be fused, train an initial model of the information fusion model and obtain the trained information fusion model.
15. The apparatus of claim 14, the highlight attention information acquisition module comprising at least one of:
the first acquisition sub-module is configured to acquire the attention object and the start-stop time of the attention object, and the attention event and the start-stop time of the attention event according to the images of the frames;
and the second acquisition sub-module is configured to acquire the attention event and the start-stop time of the attention event, the hotspot and the start-stop time of the hotspot according to the audio of each frame.
16. An apparatus for generating a highlight video, comprising:
a title keyword acquisition module configured to acquire a keyword of a title of an input video;
the system comprises a highlight attention information acquisition module, a highlight attention information acquisition module and a highlight attention information processing module, wherein the highlight attention information acquisition module is configured to acquire highlight attention information according to images and/or audios of frames of an input video;
the information to be fused determining module is configured to determine information to be fused based on the highlight attention information and the keywords of the title;
the alternative material information generation module is configured to input the information to be fused into an information fusion model trained by the method for training the information fusion model according to any one of claims 1 to 7, and the information fusion model outputs a plurality of pieces of alternative material information including characters, events and start-stop time;
and the highlight video generation module is configured to determine target material information from the alternative material information, intercept videos in a time period corresponding to the target material information from the input video as target segments, and splice the target segments into a highlight video.
17. The apparatus of claim 16, the highlight attention information acquisition module comprising at least one of:
a first acquisition sub-module configured to acquire an attention object and start-stop times of the attention object, a first attention event and start-stop times of the first attention event according to the images of the frames;
and the second obtaining sub-module is configured to obtain the second attention event and the start-stop time of the second attention event, the hotspot and the start-stop time of the hotspot according to the audio of each frame.
18. The apparatus of claim 16, the apparatus further comprising:
and the title generation module is configured to generate a title and/or an abstract of the highlight video according to the target material information and the keywords of the title.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7 or 8-13.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-7 or 8-13.