CN110166650B - Video set generation method and device, computer equipment and readable medium - Google Patents
- Publication number
- CN110166650B (application CN201910355708.1A)
- Authority
- CN
- China
- Prior art keywords
- video
- specified
- designated
- detection model
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content by decomposing the content in the time domain, e.g. in time segments
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Television Signal Processing For Recording (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Studio Circuits (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention provides a video set generation method and apparatus, a computer device, and a readable medium. The method includes the following steps: acquiring a plurality of videos related to a specified entity based on a pre-established knowledge graph; clipping, according to a pre-trained action detection model, a plurality of video segments containing a specified action from the plurality of videos; and splicing the plurality of video segments containing the specified action together to obtain the video set. With this technical solution, the invention provides a scheme for generating a video set efficiently and automatically. Because the video set is generated based on the knowledge graph and AI, the accuracy of the clipped video segments and of the generated video set can be effectively guaranteed, no manual clipping is needed in the generation process, and the video set is generated very efficiently.
Description
[ technical field ]
The present invention relates to the field of computer application technologies, and in particular, to a method and an apparatus for generating a video set, a computer device, and a readable medium.
[ background of the invention ]
With the rapid development of multimedia and the internet, video has become an indispensable way for users to obtain information. Users can not only learn new knowledge through videos, but also watch travel videos, entertainment videos and the like anytime and anywhere for leisure and entertainment.
In the prior art, video resources are very rich, each video carries a large amount of information, and a single film or television actor may have appeared in many works. Therefore, if a user wants to obtain a video set containing a common specified action from the large number of videos in a video library, the user has to browse every video and manually clip, from each video that contains the specified action, the video segment containing that action. The manually clipped video segments are then manually concatenated together to generate the video set.
As can be seen from the above, generating a video set in this way is very time-consuming and labor-intensive, and because the editing is manual, the clipping precision is low: the resulting video set may contain video segments other than the specified action, or the segments of the specified action may be clipped incompletely. It is therefore desirable to provide an efficient video set generation scheme.
[ summary of the invention ]
The invention provides a video set generation method and apparatus, a computer device, and a readable medium, so as to provide an efficient scheme for generating a video set.
The invention provides a method for generating a video set, which comprises the following steps:
acquiring a plurality of videos related to a specified entity based on a pre-established knowledge graph;
clipping, according to a pre-trained action detection model, a plurality of video segments containing a specified action from the plurality of videos;
stitching together the plurality of video segments that contain the specified action, resulting in the video set.
The invention provides a video set generation device, which comprises:
the acquisition module is used for acquiring a plurality of videos related to the specified entity based on a pre-established knowledge graph;
the clipping module is used for clipping a plurality of video segments containing specified actions from the plurality of videos according to a pre-trained action detection model;
a splicing module, configured to splice the plurality of video segments including the specified action together to obtain the video set.
The present invention also provides a computer apparatus, the apparatus comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of generating a video set as described above.
The invention also provides a computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method of generating a video set as described above.
With the above technical solutions, the video set generation method and apparatus, computer device and readable medium of the present invention overcome the technical problems in the prior art that generating a video set manually is time-consuming and labor-intensive and that the generated video set is of low precision, and provide a scheme for generating a video set efficiently and automatically. Because the video set is generated based on the knowledge graph and AI, the accuracy of the clipped video segments and of the generated video set can be effectively guaranteed, no manual clipping is needed in the generation process, and the video set is generated very efficiently.
[ description of the drawings ]
Fig. 1 is a flowchart of a first embodiment of a method for generating a video set according to the present invention.
Fig. 2 is a flowchart of a second embodiment of a method for generating a video set according to the present invention.
Fig. 3 is a flowchart of a third embodiment of a method for generating a video set according to the present invention.
Fig. 4 is a schematic structural diagram of a temporal convolutional network according to the present invention.
Fig. 5 is a block diagram of a first embodiment of a video set generation apparatus according to the present invention.
Fig. 6 is a block diagram of a second embodiment of the video set generation apparatus according to the present invention.
Fig. 7 is a block diagram of a third embodiment of the video set generation apparatus according to the present invention.
Fig. 8 is a block diagram of a fourth embodiment of the video set generation apparatus according to the present invention.
FIG. 9 is a block diagram of an embodiment of a computer device of the present invention.
Fig. 10 is an exemplary diagram of a computer device provided by the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Artificial Intelligence (AI) is a new technical science that researches and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, expert systems, and the like. Based on AI techniques, various neural network models can be employed to implement various applications.
A knowledge graph (also called a scientific knowledge map) is a knowledge-domain visualization or knowledge-domain mapping in library and information science: a series of graphs that show the development process and structural relationships of knowledge, in which visualization techniques are used to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interrelations among knowledge resources and carriers. Knowledge graphs combine theories and methods from applied mathematics, graphics, information visualization technology and information science with methods such as citation analysis and co-occurrence analysis, and use visual graphs to vividly show the core structure, development history, frontier fields and overall knowledge framework of a discipline, achieving multi-disciplinary fusion and thereby providing a practical and valuable reference for research. By constructing a knowledge graph, existing data in various fields can store all the entities in the data, the relationships among the entities, and the attributes and attribute values of the entities, which facilitates business needs.
The present invention generates video sets based on AI and a knowledge graph, provides an efficient video set generation scheme, and supports batch, large-scale generation of video sets.
Fig. 1 is a flowchart of a first embodiment of a method for generating a video set according to the present invention. As shown in fig. 1, the method for generating a video set according to this embodiment may specifically include the following steps:
S100, acquiring a plurality of videos related to a specified entity based on a pre-established knowledge graph;
the executing body of the method for generating a video set in this embodiment may be a device for generating a video set, and the device for generating a video set may be an electronic entity, or may also be an application integrated through software.
In this embodiment, the pre-established knowledge graph may refer to an existing construction method of the knowledge graph to construct the knowledge graph in the video field. The structure of the knowledge-graph may include a plurality of entities, relationships between entities, attributes and attribute values of the entities, and the like. In the knowledge graph in the video field, the main entity may be the name of the video, and if the video is a person-like video, the related entities may include the names of actors, characters, director, producer, and theme songs in the video. In the case of an animal video, the related entities may include the type of animal in the video, and so on.
The designated entity of the present embodiment may be a human entity, an animal entity, or other types of entities. If the designated entity is an entity corresponding to certain designated personal information, the designated action in this embodiment may be any action that can be detected, for example, actions such as kissing, fighting, and racing car. If the designated entity is an entity corresponding to certain designated animal information, the designated action of the embodiment may be any action of the animal, for example, actions such as running, walking a single-log bridge, drilling a fire circle, eating, and the like. Similarly, the designated entity in this embodiment may also be other stars such as the earth, the moon, and the sun, and the designated action may also be designated as a whole-day meal, a partial-day meal, a lunar meal, and so on. Or the designated entity of this embodiment may also be another entity, and the designated action may also be that the designated entity can complete another action, which is not described in detail herein one by one.
Before step S100, the video set generation apparatus may receive a user-triggered video set generation request, where the request may carry a specified entity and a specified action, so as to request generation of a video set that is related to the specified entity and contains the specified action.
Correspondingly, in step S100, after receiving the video set generation request, the video set generation apparatus may obtain a plurality of pieces of video information, such as video names, corresponding to the specified entity according to the relationships between entities in the pre-established knowledge graph, and then obtain the video corresponding to each piece of video information from a video library, resulting in a plurality of videos.
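For illustration only, the following Python sketch shows one possible way to realize this lookup, assuming the knowledge graph is available as (entity, relation, entity) triples and the video library maps titles to file paths; the relation names, data structures and paths are hypothetical and not part of the invention.

```python
# Illustrative sketch only: assumes the knowledge graph is a list of
# (head_entity, relation, tail_entity) triples and the video library maps
# video titles to file paths. All names here are hypothetical.
from typing import Dict, List, Tuple

KnowledgeGraph = List[Tuple[str, str, str]]   # (head, relation, tail) triples


def find_related_video_titles(kg: KnowledgeGraph, specified_entity: str) -> List[str]:
    """Collect titles of videos linked to the specified entity, e.g. via 'acted_in'."""
    titles = []
    for head, relation, tail in kg:
        # e.g. ("Actor A", "acted_in", "Movie X") or ("Movie Y", "directed_by", "Actor A")
        if head == specified_entity and relation in ("acted_in", "appears_in"):
            titles.append(tail)
        elif tail == specified_entity and relation in ("directed_by", "features"):
            titles.append(head)
    return titles


def fetch_videos(video_library: Dict[str, str], titles: List[str]) -> List[str]:
    """Resolve each title to a video file path in the library (step S100)."""
    return [video_library[t] for t in titles if t in video_library]


if __name__ == "__main__":
    kg = [("Actor A", "acted_in", "Movie X"), ("Movie Y", "directed_by", "Actor A")]
    library = {"Movie X": "/videos/movie_x.mp4", "Movie Y": "/videos/movie_y.mp4"}
    print(fetch_videos(library, find_related_video_titles(kg, "Actor A")))
```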
S101, according to a pre-trained motion detection model, cutting a plurality of video segments containing specified motions from a plurality of videos;
In this embodiment, a pre-trained action detection model is used to identify the video segments of the specified action in the plurality of videos, and the plurality of video segments containing the specified action are then clipped from the plurality of videos. A given video may contain no video segment of the specified action, or may contain one, two or more such segments; overall, a plurality of video segments containing the specified action can be clipped from the plurality of videos, and the number of video segments may be larger or smaller than the number of videos.
In this embodiment, one action detection model can recognize only one specified action. If another specified action is to be detected, another action detection model needs to be trained.
Specifically, when the plurality of video segments containing the specified action are clipped from the plurality of videos according to the pre-trained action detection model, each video may be input into the action detection model, and the video segments of the specified action in that video are output directly by the action detection model. It should be noted that, before this step is performed, the action detection model needs to be trained in a corresponding manner. Before training, a plurality of training videos and the video segments containing the specified action in each training video are collected. During training, each collected training video is input into the action detection model, and the action detection model outputs the predicted video segment containing the specified action in that training video. A loss function is then constructed from the predicted starting and ending points of the video segment containing the specified action and the real starting and ending points of that segment, and it is judged whether the loss function converges (for example, whether it is smaller than a preset threshold). If it does not converge, the parameters of the action detection model are adjusted so that the video segment predicted by the action detection model and the real video segment containing the specified action tend to coincide. The action detection model is trained continuously in this manner with the plurality of training videos and the video segments containing the specified action in each training video until the loss function converges; the parameters of the action detection model are then determined, the action detection model is thereby determined, and training of the model is complete.
The action detection model trained as above is used by inputting a video into the model: if the video contains a video segment of the specified action, the action detection model can directly output that segment, so that by processing the plurality of videos in this way, a plurality of video segments containing the specified action can be clipped from them.
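As an illustrative sketch only, the following Python code outlines this video-granularity flow, assuming a hypothetical detect_segments callable that stands in for the trained action detection model and returns (start, end) times in seconds, and assuming the ffmpeg tool is available for cutting the segments.

```python
# Hedged sketch of the video-granularity flow: each video is fed to the trained
# action detection model, which returns the segments containing the specified
# action; each segment is then clipped out with ffmpeg. `detect_segments` is a
# hypothetical stand-in for the model.
import os
import subprocess
from typing import Callable, List, Tuple


def clip_specified_action(videos: List[str],
                          detect_segments: Callable[[str], List[Tuple[float, float]]],
                          out_dir: str = "clips") -> List[str]:
    os.makedirs(out_dir, exist_ok=True)
    clips = []
    for i, path in enumerate(videos):
        for j, (start, end) in enumerate(detect_segments(path)):   # times in seconds
            out = f"{out_dir}/clip_{i}_{j}.mp4"
            # Cut [start, end] without re-encoding; frame accuracy depends on keyframes.
            subprocess.run(["ffmpeg", "-y", "-i", path, "-ss", str(start),
                            "-to", str(end), "-c", "copy", out], check=True)
            clips.append(out)
    return clips
```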
In the above technical solution, the action detection model identifies the specified action in a video at the granularity of the whole video, and a plurality of video segments containing the specified action are thereby clipped from the plurality of videos. In this embodiment, each video segment contains only one specified action.
S102, splicing the plurality of video segments containing the specified action together to obtain the video set.
Specifically, the plurality of video segments containing the specified action are concatenated one after another to obtain the video set, related to the specified entity and containing the specified action, that the user requires. When the plurality of video segments are spliced, they can be spliced in a random order; no fixed order is required.
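For illustration, the following Python sketch concatenates the clipped segments into the video set using ffmpeg's concat demuxer, assuming all clips share the same codec and resolution (otherwise re-encoding would be needed); the random order reflects the splicing described in this embodiment, and all file names are hypothetical.

```python
# Hedged sketch of step S102: the clipped segments are concatenated one after
# another into the final video set via ffmpeg's concat demuxer.
import random
import subprocess
import tempfile
from typing import List


def splice_clips(clips: List[str], output: str = "video_set.mp4", shuffle: bool = True) -> str:
    order = clips[:]
    if shuffle:                      # random order is acceptable in this embodiment
        random.shuffle(order)
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for clip in order:
            f.write(f"file '{clip}'\n")
        list_file = f.name
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_file, "-c", "copy", output], check=True)
    return output
```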
With the above technical solution, the video set generation method of this embodiment overcomes the technical problems in the prior art that generating a video set manually is time-consuming and labor-intensive and that the generated video set is of low precision, and provides a scheme for generating a video set efficiently and automatically. Because the video set is generated based on the knowledge graph and AI, the accuracy of the clipped video segments and of the generated video set can be effectively guaranteed, no manual clipping is needed in the generation process, and the video set is generated very efficiently.
Fig. 2 is a flowchart of a second embodiment of a method for generating a video set according to the present invention. As shown in fig. 2, the method for generating a video set according to this embodiment is based on the technical solution of the embodiment shown in fig. 1, and the technical solution of the present invention is described in detail. As shown in fig. 2, the method for generating a video set according to this embodiment may specifically include the following steps:
S200, receiving, through a human-machine interface module, a video set generation request input by a user and carrying a specified entity and a specified action;
For example, the user may send, through the human-machine interface module, a video set generation request carrying actor A and a specified action to the video set generation apparatus, to request the video set of the specified action completed by actor A. Or the user may send, through the human-machine interface module, a video set generation request carrying director B and a specified action, to request the video set of the specified action in works directed by director B. Or the user may send, through the human-machine interface module, a video set generation request carrying "dog" and "gnawing a bone", to request a video set of dogs gnawing bones.
The human-machine interface module of this embodiment may be a mouse, a keyboard, a touch screen, or a microphone that can receive a request initiated by the user in voice form, and the like.
S201, acquiring a plurality of video information corresponding to a specified entity according to a relationship between the entities in a pre-established knowledge graph;
s202, acquiring a video corresponding to each piece of video information from a video library to obtain a plurality of videos;
the steps S201 and S202 are a specific implementation form of the step S100 in the embodiment shown in fig. 1, and reference may be made to the description of the embodiment shown in fig. 1 for details, which are not repeated herein.
S203, extracting the images of each frame according to the time sequence of each video to obtain a group of image sequences;
s204, predicting a starting point and an end point of a designated action in a corresponding video according to each group of image sequences and a pre-trained action detection model;
s205, according to the starting point and the ending point of each appointed action, corresponding video segments containing the appointed actions are clipped from the corresponding video, and a plurality of video segments are obtained in total;
the above steps S203-S205 are an implementation manner of the step S101 of the embodiment shown in fig. 1.
Specifically, the action detection model in this embodiment identifies the specified action in a video at the granularity of individual frame images, and a plurality of video segments containing the specified action are thereby clipped from the plurality of videos.
In this scheme, the action detection model recognizes the specified action by detecting images. Since the specified action has a certain duration, in this embodiment the starting point and the ending point of the specified action in a video can be predicted by identifying each frame image in the image sequence of that video, and the video segment containing the specified action is then clipped based on that starting point and ending point.
For example, in a specific implementation, each image in each group of image sequences may be input into the action detection model, and the action detection model predicts the probability that the image is the starting point of the specified action and the probability that it is the ending point of the specified action. The moment corresponding to an image whose starting-point probability is greater than a preset probability threshold is then taken as a starting point of the specified action in the corresponding video, and the moment corresponding to an image whose ending-point probability is greater than the preset probability threshold is taken as an ending point of the specified action in that video. The preset probability threshold of this embodiment may be set empirically, for example to a value greater than 0.5 and less than 1.
That is, for each image sequence of a video, images are selected one at a time in front-to-back order and input into the action detection model. Correspondingly, the action detection model outputs two probability values for the image: the probability that the image is the starting point of the specified action and the probability that it is the ending point. If the starting-point probability or the ending-point probability is greater than the preset probability threshold, the image is taken as the corresponding starting point or ending point. In practice an image cannot be both a starting point and an ending point, so the situation where both probabilities exceed the preset probability threshold at the same time does not arise in practical applications. In this way, by inputting the images of an image sequence into the action detection model one by one from front to back, the image corresponding to the starting point of the specified action can be predicted, and the starting point is determined from the time of that image. Continuing the detection in the same manner, the ending point of the specified action can be determined. In the detection process, starting points and ending points occur in pairs; a single video may contain only one pair of starting and ending points, or several pairs.
According to the embodiment, the start points and the end points of a plurality of specified actions can be acquired, and for each pair of start points and end points in each acquired video, corresponding video segments containing the specified actions can be clipped from the corresponding video, so that a plurality of video segments containing the specified actions can be obtained in total.
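The following Python sketch illustrates, under stated assumptions, how starting and ending points might be derived from per-frame probabilities as described above; the per-frame scoring function stands in for the action detection model, and the threshold of 0.5 follows the example given here.

```python
# Hedged sketch of steps S203-S205: each frame is scored with a start probability
# and an end probability; frames whose start (end) probability exceeds a preset
# threshold mark the start (end) of the specified action, and starts and ends are
# taken in pairs. The per-frame scoring function is hypothetical.
from typing import Callable, List, Tuple


def find_action_intervals(frame_times: List[float],
                          score_frame: Callable[[int], Tuple[float, float]],
                          threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start_time, end_time) pairs of the specified action in one video."""
    intervals, pending_start = [], None
    for idx, t in enumerate(frame_times):            # scan frames front to back
        p_start, p_end = score_frame(idx)            # probabilities from the model
        if pending_start is None and p_start > threshold:
            pending_start = t                        # starting point found
        elif pending_start is not None and p_end > threshold:
            intervals.append((pending_start, t))     # matching ending point found
            pending_start = None
    return intervals
```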
Alternatively, in practical applications, if the user requests a video set of the specified action performed by actor A, then although all videos in which actor A appears can be acquired according to steps S201 and S202, those videos may also contain other actors performing the specified action, so a video segment acquired in step S205 may not be a segment of the specified action performed by actor A. Therefore, after step S203 and before step S204 of the above embodiment, the following processing may be included:
If the specified entity is specified person information, face detection is performed on the images in each group of image sequences based on the specified person information, and images that do not include the specified person are deleted from each group of image sequences. In this way, the subsequently acquired video segments of the specified action all include the specified person, which improves the accuracy of the acquired video segments.
Similarly, if the specified entity is specified animal information, feature detection is performed on the images in each group of image sequences based on the specified animal information, and images that do not include the specified animal are deleted from each group of image sequences, so that the subsequently acquired video segments of the specified action all include the specified animal, which improves the accuracy of the acquired video segments.
Alternatively, after the plurality of video segments are obtained in step S205, face detection may be performed on the clipped video segments containing the specified action based on the specified person information, and the video segments that do not include the specified person may be deleted; the specified action in the retained video segments is then regarded as completed by the person corresponding to the specified person information.
Similarly, if the specified entity is specified animal information, feature detection may be performed on the clipped video segments containing the specified action based on the specified animal information, and the video segments that do not include the specified animal are deleted; the specified action in the retained video segments is regarded as completed by the animal corresponding to the specified animal information.
The face detection of this embodiment may be performed as follows: several face templates corresponding to the specified person information are preset; during detection, the faces in each frame image of the video are extracted and matched against the preset face templates of the specified person, and if the similarity is greater than a preset threshold, the face in the image is considered to be the face of the specified person; otherwise it is not. Alternatively, a face detection model corresponding to the specified person information may be trained: a number of face images of that person are collected in advance to train the model so that it can accurately recognize the face of the specified person. In use, each frame image of the video is input into the face detection model, which predicts whether the image includes the face of the specified person.
If the specified entity is specified animal information, detection is performed by feature detection. It should be noted that different animals may require different features to be collected; in particular, the images chosen as feature templates should contain features that clearly distinguish the specified animal from other animals. For example, in practical applications some animal types can be distinguished by the animal's head, in which case several head images are collected as preset feature templates. If the animal cannot be distinguished by the head alone, other feature information of the animal's body can be added, and the images collected as feature templates then need to include not only the head but also that other feature information. The specific detection process follows the same principle as the face detection described above. Likewise, feature detection may also be implemented with a feature detection model for the specified animal, whose principle is the same as that of the face detection model; details are as described above and are not repeated here.
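As an illustrative sketch of the template-matching idea above, the following Python code compares face embeddings of each frame against preset templates of the specified person and keeps only matching frames; the embedding extractor, the cosine-similarity measure and the threshold value are assumptions, since the patent does not fix a particular similarity computation.

```python
# Hedged sketch of face-template matching: faces extracted from a frame are
# compared against preset templates of the specified person, and the frame is
# kept only if some face is similar enough. Any face model producing fixed-length
# embedding vectors would do; the similarity and threshold here are illustrative.
from typing import List
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def frame_contains_person(face_embeddings: List[np.ndarray],
                          person_templates: List[np.ndarray],
                          threshold: float = 0.8) -> bool:
    """True if any detected face matches any template of the specified person."""
    return any(cosine_similarity(face, tpl) > threshold
               for face in face_embeddings for tpl in person_templates)


def filter_image_sequence(frames_embeddings: List[List[np.ndarray]],
                          person_templates: List[np.ndarray]) -> List[int]:
    """Indices of frames to keep (those that include the specified person)."""
    return [i for i, faces in enumerate(frames_embeddings)
            if frame_contains_person(faces, person_templates)]
```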
S206, according to a preset splicing rule, splicing a plurality of video clips containing the specified action together to obtain a video set.
In this embodiment, a splicing rule may be preset. For example, the preset splicing rule may order the video segments from short to long or from long to short by duration, or from earliest to latest or latest to earliest by the release date of the video corresponding to each segment; another preset splicing rule may also be used, or the segments may be spliced randomly as in the embodiment shown in fig. 1. According to the chosen rule, the plurality of video segments containing the specified action are spliced together to obtain the video set.
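For illustration, the following Python sketch orders the clipped segments according to the splicing rules listed above (by duration or by release date) before concatenation; the Clip structure and its field names are illustrative and not taken from the patent.

```python
# Hedged sketch of the preset splicing rules: clips carry a duration and a release
# date, and the splice order is chosen by rule before concatenation.
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Clip:
    path: str
    duration: float       # seconds
    release_date: date    # showing date of the source video


def order_clips(clips: List[Clip], rule: str = "duration_asc") -> List[Clip]:
    if rule == "duration_asc":          # short to long
        return sorted(clips, key=lambda c: c.duration)
    if rule == "duration_desc":         # long to short
        return sorted(clips, key=lambda c: c.duration, reverse=True)
    if rule == "date_asc":              # earliest release first
        return sorted(clips, key=lambda c: c.release_date)
    if rule == "date_desc":             # most recent release first
        return sorted(clips, key=lambda c: c.release_date, reverse=True)
    return clips                        # fall back to the given (random) order
```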
With the above technical solution, the video set generation method of this embodiment provides a scheme for generating a video set efficiently and automatically. Because the video set is generated based on the knowledge graph and AI, the accuracy of the clipped video segments and of the generated video set can be effectively guaranteed, no manual clipping is needed in the generation process, and the video set is generated very efficiently.
Fig. 3 is a flowchart of a third embodiment of a method for generating a video set according to the present invention. As shown in fig. 3, the method for generating a video set according to this embodiment introduces details of a training process of a motion detection model adopted in the embodiment shown in fig. 2 on the basis of the technical solution of the embodiment shown in fig. 2. The method for generating a video set according to this embodiment may specifically include the following steps:
S300, collecting a plurality of training video segments containing the specified action, and labeling the real starting point and the real ending point of the specified action in each training video segment;
S301, training the action detection model according to the plurality of training video segments and the real starting point and real ending point of the specified action labeled in each training video segment.
Consistent with figs. 1 and 2 above, the execution body of the video set generation method of this embodiment may likewise be the video set generation apparatus. That is, the action detection model is first trained by the video set generation apparatus, and the apparatus then generates a video set based on the trained action detection model and the knowledge graph using the technical solution of the embodiment shown in fig. 2.
Alternatively, unlike the embodiments shown in figs. 1 and 2, the execution body of the video set generation method of this embodiment may be a training apparatus for the action detection model that is independent of the video set generation apparatus. In that case, the action detection model is trained by the training apparatus; when a video set is to be generated, the video set generation apparatus directly calls the trained action detection model and the pre-established knowledge graph and generates the video set using the technical solution of the embodiment shown in fig. 2.
Before the action detection model of this embodiment is trained, a plurality of training video segments containing the specified action need to be collected, and the real starting point and the real ending point of the specified action in each training video segment need to be labeled, to serve as references for subsequently adjusting the parameters of the action detection model.
During the specific training, step S301 may specifically include the following steps:
(a) for each training video clip, extracting images of each frame according to the time sequence to obtain a group of training image sequences;
(b) predicting a prediction starting point and a prediction ending point of a designated action in a corresponding training video segment according to each group of training image sequences and the action detection model;
the implementation manners of steps (a) and (b) are the same as the implementation manners of step S203 and step S204 in the embodiment shown in fig. 2, and reference may be made to the related descriptions of the embodiment for details, which are not described herein again.
(c) Calculating a mean square error loss function according to a real starting point and a prediction starting point, a real end point and a prediction end point of a specified action in each training video segment;
(d) judging whether the mean square error loss function is converged; if not, executing step (e);
For example, in a specific implementation, a small preset threshold may be set, and whether the value of the mean square error loss function is smaller than that threshold is judged; if so, the loss function is considered to have converged, otherwise it has not.
(e) Updating the parameters of the action detection model by gradient descent, and performing step (f);
In this embodiment, updating the parameters of the action detection model by gradient descent brings the predicted starting point closer to the real starting point and the predicted ending point closer to the real ending point after each update, so that the value of the mean square error loss function tends to converge.
(f) Repeatedly training the action detection model with each training video segment in the above manner, that is, repeating steps (b) to (e), until the mean square error loss function converges; the parameters of the action detection model are then determined, the action detection model is thereby determined, and its training is complete.
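The following PyTorch sketch illustrates one possible realization of this training loop under stated assumptions: the mean square error is computed between per-frame start/end probabilities and 0/1 targets marking the labeled starting and ending frames, and plain stochastic gradient descent updates the parameters; the dataset layout and model interface are assumptions rather than details fixed by the patent.

```python
# Hedged sketch of the training loop in steps (a)-(f). Assumes the model maps a
# (channels, frames) feature tensor to per-frame (start, end) probabilities and
# that targets are (frames, 2) tensors of 0/1 boundary indicators.
import torch
from torch import nn


def train_action_detector(model: nn.Module,
                          dataset: list,        # list of (features, targets) tensor pairs
                          epochs: int = 10,
                          lr: float = 1e-3,
                          tol: float = 1e-4) -> nn.Module:
    """features: (C, T) frame features; targets: (T, 2) start/end indicators."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient descent
    mse = nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for features, targets in dataset:
            optimizer.zero_grad()
            preds = model(features.unsqueeze(0)).squeeze(0)   # (T, 2) probabilities
            loss = mse(preds, targets)
            loss.backward()
            optimizer.step()                                  # update parameters
            total += loss.item()
        if total / max(len(dataset), 1) < tol:                # treat small loss as converged
            break
    return model
```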
In this embodiment, a three-layer temporal convolutional network may be constructed as the action detection model. Each layer of the temporal convolutional network takes one sequence as input and outputs a sequence of the same length; the value of the output sequence at each time point is determined by the input data at the current, previous and next time points of the input sequence. The input of the action detection model is a feature sequence extracted from consecutive video frames by a convolutional neural network, and the uppermost layer outputs, for each time point, the probability that the action starts or ends at that moment. In the training phase, both the starting point and the ending point are known and can therefore be used to supervise the training. Fig. 4 is a schematic structural diagram of the temporal convolutional network according to the present invention; as shown in fig. 4, the network may include multiple layers.
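As an illustrative sketch only, the following PyTorch module realizes a three-layer temporal convolutional network of the kind described: kernel size 3 with padding 1 makes each output step depend on the current, previous and next time steps and keeps the sequence length unchanged, and the top layer emits per-frame start and end probabilities; the channel sizes are assumptions.

```python
# Hedged sketch of the three-layer temporal convolution network described above.
import torch
from torch import nn


class TemporalActionBoundaryNet(nn.Module):
    def __init__(self, in_channels: int = 2048, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=3, padding=1),   # 2 outputs: start, end
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, T) frame features extracted by a CNN backbone
        logits = self.net(x)                            # (batch, 2, T), same length
        return torch.sigmoid(logits).transpose(1, 2)    # (batch, T, 2) probabilities


if __name__ == "__main__":
    model = TemporalActionBoundaryNet()
    probs = model(torch.randn(1, 2048, 120))   # 120 frames of 2048-d features
    print(probs.shape)                          # torch.Size([1, 120, 2])
```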
With the above technical solution, the video set generation method of this embodiment can train an efficient action detection model, so that a video set can subsequently be generated based on the action detection model and the pre-established knowledge graph, thereby effectively guaranteeing the accuracy of each video segment in the video set and the accuracy of the generated video set.
The generation process of a video set according to the present application is described below, taking as an example a user requesting a video set of fighting actions completed by a well-known actor Q.
Specifically, the video set generation apparatus receives the user's video set generation request, which carries the information of the well-known actor Q and the fighting action. The video set generation apparatus can then obtain the titles of all films and television works in which actor Q appears according to the correspondence, in the pre-established knowledge graph, between actor Q and the titles of those works, and can further obtain the videos of all of those works from the video library, for example obtaining a plurality of videos in total.
In this embodiment, an action detection model for identifying the starting point and the ending point of a fighting action in a video may be trained in advance. For each acquired video, the image of each frame is extracted in time order to obtain a group of image sequences; each image in the image sequence is then input, in order, into the pre-trained action detection model, which predicts the probability that the image is the starting point of a fighting action and the probability that it is the ending point. Among the prediction results for the images in the sequence, the moment corresponding to an image whose starting-point probability exceeds the preset probability threshold, for example 0.5, is a starting point of the fighting action; correspondingly, the moment corresponding to the nearest subsequent image whose ending-point probability exceeds the preset probability threshold, for example 0.5, is the ending point of that fighting action. Since the starting point and the ending point of a fighting action occur in pairs, the corresponding video segment can be clipped from the video based on them. If a video contains several fighting passages, a video segment containing a fighting action can be clipped at each such position in the manner described above, so that a plurality of video segments containing fighting actions are finally obtained for actor Q.
However, to avoid retaining video segments that do not actually include actor Q among the plurality of segments containing fighting actions, face detection may be performed on the segments using pre-established face templates of actor Q after the segments are obtained: segments that do not include actor Q are deleted, only segments that include actor Q are retained, and the fighting actions in the retained segments are regarded as completed by actor Q. Finally, the retained video segments are spliced together according to a preset splicing rule to generate the video set of fighting actions completed by the well-known actor Q.
The above scenario is only one application scenario of this embodiment; in practical applications, video sets of specified actions related to other specified entities may also be generated in other scenarios, which are not described here one by one.
Fig. 5 is a block diagram of a first embodiment of a video set generation apparatus according to the present invention. As shown in fig. 5, the apparatus for generating a video set according to this embodiment may specifically include:
the acquisition module 10 is configured to acquire a plurality of videos related to a specified entity based on a pre-established knowledge graph;
the clipping module 11 is configured to clip a plurality of video segments including a specified action from the plurality of videos acquired by the acquisition module 10 according to a pre-trained action detection model;
the splicing module 12 is configured to splice together the plurality of video segments containing the specified action obtained by the clipping module 11, to obtain the video set.
Further optionally, the acquisition module 10 is specifically configured to:
acquiring a plurality of pieces of video information corresponding to the specified entity according to the relationships between entities in the knowledge graph;
and acquiring the video corresponding to each piece of video information from the video library to obtain the plurality of videos.
The implementation principle and technical effect of the apparatus for generating a video set according to this embodiment that uses the above modules to generate a video set are the same as those of the related method embodiment, and reference may be made to the description of the related method embodiment in detail, which is not described herein again.
Fig. 6 is a block diagram of a second embodiment of the video set generation apparatus according to the present invention. As shown in fig. 6, in the video set generation apparatus of this embodiment, on the basis of the technical solution of the embodiment shown in fig. 5, the clipping module 11 may specifically include:
the extraction unit 111 is configured to, for each video acquired by the acquisition module 10, extract the image of each frame in time order to obtain a group of image sequences;
the prediction unit 112 is configured to predict the starting point and the ending point of the specified action in the corresponding video according to each group of image sequences extracted by the extraction unit 111 and the action detection model;
the clipping unit 113 is configured to clip corresponding video segments including the specified actions from the corresponding video according to the start point and the end point of each specified action predicted by the prediction unit 112, and collectively obtain a plurality of video segments.
Further optionally, the prediction unit 112 is specifically configured to:
inputting each image in each group of image sequences into the action detection model, the action detection model predicting the probability that the image is the starting point of the specified action and the probability that it is the ending point of the specified action;
and acquiring, as the starting point of the specified action in the corresponding video, the moment corresponding to the image whose starting-point probability is greater than the preset probability threshold in each group of image sequences, and acquiring, as the ending point of the specified action in the corresponding video, the moment corresponding to the image whose ending-point probability is greater than the preset probability threshold.
Further optionally, as shown in fig. 6, in the video set generation apparatus of this embodiment the clipping module 11 further includes a detection unit 114, configured to:
if the specified entity is specified person information, perform face detection on the images in each group of image sequences extracted by the extraction unit 111 based on the specified person information, and delete the images in each group of image sequences that do not include the specified person;
and if the specified entity is specified animal information, perform feature detection on the images in each group of image sequences extracted by the extraction unit 111 based on the specified animal information, and delete the images in each group of image sequences that do not include the specified animal.
At this time, correspondingly, the prediction unit 112 is configured to predict the starting point and the ending point of the specified action in the corresponding video according to each group of image sequences processed by the detection unit 114 and the action detection model.
The implementation principle and technical effect of the apparatus for generating a video set according to this embodiment that uses the above modules to generate a video set are the same as those of the related method embodiment, and reference may be made to the description of the related method embodiment in detail, which is not described herein again.
Fig. 7 is a block diagram of a third embodiment of the video set generation apparatus according to the present invention. As shown in fig. 7, the video set generation apparatus of this embodiment may further include the following technical solutions on the basis of the technical solution of the embodiment shown in fig. 5.
As shown in fig. 7, the video set generation apparatus of this embodiment further includes a detection module 13, configured to:
if the specified entity is specified person information, perform face detection on the plurality of video segments obtained by the clipping module 11 based on the specified person information, and delete the video segments that do not include the specified person;
and if the specified entity is specified animal information, perform feature detection on the plurality of video segments obtained by the clipping module 11 based on the specified animal information, and delete the video segments that do not include the specified animal.
Correspondingly, the splicing module 12 is configured to splice together the plurality of video segments containing the specified action obtained after processing by the detection module 13, to obtain the video set.
The implementation principle and technical effect of the video set generation device in this embodiment by using the modules are the same as those of the related method embodiment, and details of the related method embodiment may be referred to, and are not described herein again.
Fig. 8 is a block diagram of a fourth embodiment of the video set generation apparatus according to the present invention. As shown in fig. 8, the video set generation apparatus of this embodiment may specifically include:
the acquisition module 14 is configured to acquire a plurality of training video segments including specified actions, and mark real start points and real end points of the specified actions in each training video segment;
the training module 15 is configured to train the action detection model according to the plurality of training video segments acquired by the acquisition module 14 and the real starting point and real ending point of the specified action labeled in each training video segment.
Further optionally, the training module 15 is specifically configured to:
for each training video clip, extracting images of each frame according to the time sequence to obtain a group of training image sequences;
predicting a prediction starting point and a prediction ending point of a designated action in a corresponding training video segment according to each group of training image sequences and the action detection model;
calculating a mean square error loss function according to a real starting point and a predicted starting point, a real ending point and a predicted ending point of a specified action in each training video segment;
if the mean square error loss function does not converge, updating the parameters of the action detection model by gradient descent;
and repeatedly training the action detection model with each training video segment in the above manner until the mean square error loss function converges, then determining the parameters of the action detection model so as to determine the action detection model.
The video set generation apparatus of this embodiment may exist independently, or may be combined with the embodiments of fig. 5, fig. 6 and fig. 7, respectively, to form alternative embodiments of the present invention.
The implementation principle and technical effect of the apparatus for generating a video set according to this embodiment that uses the above modules to generate a video set are the same as those of the related method embodiment, and reference may be made to the description of the related method embodiment in detail, which is not described herein again.
FIG. 9 is a block diagram of an embodiment of a computer device of the present invention. As shown in fig. 9, the computer device of the present embodiment includes: one or more processors 30, and a memory 40, the memory 40 being configured to store one or more programs, which when executed by the one or more processors 30, cause the one or more processors 30 to implement the method for generating a video set as described above in the embodiments of fig. 1-3, when the one or more programs stored in the memory 40 are executed by the one or more processors 30. The embodiment shown in fig. 9 includes a plurality of processors 30 as an example.
For example, fig. 10 is an exemplary diagram of a computer device provided by the present invention. FIG. 10 illustrates a block diagram of an exemplary computer device 12a suitable for use in implementing embodiments of the present invention. The computer device 12a shown in fig. 10 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in FIG. 10, computer device 12a is in the form of a general purpose computing device. The components of computer device 12a may include, but are not limited to: one or more processors 16a, a system memory 28a, and a bus 18a that connects the various system components (including the system memory 28a and the processors 16 a).
The system memory 28a may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30a and/or cache memory 32a. Computer device 12a may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34a may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, and commonly referred to as a "hard drive"). Although not shown in FIG. 10, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18a by one or more data media interfaces. System memory 28a may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of the embodiments of the invention described above in figs. 1-8.
A program/utility 40a having a set (at least one) of program modules 42a may be stored, for example, in system memory 28a, such program modules 42a including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may include an implementation of a network environment. Program modules 42a generally perform the functions and/or methodologies described above in connection with the various embodiments of fig. 1-8 of the present invention.
The processor 16a executes programs stored in the system memory 28a to execute various functional applications and data processing, such as generation of video sets as shown in the above-described embodiments.
The invention also provides a computer-readable medium on which a computer program is stored which, when executed by a processor, enables generation of a video set as shown in the above embodiments.
The computer-readable media of this embodiment may include RAM 30a in system memory 28a, and/or cache memory 32a, and/or storage system 34a in the embodiment illustrated in fig. 10, described above.
With the development of technology, the propagation path of computer programs is no longer limited to tangible media, and the computer programs can be directly downloaded from a network or acquired by other methods. Accordingly, the computer-readable medium in the present embodiment may include not only tangible media but also intangible media.
The computer-readable medium of the present embodiments may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other division manners may be available in actual implementation.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (16)
1. A method for generating a video set, the method comprising:
in response to receiving a request to generate a video set, the request carrying a specified entity and a specified action, acquiring a plurality of videos related to the specified entity based on a pre-established knowledge graph;
clipping a plurality of video segments containing the specified action from the plurality of videos according to a pre-trained action detection model, wherein the plurality of videos are input into the action detection model, the action detection model outputs the video segments containing the specified action in the videos, and one action detection model is used for identifying one specified action;
splicing the plurality of video segments containing the specified action together to obtain the video set;
wherein, after the plurality of video segments containing the specified action are clipped from the plurality of videos according to the pre-trained action detection model, and before the plurality of video segments containing the specified action are spliced together to obtain the video set, the method further comprises: if the specified entity is designated person information, performing face detection on the plurality of video segments based on the designated person information, and deleting the video segments that do not include the designated person information from the plurality of video segments; and if the specified entity is designated animal information, performing feature detection on the plurality of video segments based on the designated animal information, and deleting the video segments that do not include the designated animal information from the plurality of video segments.
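By way of example only, the following Python sketch outlines one possible realization of the method of claim 1; the knowledge-graph query, the detector interface, and the use of the moviepy library are assumptions introduced for illustration and are not prescribed by the claim.

```python
# Illustrative sketch only; knowledge_graph.query_videos() and
# detector.detect() are assumed interfaces, not part of the claim.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def generate_video_set(specified_entity, specified_action, knowledge_graph, detector):
    # Acquire a plurality of videos related to the specified entity.
    video_paths = knowledge_graph.query_videos(specified_entity)
    segments = []
    for path in video_paths:
        # The pre-trained action detection model yields (start, end) times
        # of the specified action within each video.
        for start, end in detector.detect(path, specified_action):
            segments.append(VideoFileClip(path).subclip(start, end))
    # Segment-level filtering (e.g., face detection for a designated person)
    # would be applied to `segments` here before splicing.
    return concatenate_videoclips(segments)
```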
2. The method of claim 1, wherein clipping a plurality of video segments containing a specified action from the plurality of videos according to a pre-trained action detection model comprises:
for each of the videos, extracting the image of each frame in chronological order to obtain a group of image sequences;
predicting a starting point and an ending point of the specified action in the corresponding video according to each group of the image sequences and the action detection model;
and according to the starting point and the ending point of each specified action, clipping the corresponding video segment containing the specified action from the corresponding video to obtain the plurality of video segments.
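For the frame extraction step of claim 2, a minimal sketch using OpenCV might look as follows; OpenCV and the per-frame sampling are assumptions, since the claim does not name a particular library or sampling rate.

```python
import cv2

def extract_image_sequence(video_path):
    """Extract the image of each frame in chronological order, together with
    each frame's timestamp in seconds."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unavailable
    frames, timestamps = [], []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
        timestamps.append(index / fps)
        index += 1
    capture.release()
    return frames, timestamps
```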
3. The method of claim 2, wherein predicting a starting point and an ending point of the specified action in the corresponding video according to each group of the image sequences and the action detection model comprises:
inputting each image in each group of the image sequences into the action detection model, and predicting, by the action detection model, the probability that the corresponding image is the starting point of the specified action and the probability that the corresponding image is the ending point of the specified action;
and acquiring, as the starting point of the specified action in the corresponding video, the time corresponding to the image in each group of image sequences whose probability of being the starting point of the specified action is greater than a preset probability threshold, and acquiring, as the ending point of the specified action in the corresponding video, the time corresponding to the image whose probability of being the ending point of the specified action is greater than the preset probability threshold.
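Claim 3 maps per-image probabilities to time points via a preset threshold; a minimal sketch of that mapping, assuming the model has already produced one start probability and one end probability per frame, could be:

```python
def locate_action(start_probs, end_probs, timestamps, threshold=0.5):
    """Return (start_time, end_time) of the specified action, or None.

    start_probs / end_probs: per-frame probabilities from the detection model.
    threshold: the preset probability threshold (the value 0.5 is illustrative).
    """
    start_time = next((t for p, t in zip(start_probs, timestamps) if p > threshold), None)
    end_time = next((t for p, t in zip(reversed(end_probs), reversed(timestamps))
                     if p > threshold), None)
    if start_time is None or end_time is None or end_time <= start_time:
        return None  # no confident occurrence of the specified action
    return start_time, end_time
```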
4. The method according to claim 2, wherein after extracting, for each of the videos, the image of each frame in chronological order to obtain a group of image sequences, and before predicting the starting point and the ending point of the specified action in the corresponding video according to each group of the image sequences and the action detection model, the method further comprises:
if the specified entity is designated person information, performing face detection on the images in each group of image sequences based on the designated person information, and deleting the images that do not include the designated person information from each group of image sequences;
and if the specified entity is designated animal information, performing feature detection on the images in each group of image sequences based on the designated animal information, and deleting the images that do not include the designated animal information from each group of image sequences.
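For the person-based filtering of claim 4, a rough sketch is given below; the Haar-cascade face detector and the caller-supplied matching function are assumptions, as the claim does not prescribe a specific face detection or recognition technique.

```python
import cv2

# A stock Haar cascade is used purely for illustration.
_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def filter_frames_by_person(frames, matches_designated_person):
    """Keep only the frames in which the designated person is found.

    matches_designated_person: assumed callable that compares a detected face
    region against the designated person information and returns True/False.
    """
    kept = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if any(matches_designated_person(frame[y:y + h, x:x + w])
               for (x, y, w, h) in faces):
            kept.append(frame)
    return kept
```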
5. The method of claim 1, wherein prior to clipping a plurality of video segments containing the specified action from the plurality of videos according to the pre-trained action detection model, the method further comprises:
collecting a plurality of training video segments comprising the specified action, and labeling a real starting point and a real ending point of the specified action in each training video segment;
and training the action detection model according to the plurality of training video segments and the real starting point and the real ending point of the specified action labeled in each training video segment.
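As a concrete illustration of the labeling in claim 5, a labeled training video segment might be recorded as follows; the field names and file names are assumptions for illustration only.

```python
# Hypothetical annotation records: each entry names a training video segment
# and the labeled real starting and ending points (in seconds) of the
# specified action within it.
training_annotations = [
    {"segment": "clip_0001.mp4", "action": "dunk", "start": 2.4, "end": 5.1},
    {"segment": "clip_0002.mp4", "action": "dunk", "start": 0.8, "end": 3.0},
]
```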
6. The method of claim 5, wherein training the action detection model according to the plurality of training video segments and the real starting point and the real ending point of the specified action labeled in each of the training video segments comprises:
for each training video segment, extracting the image of each frame in chronological order to obtain a group of training image sequences;
predicting a predicted starting point and a predicted ending point of the specified action in the corresponding training video segment according to each group of the training image sequences and the action detection model;
calculating a mean square error loss function according to the real starting point and the predicted starting point, and the real ending point and the predicted ending point, of the specified action in each training video segment;
if the mean square error loss function has not converged, updating the parameters of the action detection model by using a gradient descent method;
and repeatedly training the action detection model with each training video segment in this manner until the mean square error loss function converges, thereby determining the parameters of the action detection model and thus the action detection model itself.
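A minimal PyTorch-style sketch of the training loop in claim 6 is shown below; the model interface (one predicted start and one predicted end per segment), the fixed epoch count standing in for a convergence check, and the learning rate are assumptions.

```python
import torch
import torch.nn as nn

def train_action_detector(model, training_data, epochs=10, learning_rate=1e-3):
    """training_data yields (image_sequence, real_start, real_end) per segment."""
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  # gradient descent
    criterion = nn.MSELoss()  # mean square error loss function
    for _ in range(epochs):  # in practice, loop until the loss converges
        for image_sequence, real_start, real_end in training_data:
            optimizer.zero_grad()
            predicted_start, predicted_end = model(image_sequence)
            prediction = torch.stack([predicted_start, predicted_end])
            target = torch.tensor([real_start, real_end], dtype=torch.float32)
            loss = criterion(prediction, target)
            loss.backward()   # backpropagate the MSE loss
            optimizer.step()  # update the model parameters
    return model
```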
7. The method of claim 1, wherein acquiring a plurality of videos related to the specified entity based on a pre-established knowledge graph comprises:
acquiring a plurality of pieces of video information corresponding to the specified entity according to the relations between entities in the knowledge graph;
and acquiring the video corresponding to each piece of video information from a video library to obtain the plurality of videos.
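For the retrieval in claim 7, a toy sketch is given below; representing the knowledge graph as a Python mapping and the field names of the video information records are assumptions made only to illustrate the flow from entity relations to videos.

```python
def acquire_videos_for_entity(knowledge_graph, specified_entity, video_library):
    """Follow entity relations to video information, then load the videos.

    knowledge_graph: assumed mapping entity -> list of (relation, target) pairs,
        where a target of type "video" carries one piece of video information.
    video_library: assumed mapping video_id -> video resource.
    """
    video_infos = [target for _, target in knowledge_graph.get(specified_entity, [])
                   if target.get("type") == "video"]
    return [video_library[info["video_id"]] for info in video_infos]
```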
8. An apparatus for generating a video set, the apparatus comprising:
an acquisition module, configured to acquire a plurality of videos related to a specified entity based on a pre-established knowledge graph in response to receiving a request to generate a video set, the request carrying the specified entity and a specified action;
a clipping module, configured to clip a plurality of video segments containing a specified action from the plurality of videos according to a pre-trained action detection model, wherein the plurality of videos are input into the action detection model, and the video segments containing the specified action are output by the action detection model;
a splicing module, configured to splice together the plurality of video segments including the specified action to obtain the video set;
wherein the apparatus further comprises a detection module configured to: if the specified entity is designated person information, perform face detection on the plurality of video segments based on the designated person information, and delete the video segments that do not include the designated person information from the plurality of video segments; and if the specified entity is designated animal information, perform feature detection on the plurality of video segments based on the designated animal information, and delete the video segments that do not include the designated animal information from the plurality of video segments.
9. The apparatus of claim 8, wherein the clipping module comprises:
an extraction unit, configured to extract, for each of the videos, the image of each frame in chronological order to obtain a group of image sequences;
a prediction unit, configured to predict a starting point and an ending point of the specified action in the corresponding video according to each group of the image sequences and the action detection model;
and a clipping unit, configured to clip, according to the starting point and the ending point of each specified action, the corresponding video segment containing the specified action from the corresponding video to obtain the plurality of video segments.
10. The apparatus of claim 9, wherein the prediction unit is configured to:
input each image in each group of the image sequences into the action detection model, and predict, by the action detection model, the probability that the corresponding image is the starting point of the specified action and the probability that the corresponding image is the ending point of the specified action;
and acquire, as the starting point of the specified action in the corresponding video, the time corresponding to the image in each group of image sequences whose probability of being the starting point of the specified action is greater than a preset probability threshold, and acquire, as the ending point of the specified action in the corresponding video, the time corresponding to the image whose probability of being the ending point of the specified action is greater than the preset probability threshold.
11. The apparatus of claim 9, wherein the clipping module further comprises a detection unit configured to:
if the specified entity is designated person information, perform face detection on the images in each group of image sequences based on the designated person information, and delete the images that do not include the designated person information from each group of image sequences;
and if the specified entity is designated animal information, perform feature detection on the images in each group of image sequences based on the designated animal information, and delete the images that do not include the designated animal information from each group of image sequences.
12. The apparatus of claim 8, further comprising:
a collection module, configured to collect a plurality of training video segments comprising the specified action and label a real starting point and a real ending point of the specified action in each training video segment;
and a training module, configured to train the action detection model according to the plurality of training video segments and the real starting point and the real ending point of the specified action labeled in each training video segment.
13. The apparatus of claim 12, wherein the training module is specifically configured to:
for each training video segment, extract the image of each frame in chronological order to obtain a group of training image sequences;
predict a predicted starting point and a predicted ending point of the specified action in the corresponding training video segment according to each group of the training image sequences and the action detection model;
calculate a mean square error loss function according to the real starting point and the predicted starting point, and the real ending point and the predicted ending point, of the specified action in each training video segment;
if the mean square error loss function has not converged, update the parameters of the action detection model by using a gradient descent method;
and repeatedly train the action detection model with each training video segment in this manner until the mean square error loss function converges, thereby determining the parameters of the action detection model and thus the action detection model itself.
14. The apparatus of claim 8, wherein the acquisition module is configured to:
acquire a plurality of pieces of video information corresponding to the specified entity according to the relations between entities in the knowledge graph;
and acquire the video corresponding to each piece of video information from a video library to obtain the plurality of videos.
15. A computer device, the device comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355708.1A CN110166650B (en) | 2019-04-29 | 2019-04-29 | Video set generation method and device, computer equipment and readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910355708.1A CN110166650B (en) | 2019-04-29 | 2019-04-29 | Video set generation method and device, computer equipment and readable medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110166650A CN110166650A (en) | 2019-08-23 |
CN110166650B true CN110166650B (en) | 2022-08-23 |
Family
ID=67633207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910355708.1A Active CN110166650B (en) | 2019-04-29 | 2019-04-29 | Video set generation method and device, computer equipment and readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110166650B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110650368B (en) * | 2019-09-25 | 2022-04-26 | 新东方教育科技集团有限公司 | Video processing method and device and electronic equipment |
CN113795882B (en) * | 2019-09-27 | 2022-11-25 | 华为技术有限公司 | Emotion-based multimedia content summarization |
CN111246125B (en) * | 2020-01-17 | 2022-11-01 | 广州盈可视电子科技有限公司 | Multi-channel video stream synthesis method and device |
CN111447507B (en) * | 2020-03-20 | 2022-03-22 | 北京百度网讯科技有限公司 | Video production method and device, electronic equipment and storage medium |
CN111242110B (en) * | 2020-04-28 | 2020-08-14 | 成都索贝数码科技股份有限公司 | Training method of self-adaptive conditional random field algorithm for automatically breaking news items |
CN111784081B (en) * | 2020-07-30 | 2022-03-01 | 南昌航空大学 | Social network link prediction method adopting knowledge graph embedding and time convolution network |
CN112182289B (en) * | 2020-10-10 | 2023-04-28 | 武汉中科通达高新技术股份有限公司 | Data deduplication method and device based on Flink frame |
CN112559758A (en) * | 2020-11-30 | 2021-03-26 | 北京百度网讯科技有限公司 | Method, device and equipment for constructing knowledge graph and computer readable storage medium |
CN112801861A (en) * | 2021-01-29 | 2021-05-14 | 恒安嘉新(北京)科技股份公司 | Method, device and equipment for manufacturing film and television works and storage medium |
CN117880588A (en) * | 2023-11-27 | 2024-04-12 | 无锡伙伴智能科技有限公司 | Video editing method, device, equipment and storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10063992B2 (en) * | 2014-01-23 | 2018-08-28 | Brian M. Dugan | Methods and apparatus for news delivery |
US20180300557A1 (en) * | 2017-04-18 | 2018-10-18 | Amazon Technologies, Inc. | Object analysis in live video content |
CN107066621B (en) * | 2017-05-11 | 2022-11-08 | 腾讯科技(深圳)有限公司 | Similar video retrieval method and device and storage medium |
CN108769733A (en) * | 2018-06-22 | 2018-11-06 | 三星电子(中国)研发中心 | Video clipping method and video clipping device |
CN109104642A (en) * | 2018-09-26 | 2018-12-28 | 北京搜狗科技发展有限公司 | A kind of video generation method and device |
CN109635157B (en) * | 2018-10-30 | 2021-05-25 | 北京奇艺世纪科技有限公司 | Model generation method, video search method, device, terminal and storage medium |
CN109598229B (en) * | 2018-11-30 | 2024-06-21 | 李刚毅 | Monitoring system and method based on action recognition |
- 2019-04-29: CN CN201910355708.1A patent/CN110166650B/en — Active
Also Published As
Publication number | Publication date |
---|---|
CN110166650A (en) | 2019-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110166650B (en) | Video set generation method and device, computer equipment and readable medium | |
JP7123122B2 (en) | Navigating Video Scenes Using Cognitive Insights | |
CN107832299B (en) | Title rewriting processing method and device based on artificial intelligence and readable medium | |
CN108986186B (en) | Method and system for converting text into video | |
Wang et al. | Write-a-video: computational video montage from themed text. | |
CN113709561B (en) | Video editing method, device, equipment and storage medium | |
US10192583B2 (en) | Video editing using contextual data and content discovery using clusters | |
CN112015859A (en) | Text knowledge hierarchy extraction method and device, computer equipment and readable medium | |
KR20170094191A (en) | Localization complexity of arbitrary language assets and resources | |
CN113811884A (en) | Retrieval aggregation of cognitive video and audio | |
Islam et al. | Exploring video captioning techniques: A comprehensive survey on deep learning methods | |
CN114254158B (en) | Video generation method and device, and neural network training method and device | |
CN109817210A (en) | Voice writing method, device, terminal and storage medium | |
US11653071B2 (en) | Responsive video content alteration | |
CN112732949A (en) | Service data labeling method and device, computer equipment and storage medium | |
CN115801980A (en) | Video generation method and device | |
CN115145568A (en) | Code generation method based on image recognition and related equipment | |
US20230367972A1 (en) | Method and apparatus for processing model data, electronic device, and computer readable medium | |
CN111524043A (en) | Method and device for automatically generating litigation risk assessment questionnaire | |
CN109522451B (en) | Repeated video detection method and device | |
CN117173497B (en) | Image generation method and device, electronic equipment and storage medium | |
CN111883101B (en) | Model training and speech synthesis method, device, equipment and medium | |
CN108255917A (en) | Image management method, equipment and electronic equipment | |
CN114118068B (en) | Method and device for amplifying training text data and electronic equipment | |
CN115310582A (en) | Method and apparatus for training neural network models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |