CN113079420A - Video generation method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN113079420A
CN113079420A (application CN202010006953.4A)
Authority
CN
China
Prior art keywords
video
segment
target
candidate
level
Prior art date
Legal status
Pending
Application number
CN202010006953.4A
Other languages
Chinese (zh)
Inventor
贾汉超
马路漫
王晓冰
姜映映
朱博
Current Assignee
Beijing Samsung Telecom R&D Center
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd and Samsung Electronics Co Ltd
Priority to CN202010006953.4A (CN113079420A)
Priority to KR1020200058449A (KR20210087861A)
Priority to PCT/KR2021/000010 (WO2021137671A1)
Priority to US17/140,732 (US20210210119A1)
Publication of CN113079420A
Legal status: Pending

Classifications

    • H04N21/854 Content authoring
    • H04N21/8456 Structuring of content by decomposing the content in the time domain, e.g. in time segments
    • G06N3/08 Neural networks; learning methods
    • G06T7/20 Image analysis; analysis of motion
    • H04N21/44008 Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N5/144 Picture signal circuitry for the video frequency region; movement detection

Abstract

Embodiments of the present application provide a video generation method and apparatus, an electronic device, and a computer-readable storage medium, wherein the method includes the following steps: extracting an intention feature of a video generation request; and generating a target video based on the intention feature and candidate videos. Based on the solution provided by the embodiments of the present application, a target video that better reflects the real intention of the user can be obtained, the actual requirements of the user can be better met, and the user experience is improved.

Description

Video generation method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video generation method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of science and technology and the improvement of people's living standards, film cameras have gradually fallen out of use, and users are increasingly accustomed to taking photos and videos with terminal devices such as mobile phones. However, this convenient shooting mode also brings some problems: because shooting is relatively casual, a large amount of video content is repetitive, a specific desired video is not easy to find, and a video often contains many useless clips that need to be cut out.
To meet users' application requirements, some products with a video generation function have appeared, but the videos generated by current video generation approaches still cannot meet user requirements and need to be further optimized.
Disclosure of Invention
An object of the embodiments of the present application is to address at least one of the above technical drawbacks, and in particular the need to further optimize the video generation methods in the prior art. The solution provided by the embodiments of the present application is as follows:
in a first aspect, an embodiment of the present application provides a video generation method, where the method includes:
extracting an intention feature of the video generation request;
and generating a target video based on the intention feature and candidate videos.
In a second aspect, an embodiment of the present application provides a video generating apparatus, including:
an intention extraction module, configured to extract an intention feature of the video generation request;
and a video generation module, configured to generate a target video based on the intention feature and candidate videos.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor; wherein the memory has stored therein a computer program; the processor is used for executing the method provided by the embodiment of the application when the computer program runs.
In a fourth aspect, the present application provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the method provided by the present application is performed.
The beneficial effects brought by the technical solution provided by the present application are as follows: according to the solution provided by the embodiments of the present application, when a target video is generated based on a user's video generation request, an intention feature of the video generation request is extracted and the target video is generated based on the intention feature and the candidate videos. Because the user's intention is taken into account when generating the target video, the generated target video better reflects the user's actual intention, better meets the user's actual requirements, and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart illustrating a video generation method according to an embodiment of the present application;
fig. 2 shows a schematic flow chart of a video generation method provided in an example of the present application;
FIG. 3 illustrates a flow diagram of a screening of video segments provided in an example of the present application;
FIG. 4 is a schematic diagram illustrating a process for obtaining a video clip of a person according to an example of the present application;
fig. 5 is a schematic diagram illustrating a flow of acquiring an action video segment according to an example of the present application;
FIG. 6 is a diagram illustrating a Gaussian distribution of durations for two levels provided in an example of the present application;
FIG. 7 illustrates a flow chart for screening results of an action recommendation provided in an example of the present application;
FIG. 8 shows a schematic diagram provided in an example of the present application to illustrate the principle of action suggestion result screening;
FIG. 9 is a flow chart illustrating a screening process for action suggestion results provided in an example of the present application;
FIG. 10 is a diagram illustrating information related to an action video clip provided in an example of the present application;
FIG. 11 illustrates a flow chart for video segment screening based on intent characteristics provided in an example of the present application;
FIG. 12a shows a flow diagram of a calculation of relevance provided in an example of the present application;
FIG. 12b shows a schematic diagram of the correlation between feature vectors of a type provided in an example of the present application;
FIG. 12c shows a schematic diagram of the correlation between feature vectors provided in another example of the present application;
FIG. 13 is a schematic diagram illustrating a screening process for video clips provided in an example of the present application;
FIG. 14 is a schematic diagram illustrating a screening process of a video clip according to an example of the present application;
FIG. 15 illustrates a schematic diagram of a video segment fusion provided in an example of the present application;
FIG. 16 is a flow chart illustrating video clip generation for an application scenario in an example of the present application;
FIG. 17 shows a flow diagram of video clip generation for another application scenario in an example of the present application;
FIG. 18 illustrates a flow diagram of a video generation method based on an attention mechanism provided in an example of the present application;
FIG. 19 is a flow chart illustrating a method for determining attention weights blended into a user's intent provided in an example of the present application;
fig. 20 is a schematic structural diagram illustrating a video generating apparatus according to an embodiment of the present application;
fig. 21 shows a schematic structural diagram of an electronic device suitable for use in embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are only intended to explain the present application, and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
With the rapid development of science and technology (especially artificial intelligence and computer vision technologies) and the improvement of people's living standards, more and more intelligent data processing modes have begun to appear in people's daily lives. Automatic video generation is one of them, such as generating a new video based on a plurality of videos selected by a user, or generating a new video based on some segments of one or more videos selected by a user. Through research, the inventors of the present application have found that, although some products on the market provide simple video retrieval and video generation functions, existing video generation schemes have at least the following points to be improved:
most video generation methods simply select some pictures or videos at random, and then connect and synthesize the selected pictures or videos according to some simple rules (such as according to a time sequence), without considering content information of the videos, so that the generated videos have poor continuity and are unnatural, and the requirements of users are difficult to meet. On the other hand, in the existing video generation method, the user intention is not considered in the process of generating the video, the user cannot generate the video which the user wants, namely, the user cannot specify the theme of the video which the user wants to generate and the like, so that the video generated by the existing method is single in content and poor in richness
To solve at least one of the problems existing in the prior art, the present application provides a new video generation method; based on various optional implementations of the method, videos that are more interesting, diverse, and flexible and that better meet users' requirements can be generated.
In order to make the objects, technical solutions and advantages of the present application clearer, various alternative embodiments of the present application and how the technical solutions of the embodiments of the present application solve the above technical problems will be described in detail below with reference to specific embodiments and drawings. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a schematic flowchart of a video generation method provided in an embodiment of the present application, and as shown in the diagram, the method mainly includes the following steps:
step S110: acquiring a video generation request;
the video generation request may include a request for what video the user wants to generate, that is, the request carries the user's intention. Specifically, for example, the video generation request may include the type of the video that the user wants to generate, the subject of the video that wants to generate, or the object to which the video that wants to generate is directed, and the request may also include the duration of the video that the user wants to generate.
As an example, if the video generation request is "I want to generate a 5-minute basketball-playing highlight video" or "please help me generate a 5-minute basketball-playing highlight video", the theme of the video the user wants to generate is "basketball", and the video duration is 5 minutes. As another example, if the video generation request is "I want to generate a highlight video of a child riding a bike", the request indicates that the object the video is directed to is "child", and the theme of the video is "riding a bike".
Step S120: extracting an intention feature of the video generation request;
Specifically, the intention feature may be extracted by a neural network, such as an RNN (Recurrent Neural Network).
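As an illustration of extracting an intention feature with an RNN, the following sketch encodes the tokenized request with a GRU; the vocabulary size, dimensions, and class names are assumptions for illustration, not part of the application.

```python
# Minimal sketch (not the patent's implementation): encoding a video generation
# request into an intention feature vector with a GRU.
import torch
import torch.nn as nn

class IntentEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)         # (batch, seq_len, embed_dim)
        _, last_hidden = self.rnn(embedded)      # (1, batch, hidden_dim)
        return last_hidden.squeeze(0)            # intention feature: (batch, hidden_dim)

# Example: a tokenized request such as "generate a 5-minute basketball highlight video"
request_tokens = torch.randint(0, 10000, (1, 8))
intent_feature = IntentEncoder()(request_tokens)  # shape: (1, 256)
```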
Step S130: generating a target video based on the intention feature and the candidate videos.
In this embodiment of the present application, the candidate videos may be videos specified by the user, or the executing entity of the method may be authorized to acquire all available videos; that is, acquiring the video generation request triggers an operation of acquiring candidate videos, in which case all videos that can be acquired may be used as the candidate videos.
The specific way in which the user specifies the candidate videos is not limited in this embodiment of the present application. For example, the user may specify the capture time periods of the videos to be used (the videos may be captured by the user or obtained from another device), or may directly select the candidate videos. As an alternative, the user's video generation request may include indication information of the candidate videos (such as the capture time period of the videos), and after the video generation request is acquired, the videos in the corresponding time period may be used as the candidate videos based on the indication information. Alternatively, after the user's video generation request is acquired, all videos that are authorized to be acquired may be displayed to the user, and the user may specify candidate videos from the displayed videos.
In practical applications, when the user does not specify which videos are candidate videos, the number of acquirable videos may be large; for example, the acquirable videos may be all videos stored on the user's terminal device or in a cloud storage space (such as a cloud disk), and a considerable part of them may be unrelated to the video the user wants to generate. Therefore, to reduce the number of candidate videos and the amount of subsequent data processing, the acquired videos may first be screened, and the screened videos are used as the candidate videos.
In an optional embodiment of the present application, the candidate videos may specifically be candidate videos obtained based on the intention feature of the video generation request, and/or candidate videos obtained based on keywords and/or key phrases in the video generation request.
Specifically, as an alternative, when video screening is performed, videos may be filtered based on the intention feature of the user's video generation request. For example, the video features of each acquired video may be extracted, and irrelevant videos may be filtered out based on the relevance between the intention feature and the video features, e.g., videos whose video features have a relevance to the intention feature smaller than a set value are filtered out. The set value can be configured according to actual requirements: the larger the set value, the more videos are filtered out, the fewer candidate videos are obtained, and the smaller the subsequent data processing amount; correspondingly, the smaller the set value, the fewer videos are filtered out. However, if the set value is too large, some videos or video segments that the user actually wants may be filtered out; therefore, as an alternative, a relatively small set value may be used when filtering the videos.
As another alternative, all acquirable videos may be filtered based on certain keywords or key phrases in the video generation request. Since a user generally initiates a video generation request for some specific content (which may be an object, an event, a time, etc.), the candidate videos may be obtained by extracting keywords and/or key phrases from the request and filtering the videos based on them.
Of course, besides the two optional modes, some other video screening modes may be configured according to actual application scenarios or requirements, and different screening modes may be used alone or in combination with each other.
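As a minimal sketch of the relevance-based pre-screening described above, the following keeps only videos whose feature vector is sufficiently relevant to the intention feature; the relevance measure (cosine similarity) and the threshold are assumptions.

```python
# Illustrative sketch only: filter acquired videos by their relevance to the
# intention feature; videos below the set value are dropped.
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def screen_candidate_videos(video_features, intent_feature, threshold=0.2):
    """video_features: dict mapping video_id -> feature vector (np.ndarray)."""
    return [vid for vid, feat in video_features.items()
            if cosine_similarity(feat, intent_feature) >= threshold]
```

A relatively small threshold keeps more videos, matching the remark above that an overly strict set value risks discarding useful segments.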
In the video generation method described above, the user's intention is taken into account when generating the video, so the generated target video better reflects the user's actual intention, better meets the user's actual requirements, and improves the user experience.
It should be noted that the video generation method provided in the embodiments of the present application is executed by an electronic device; specifically, it may be executed by a user terminal device or by a server. For example, in an application having a video generation function, the method may be executed by the server of the application. The user may initiate a video generation request through the user interface of the application installed on his or her terminal device; after receiving the request, the terminal device sends it to the server. Based on the video generation method provided in the embodiments of the present application, the server then generates a target video from the videos it stores or acquires from other storage spaces (such as an external storage device, a cloud storage space, or the user terminal device) according to the intention feature of the user's video generation request. After the target video is generated, the server transmits the target video corresponding to the user's generation intention to the user terminal device so that the terminal device can play and/or save the video. When the method is executed by the user terminal device, the terminal device may extract the corresponding intention feature based on the acquired video generation request and generate a target video corresponding to the user's actual intention based on the extracted intention feature and videos stored on the terminal device and/or acquired from other storage spaces.
In the following description, the video generation method is described taking the user terminal device as the executing entity and the videos stored on the user terminal device as an example. As can be seen from the foregoing description, however, the videos from which the target video is generated may be stored on the terminal device, on an external storage device, in the cloud, or in another storage space, and may be any videos that the terminal device or the server can acquire with authorization.
In alternative embodiments of the present application, the intention feature may specifically include an action behavior intention feature.
To generate a target video that better satisfies the user, when extracting the user's intention feature, deep features may be extracted from the user's video generation request to obtain the action behavior intention feature contained in the request, that is, what action behaviors the video the user wants to generate should relate to, so that a target video that better meets the requirements is generated based on the action behavior intention feature and the candidate videos.
In an alternative embodiment of the present application, generating the target video based on the intention features and the candidate videos includes:
respectively extracting the video features of each candidate video;
determining candidate video segments in each candidate video based on the video features of each candidate video;
screening the candidate video segments based on the intention feature to obtain target video segments;
and generating the target video based on the target video segments.
Specifically, when the target video is generated based on the intention feature and the candidate videos, candidate video segments may first be screened out of the candidate videos based on the video features of the candidate videos, and target video segments may then be screened out of the candidate video segments based on the intention feature. This approach segments each candidate video; compared with directly using one or more whole videos as the target video, it uses video segments screened out based on video features as the candidate video segments for generating the target video. In this way, video segments with more varied forms and contents can be obtained, so that the subsequently generated target video can be more interesting and attractive.
The way of determining candidate video segments from a candidate video based on its video features can be configured as required. For example, segments with very similar content in the candidate video may be filtered out according to the similarity between video segments, that is, similar segments are de-duplicated; object recognition may also be performed on the video according to an object (such as a person) contained in the candidate video, and the segments containing the object are screened out.
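The de-duplication of similar segments mentioned above could, for example, be sketched as follows; the similarity measure and the threshold are assumptions for illustration.

```python
# Hedged sketch of similarity-based de-duplication: a segment is dropped when
# its feature is nearly identical to that of a segment already kept.
import numpy as np

def _cos(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def deduplicate_segments(segment_features, similarity_threshold=0.95):
    """segment_features: list of (segment_id, feature_vector) in temporal order."""
    kept = []
    for seg_id, feat in segment_features:
        if all(_cos(feat, kept_feat) < similarity_threshold for _, kept_feat in kept):
            kept.append((seg_id, feat))
    return [seg_id for seg_id, _ in kept]
```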
In the embodiment of the present application, the video feature may specifically include a visual feature and/or an optical flow feature of the video.
The visual features of a video generally refer to features obtained by feature extraction of the visual content of the images in the video, that is, features that can characterize the visual content of the images, where the visual content may include, but is not limited to, pixel values, color information, objects, and the like. Optical flow, on the other hand, arises from the movement of foreground objects in the scene and/or the movement of the camera: when a person's eyes observe a moving object, the object forms a series of continuously changing images on the retina, and this series of changing information constantly "flows" through the retina (i.e., the image plane) like a flow of light, hence the name optical flow. An optical flow feature is a feature that can reflect the change information of the objects in the images.
The visual features and the optical flow features can be extracted by neural networks; for example, the visual features can be extracted by a convolutional neural network, and the optical flow features can be extracted by FlowNet. Optionally, when the visual features are extracted through a neural network, a 3D-CNN (3-Dimensional Convolutional Neural Network) may be adopted; the visual features extracted through the 3D-CNN can characterize not only the video content but also the temporal relationship between different frames of the video.
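For illustration only: the application names a 3D-CNN for visual features and FlowNet for optical flow; in the sketch below a small 3D convolution stands in for the 3D-CNN, and OpenCV's Farneback algorithm stands in for FlowNet. All layer sizes are assumptions.

```python
import cv2
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Toy stand-in for the 3D-CNN visual feature extractor."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Conv3d(3, 32, kernel_size=3, padding=1)  # input: (B, 3, T, H, W)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(32, out_dim)

    def forward(self, clip):
        x = torch.relu(self.conv(clip))
        x = self.pool(x).flatten(1)              # (B, 32)
        return self.fc(x)                        # visual feature: (B, out_dim)

def farneback_flow(prev_gray, next_gray):
    """Dense optical flow between two consecutive grayscale frames (FlowNet stand-in)."""
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

# Example: a 16-frame RGB clip at 112x112 resolution.
visual_feature = Tiny3DCNN()(torch.randn(1, 3, 16, 112, 112))  # shape: (1, 256)
```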
As an optional mode, the video features in the embodiments of the present application may include both the visual features and the optical flow features, so that the content of the video and the change information of that content can be reflected more comprehensively through different features. Video segments that better match the user's intention and are rich in content can then be screened out based on these video features, and a better target video is generated.
In an optional embodiment of the present application, for a candidate video, determining a candidate video segment in the candidate video based on a video feature of the candidate video may specifically include at least one of the following:
performing object recognition on the candidate video based on its video features to obtain first candidate video segments containing a target object;
and performing action behavior recognition on the candidate video based on its video features to obtain the period information corresponding to each video segment containing an action in the candidate video, and obtaining second candidate video segments based on the period information.
When performing object recognition, the recognized target object may be a designated object or a generic object; for example, the object may be a person in general, a certain class of persons, or a specific person, such as a man, a woman, or a child. In addition, in practical applications, if the video generation request includes information about a designated object, then when the first candidate video segments containing the target object are obtained based on the video features of the candidate video, the candidate video segments may be screened based on the video features and the object information contained in the video generation request, so as to screen out the video segments containing the target object relevant to the user's requirement. If the video generation request does not include information about a designated object, for example, if the request is "please help me generate a basketball highlight video", object recognition can be performed on the candidate videos based on the video features to obtain candidate video segments each containing a target object (an object in the general sense, such as a person).
Recognizing the action behaviors of a video based on its video features can be understood as analyzing the action behaviors that may exist in the video; based on the different action behaviors that may exist, the candidate video is divided into one or more (including two) video segments, and the period information of each segment, that is, the duration of the action behavior that may exist in a video segment, is obtained. The period information may specifically be the time range of a video segment in the candidate video, a start time and an end time, a start time and a segment duration, or an end time and a segment duration. For example, if a video segment spans seconds 10 to 15 of the candidate video, the period information may be the time range from second 10 to second 15, may include the start time (second 10) and the end time (second 15), may be the start time (second 10) and the duration (5 seconds), may be the duration (5 seconds) and the end time (second 15), or may be any other configured representation of period information.
Since a piece of period information reflects the time information of a video segment in which an action behavior may exist in the candidate video, the period information may also be referred to as an action suggestion result, or action suggestion, for the video segment, i.e., a suggested time period for a video segment in which an action behavior may exist.
For a candidate video, after the action suggestion result of each video segment in the video is obtained, the second candidate video segments can be selected based on certain rules, for example, based on the duration of each action suggestion result (which can be determined from the period information). The selection may also be performed based on the intention feature of the video generation request and the segment features of the video segments corresponding to the action suggestion results; for example, based on the correlation between the intention feature and the segment feature of the video segment corresponding to each action suggestion result, video segments with higher correlation (for example, correlation greater than a set value) are selected as the second candidate video segments.
Specifically, when motion behavior recognition is performed on the candidate video based on the video characteristics of the candidate video to obtain information of each time period, the motion behavior recognition can be realized through a neural network.
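The period information described above can be represented in several interchangeable ways. The following minimal sketch illustrates one such representation; the class and field names are assumptions, not taken from the application.

```python
from dataclasses import dataclass

@dataclass
class ActionSuggestion:
    start: float          # start time in the candidate video, in seconds
    end: float            # end time, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start

    @classmethod
    def from_start_and_duration(cls, start: float, duration: float):
        return cls(start, start + duration)

    @classmethod
    def from_end_and_duration(cls, end: float, duration: float):
        return cls(end - duration, end)

# The segment spanning seconds 10-15 of a candidate video, expressed three ways:
a = ActionSuggestion(10.0, 15.0)
b = ActionSuggestion.from_start_and_duration(10.0, 5.0)
c = ActionSuggestion.from_end_and_duration(15.0, 5.0)
assert a == b == c and a.duration == 5.0
```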
In an optional embodiment of the present application, if the target video segments include a first candidate video segment and a second candidate video segment that belong to the same candidate video, generating the target video based on the target video segments includes:
fusing the first candidate video segment and the second candidate video segment that belong to the same candidate video among the target video segments;
and generating the target video based on the fused video segments.
For the same candidate video, the first candidate video segment and the second candidate video segment are segments of that video but of two different types: one is a video segment containing an object (which may be referred to simply as an object segment), and the other is a video segment corresponding to a possible action (which may be referred to simply as an action segment). Some or all of the segments of these two types may therefore be fused to obtain video segments corresponding to the object's actions. Of course, the candidate video segments that are not fused are also used in the process of generating the target video; that is, if the fusion operation is performed, both the fused video segments and the un-fused candidate video segments may be used as candidate video segments for generating the target video.
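A minimal sketch of the fusion step, assuming that an object segment and an action segment from the same candidate video are merged when their time ranges overlap; the overlap rule and the union-style merge are assumptions for illustration.

```python
def fuse_segments(object_segments, action_segments):
    """Each segment is a (start, end) tuple in seconds, from the same video."""
    fused, used = [], set()
    for o_start, o_end in object_segments:
        for idx, (a_start, a_end) in enumerate(action_segments):
            if o_start < a_end and a_start < o_end:        # time ranges overlap
                fused.append((min(o_start, a_start), max(o_end, a_end)))
                used.add(idx)
    leftover = [seg for i, seg in enumerate(action_segments) if i not in used]
    return fused, leftover   # un-fused action segments remain candidates too
```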
In an optional embodiment of the present application, obtaining the second candidate video segments based on the period information may specifically include:
determining the segment duration of the video segment corresponding to each piece of period information;
determining the level of each piece of period information based on its corresponding segment duration;
and determining the second candidate video segments from the video segments corresponding to the period information based on the levels of the period information.
That is, after the action suggestion results in a candidate video are determined, the suggestion results may be assigned levels according to the segment duration corresponding to each suggestion result, and the second candidate video segments are screened from the video segments based on the levels corresponding to all action suggestion results. The levels are divided based on duration, and the way the levels are divided can be configured as required. Generally, the segment duration of an action suggestion result with a higher level is not less than that of an action suggestion result with a lower level; for example, if level 1 is the highest level (that is, the larger the number of the level, the lower the level), the segment duration of an action suggestion result belonging to level 1 is not less than that of an action suggestion result belonging to level 2.
In practical applications, all video segments corresponding to the action suggestion results could be used as candidate video segments without assigning levels to the action suggestion results. However, determining the candidate video segments based on the levels of the action suggestion results filters out a part of the video segments, which reduces the subsequent data processing amount and improves the generation efficiency of the target video.
The specific way of determining candidate video segments based on the levels corresponding to the action suggestion results may be configured according to actual application requirements. For example, the video segments corresponding to one or more higher levels of action suggestion results may be determined as candidate video segments, or a certain number of video segments may be taken from every level; the numbers taken from different levels may be the same or different. For example, if 3 video segments are taken from each level as candidate video segments, then 3 video segments may be selected from the video segments of each level randomly or in a preconfigured manner (for example, preferring longer durations); of course, if the number of video segments of a certain level is less than 3, for example only 2 video segments belong to that level, then these 2 may be used as candidate video segments.
In an optional embodiment of the present application, determining the second candidate video segments from the video segments corresponding to the period information based on the levels of the period information includes:
determining a target level among the levels;
and determining the video segments corresponding to the period information belonging to the target level as the second candidate video segments.
That is, when determining candidate video segments based on the levels, one or more target levels may be determined first, and the video segments whose action suggestion results are of the target level are taken as candidate video segments. The way of determining the target level may be configured according to actual application requirements; for example, one or more higher levels may be determined as the target level, or the level to which the action suggestion results whose durations fall within a certain duration range belong may be determined as the target level.
In an optional embodiment of the present application, if the video generation request includes the video duration of the target video, determining the target level among the levels based on the level of each action suggestion result includes:
determining the target level among the levels based on the video duration and the duration threshold corresponding to each level.
That is to say, each level may correspond to a duration threshold, and the target level may be determined according to the video duration of the target video to be generated and the duration threshold corresponding to each level, so that the generated target video better meets the actual requirement. Likewise, the specific way of determining the target level may be configured as required; for example, one or more levels whose duration thresholds are closest to the video duration may be determined as target levels.
The duration threshold corresponding to each level may be determined according to empirical values and/or experimental values (e.g., values obtained from a large amount of sample data); for example, based on a large number of samples, the average of the durations of the action suggestion results of the same level in the samples may be used as the duration threshold corresponding to that level (in the following description, the duration threshold corresponding to a level may be referred to as the average duration of the level). Generally, the duration threshold corresponding to a higher level is greater than that corresponding to a lower level; for example, the duration threshold corresponding to level 1 is 10 s and that corresponding to level 2 is 5 s.
In an optional embodiment of the present application, for a piece of period information, determining its level based on the corresponding segment duration may include at least one of the following:
determining the duration interval to which the segment duration corresponding to the period information belongs, and determining the level corresponding to that duration interval as the level of the period information, wherein each level corresponds to its own duration interval;
and determining the level corresponding to the duration threshold closest to the segment duration corresponding to the period information as the level of the period information.
This optional embodiment of the present application provides two alternative schemes for determining the level of an action suggestion result. Specifically, in one scheme, a duration interval is configured for each level, and for each action suggestion result, the level of the duration interval to which its segment duration belongs is determined as its level; in the other scheme, the level corresponding to the duration threshold closest to the segment duration of the action suggestion result is directly determined as the level of that action suggestion result.
Likewise, the duration interval and the duration threshold corresponding to each level may be determined based on experimental and/or empirical values. As an alternative, the duration interval of a level may be determined based on a large number of samples, for example from the average (i.e., the average duration) and the standard deviation of the segment durations of the action suggestion results of that level in the samples, according to experience and statistical results.
In an optional embodiment of the present application, two adjacent levels correspond to a common transition duration interval. For a piece of period information, determining the duration interval to which its segment duration belongs and determining the level corresponding to that duration interval as the level of the period information may specifically include:
if the segment duration corresponding to the period information falls within the transition duration interval of two adjacent levels, determining both of the two adjacent levels corresponding to that transition duration interval as levels of the period information.
That is, a common duration region, i.e., the transition duration interval, may be configured for two adjacent levels, and if the segment duration of a certain action suggestion result (i.e., of a piece of period information) falls within this common duration interval, the action suggestion result may belong to both levels at the same time.
Setting a transition duration interval thus allows one action suggestion result to belong to two levels at the same time, which has at least the following two advantages:
1) The distinction between levels is not too abrupt, the transition is natural, and the division better matches human cognition. For each video segment in a video, the first few frames of a segment may be related to the last few frames of the previous segment, and the last few frames of the segment may be related to the first few frames of the next segment; that is, one video segment may contain more than one potential action, and there may be transition video frames containing different actions. Setting a transition duration interval therefore makes the division of the levels of the action suggestion results corresponding to the video segments better match the actual situation.
2) The content diversity of the video segments is enhanced. As can be seen from the foregoing description, the duration of a higher-level action suggestion result is generally longer than that of a lower-level action suggestion result; with the transition duration interval, the video segments corresponding to a higher-level action suggestion result may include relatively long segments and may also cover a small number of shorter segments. Similarly, in practical applications transition regions are likely to exist between different segments of the same video, and it is also possible that, due to a shot transition when the video was recorded, a certain action is switched to another scene so that no action appears in the subsequent images, although the action actually continues. When candidate video segments are determined based on the levels, since one level can contain both relatively long segments and relatively short segments, more forms and richer contents of video segments can be covered, so that the types and contents of the candidate video segments are richer and more diverse, and the subsequently generated target video is more interesting and attractive.
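As an illustration of how levels with transition duration intervals might be assigned, the following sketch derives each level's duration interval from an assumed mean and standard deviation per level and lets adjacent intervals overlap; the statistics and the interval width (mean +/- one standard deviation) are assumptions, not values from the application.

```python
# level: (average duration in seconds, standard deviation); illustrative numbers only
LEVEL_STATS = {1: (10.0, 3.0), 2: (6.0, 2.0), 3: (2.0, 1.0)}

def level_intervals(stats):
    """Duration interval per level; adjacent intervals may overlap and the
    overlapping region plays the role of the transition duration interval."""
    return {level: (mean - std, mean + std) for level, (mean, std) in stats.items()}

def levels_of(duration, intervals):
    """A segment duration falling in a transition region belongs to both levels."""
    return [level for level, (low, high) in intervals.items() if low <= duration <= high]

intervals = level_intervals(LEVEL_STATS)
print(levels_of(7.5, intervals))   # 7.5 s lies in the region shared by levels 1 and 2 -> [1, 2]
```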
In an optional embodiment of the present application, determining the target level among the levels based on the video duration and the duration threshold corresponding to each level includes:
determining a video limit duration based on the video duration;
and determining the target level among the levels according to the video limit duration and the duration threshold corresponding to each level.
Since the finally generated target video may include one or more candidate video segments (first candidate video segments or second candidate video segments), in order to satisfy the video duration of the target video specified in the video generation request, the duration of a candidate video segment ultimately used to generate the target video should not be greater than that video duration. Therefore, a video limit duration may be determined based on the video duration, and the target level may be determined based on this limit duration.
It is to be understood that, since the duration of a candidate video segment should not be greater than the video duration of the target video, the video limit duration should not be greater than the video duration.
As an alternative, a duration adjustment factor n may be configured. Assuming that the video duration of the target video is T', the video limit duration T may be expressed as T = T'/n, where n is a positive number not less than 1; n may be a positive integer not less than 1, or may be a non-integer.
In an optional embodiment of the present application, the levels are ordered from high to low, and the duration threshold corresponding to the current level is not less than that corresponding to the next level. Determining the target level among the levels according to the video limit duration and the duration threshold corresponding to each level includes at least one of the following mode one and mode two:
Mode one:
comparing the video limit duration with the duration threshold corresponding to the current level in order of the levels from high to low, until the video limit duration is not less than the duration threshold corresponding to the current level, and determining that current level as the target level.
Mode two:
performing the following processing in order of the levels from high to low until a target level is determined:
if the video limit duration is not less than the duration threshold corresponding to the current level, determining the current level as the target level;
if the video limit duration is less than the duration threshold corresponding to the current level, determining the target level or entering the processing of the next level according to a first number, i.e., the number of pieces of period information belonging to the current level whose segment durations are not greater than the video limit duration, and a second number, i.e., the number of pieces of period information belonging to the next level whose segment durations are not greater than the video limit duration.
Correspondingly, determining the video segments corresponding to the period information belonging to the target level as the second candidate video segments includes:
determining the video segments corresponding to the pieces of period information belonging to the target level whose segment durations are less than the video limit duration as the second candidate video segments.
In mode one, the first level whose duration threshold is not greater than the video limit duration is determined as the target level, in order of the levels from high to low. In mode two, when the duration threshold corresponding to the current level is not greater than the video limit duration, the current level is determined as the target level; when it is greater than the video limit duration, whether the current level is determined as the target level or the processing moves on to the next level is decided according to the number of action suggestion results in the current level whose durations are not greater than the video limit duration and the corresponding number in the next level.
Optionally, for the scheme in the second mode, determining the target level or entering the next level of processing according to the first number and the second number may include:
if the first number is not less than the second number, determining the current level as the target level;
if the first number is smaller than the second number and the next level is the last level, determining the next level as the target level;
if the first number is less than the second number and the next level is not the last level, processing of the next level is entered.
Specifically, when the duration threshold corresponding to the current level is greater than the video limit duration, if the number of action suggestion results belonging to the current level whose segment durations are not greater than the video limit duration is not less than the corresponding number in the next level, this indicates that the current level alone can already provide a sufficient number of action suggestion results meeting the duration requirement, that is, a sufficient number of video segments meeting the duration requirement; therefore, the current level can be determined as the target level.
In addition, as can be seen from the foregoing description, a transition duration interval may exist between adjacent levels, so one action suggestion result may belong to two levels at the same time; in practical applications this provides more choices when the target video is generated subsequently. When an action suggestion result belongs to two levels, if either of the two levels is determined as the target level, the action suggestion result can be treated as an action suggestion result belonging to that target level. Likewise, when the target level is determined according to the duration threshold corresponding to each level, if mode two is adopted, an action suggestion result belonging to two levels at the same time can be counted as an action suggestion result in both of the levels to which it belongs. For example, if an action suggestion result belongs to both level 1 and level 2, then whether level 1 or level 2 is determined as the target level, the video segment corresponding to that action suggestion result can be used as a candidate video segment, and the action suggestion result can be counted as an action suggestion result in level 1 or as one in level 2.
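The following sketch puts the above rules together: computing the video limit duration T = T'/n and selecting the target level by mode one or mode two. The data layout, the adjustment factor, and the fallback when no level qualifies under mode one are assumptions for illustration.

```python
def video_limit_duration(video_duration, n=2.0):
    return video_duration / n                      # T = T'/n, with n >= 1

def target_level_mode_one(limit, thresholds):
    """thresholds: {level: duration_threshold}, level 1 being the highest level.
    Pick the first level (high -> low) whose threshold does not exceed the limit."""
    for level in sorted(thresholds):
        if limit >= thresholds[level]:
            return level
    return max(thresholds)                         # no level fits: fall back to the lowest (assumption)

def target_level_mode_two(limit, thresholds, suggestions_by_level):
    """suggestions_by_level: {level: [segment durations of its action suggestions]}."""
    levels = sorted(thresholds)                    # high level first (level 1 highest)
    for i, level in enumerate(levels):
        if limit >= thresholds[level] or i == len(levels) - 1:
            return level                           # threshold fits, or last level reached
        fit_here = sum(d <= limit for d in suggestions_by_level.get(level, []))
        fit_next = sum(d <= limit for d in suggestions_by_level.get(levels[i + 1], []))
        if fit_here >= fit_next:                   # first number not less than second number
            return level
    return levels[-1]
```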
In an optional embodiment of the present application, when screening each candidate video segment based on the intention features to obtain each target video segment, the method specifically includes:
acquiring the video segment feature of each candidate video segment;
determining the relevance between the intention feature and the video segment feature of each candidate video segment;
and screening the candidate video segments based on their relevance to obtain the target video segments.
In an optional embodiment of the present application, the screening of each candidate video segment based on the intention features to obtain each target video segment may specifically include:
acquiring the video segment feature of each candidate video segment;
determining the weight of each to-be-processed video segment based on the intention feature and the video segment features of the to-be-processed video segments, where the to-be-processed video segments are the candidate video segments, or the video segments obtained by screening the candidate video segments based on the relevance between the intention feature and the video segment feature of each candidate video segment;
and screening out the target video segments based on the video segment features and the weights of the to-be-processed video segments.
To determine the video segments ultimately used for generating the target video, after the candidate video segments are obtained, they may be further screened based on the user's intention, so that target video segments that better match the user's intention are screened out and the target video is generated from them. One optional way is to screen based on the relevance between the intention feature and the segment feature of each candidate video segment, and to use the candidate video segments with higher relevance as the target video segments, for example the candidate video segments whose relevance is greater than a set threshold. Another optional way is to determine the weight of each candidate video segment based on the intention feature and the video segment features of the candidate video segments, that is, to analyze the importance of each candidate video segment, and then to screen out the target video segments based on these weights. A further alternative combines the two: candidate video segments with higher relevance are first screened out based on the similarity between the segment feature of each candidate video segment and the intention feature, and then the importance of these more relevant segments is further analyzed based on the intention feature, so that the target video segments are screened out from them.
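A minimal sketch of the two screening steps just described, assuming cosine similarity as the relevance measure and a softmax over the relevance scores as the segment weights; the threshold and the number of segments kept are illustrative assumptions.

```python
import numpy as np

def relevance(intent_feature, segment_features):
    """Cosine similarity between the intention feature and each segment feature."""
    intent = intent_feature / (np.linalg.norm(intent_feature) + 1e-8)
    segs = segment_features / (np.linalg.norm(segment_features, axis=1,
                                              keepdims=True) + 1e-8)
    return segs @ intent                             # one score per segment

def screen_segments(intent_feature, segment_features, threshold=0.3, top_k=5):
    scores = relevance(intent_feature, segment_features)
    keep = np.where(scores >= threshold)[0]          # relevance-based filtering
    if keep.size == 0:
        return []
    weights = np.exp(scores[keep] - scores[keep].max())
    weights /= weights.sum()                         # softmax attention weights
    ranked = sorted(zip(keep.tolist(), weights.tolist()),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]                            # (segment index, weight) pairs
```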
In an optional embodiment of the present application, determining a weight of each to-be-processed video segment based on the intention feature, and screening out each target video segment based on the video segment feature of each to-be-processed video segment and the weight of each to-be-processed video segment may specifically include:
based on the intention characteristics and the video clip characteristics of each video clip to be processed, the following operations are sequentially executed to obtain a target video clip corresponding to each operation:
determining the weight of each video clip to be processed of the current operation based on the intention characteristics, the video clip characteristics of each video clip to be processed and the weight of each video clip to be processed determined by at least one operation before the current operation;
and screening out the target video clip corresponding to the current operation based on the video clip characteristics of each video clip to be processed, the weight of each video clip to be processed of the current operation and the target video clip determined by at least one operation before the current operation.
Alternatively, in order to generate for the user an ordered video that follows a certain rule, the above operations may be performed multiple times, one target video segment being screened out each time, and a target video with more coherent content is generated based on the multiple ordered target video segments (ordered by the sequence in which they were screened out) obtained through the multiple operations.
When determining the weight of each to-be-processed video segment in the current operation, the weights of the to-be-processed video segments determined by at least one operation before the current operation may be merged in, for example, the weights corresponding to the operation immediately preceding the current operation. In this way, associated history information is also taken into account when the weights of the current operation are determined, so that the weights determined by successive operations have better intrinsic relevance. When the target video segment corresponding to the current operation is then determined based on these weights, a target video segment can be obtained that has a better intrinsic relation with the previously screened target video segments and a better potential time order among the contents of the screened segments. Similarly, when the target video segment is screened based on the current weights, the target video segments already determined by at least one operation before the current operation may be further considered, so as to screen out a current target video segment that has a better temporal relationship with the already determined segments.
It can be understood that, when the above operation is performed for the first time, there is no operation before the current one. In this case, the weight of each to-be-processed video segment of the current operation may be determined based only on the intention feature and the video segment feature of each to-be-processed video segment, and similarly, the target video segment of the first operation may be screened out based only on the video segment features of the to-be-processed video segments and the weights determined in the current operation.
In practical applications, in order to avoid executing the operation too many times or screening out too many target video segments, corresponding stop conditions may be configured for different application scenarios when the operation is executed repeatedly to screen the target video segments. For example, when the user requests a video duration for the target video, the stop condition may be configured according to that duration: the operation may be stopped when the total duration of the screened target video segments is not less than the requested video duration. When the user does not request a video duration, a default video duration or the number of video segments to be included in the target video may be configured as the stop condition. As another example, if the user does not request a video duration, different processing manners may be adopted according to the type of the to-be-processed video segments: when the to-be-processed video segments are the candidate video segments, the stop condition may be configured as required; when the to-be-processed video segments are the video segments obtained by screening the candidate video segments based on the correlation between the intention feature and the segment feature of each candidate video segment, the to-be-processed segments have already been screened based on the user intention, so no stop condition needs to be configured at this point, that is, the number of finally determined target video segments may equal the number of to-be-processed video segments, and the above operations simply give the to-be-processed video segments a temporal order. Of course, even when the to-be-processed video segments have already been screened based on the user intention, a stop condition may still be configured; these are merely several optional ways.
In addition, it should be noted that the time order among the segment contents mentioned above is not necessarily the real chronological order; it may be understood as a relative time order of the content of the video segments. For example, if one video segment corresponds to a summer scene and another corresponds to an autumn scene, then when the target video segments corresponding to the above operations are obtained, the summer-scene video may be screened out by the first operation and the autumn-scene video by the second operation. As an illustrative example, when a recurrent neural network is used to screen the target video segments, the different operations may correspond to different processing time instants or time steps of the recurrent neural network.
In an optional embodiment of the present application, determining a weight of each to-be-processed video segment based on the intention feature and the video segment feature of each to-be-processed video segment, and screening out each target video segment based on the video segment feature and the weight of each to-be-processed video segment may specifically include:
for the first operation, determining the weight of each to-be-processed video segment of the first operation based on the intention feature and the segment feature of each to-be-processed video segment; obtaining the weighted video segment feature of the first operation based on the weights of the to-be-processed video segments of the first operation; obtaining the hidden state feature of the first operation based on the weighted video segment feature of the first operation; and obtaining the target video segment of the first operation based on the hidden state feature and the weighted video segment feature of the first operation;
for each operation other than the first operation, determining the weight of each to-be-processed video segment of the current operation based on the hidden state feature of the previous operation, the intention feature, and the video segment feature of each to-be-processed video segment; obtaining the weighted video segment feature of the current operation based on the weights of the to-be-processed video segments of the current operation; obtaining the hidden state feature of the current operation based on the hidden state feature of the previous operation, the weighted video segment feature of the current operation, and the target video segment of the previous operation; and obtaining the target video segment of the current operation based on the hidden state feature and the weighted video segment feature of the current operation.
This alternative of the present application provides a way to determine the target video segments based on an attention mechanism and a recurrent neural network. Most existing ordering-based video generation methods do not consider the semantic information of the videos or the theme of the video content, and order the segments only according to simple rules, so that the generated videos have poor coherence and do not meet the user's requirements; meanwhile, when facing a complex scene, it is difficult to extract and summarize video generation rules from complicated data. To solve this problem, the present application provides an attention-based video generation method that can automatically edit video segments based on the video content and the user intention to generate the video the user wants. In the scheme of the present application, the user intention, i.e., the intention feature reflecting the user intention, is added into the calculation of the attention weights, so that the user intention is fused in when the weight of each video segment is determined. In addition, a recurrent neural network based on the attention mechanism is used to obtain the target video segment in each operation, which takes the user intention and the video features into account at the same time, and the intrinsic relations between the to-be-processed video segments (such as the candidate video segments) can be extracted by the neural network, so that the video generation effect of this scheme is better than that of rule-based methods.
When the hidden state feature of the current operation is obtained based on the hidden state feature of the previous operation, the weighted video segment feature of the current operation, and the target video segment of the previous operation, the index of the target video segment of the previous operation may specifically be used. For a given operation, the hidden state feature of the previous operation can reflect the information about the video segments recorded by the network from the initial operation, i.e., the first operation, up to the previous operation, or at least the state of the previous operation, i.e., the target video segment selected by it. Therefore, when the hidden state feature of the current operation is determined and the current operation is not the initial one, the hidden state feature of the previous operation can be taken into account, so that the hidden state feature of the current operation better captures the information of the current operation and of the operations before it. A video segment of the current operation that has a better intrinsic connection with the target video segment of the previous operation can then be determined based on the hidden state feature and the weighted video segment feature of the current operation.
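The step-by-step selection described above can be sketched as follows. This is only an illustrative construction under assumed dimensions and module names (AttentiveSelector, the GRU cell, the scoring layers); the actual network structure of this application may differ, and a practical implementation would typically also mask segments that have already been picked.

```python
# A minimal sketch of attention-based recurrent selection: each operation
# re-weights the to-be-processed segments using the intention feature and the
# previous hidden state, updates the hidden state, and picks one target segment.
import torch
import torch.nn as nn

class AttentiveSelector(nn.Module):
    def __init__(self, feat_dim, intent_dim, hidden_dim):
        super().__init__()
        self.att = nn.Linear(feat_dim + intent_dim + hidden_dim, 1)  # attention scorer
        self.rnn = nn.GRUCell(feat_dim + feat_dim, hidden_dim)       # recurrent state update
        self.score = nn.Linear(hidden_dim + feat_dim, 1)             # segment scorer

    def forward(self, seg_feats, intent_feat, num_steps):
        # seg_feats: (num_segments, feat_dim); intent_feat: (intent_dim,)
        n = seg_feats.size(0)
        hidden = torch.zeros(self.rnn.hidden_size)
        prev_pick = torch.zeros(seg_feats.size(1))   # feature of the previously picked segment
        picked = []
        for _ in range(num_steps):
            # 1. attention weights fuse intention, segment features and history (hidden state)
            ctx = torch.cat([intent_feat, hidden]).expand(n, -1)
            weights = torch.softmax(
                self.att(torch.cat([seg_feats, ctx], dim=1)).squeeze(1), dim=0)
            weighted = (weights.unsqueeze(1) * seg_feats).sum(dim=0)  # weighted segment feature
            # 2. update hidden state from previous hidden state, weighted feature, last pick
            hidden = self.rnn(torch.cat([weighted, prev_pick]).unsqueeze(0),
                              hidden.unsqueeze(0)).squeeze(0)
            # 3. score each segment against the current hidden state and pick one
            scores = self.score(torch.cat([hidden.expand(n, -1), seg_feats], dim=1)).squeeze(1)
            idx = int(torch.argmax(scores))
            picked.append(idx)
            prev_pick = seg_feats[idx]
        return picked
```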
In an optional embodiment of the present application, if the video generation request includes a video duration of the target video, after the target video segment of the current operation is screened out, the method further includes:
determining the total duration of the target video segments that have been screened out;
if the total duration is less than the video duration, performing the screening operation of the target video segment for the next operation;
and if the total duration is not less than the video duration, ending the screening of target video segments.
In order to generate a video whose duration meets the length desired by the user, in the embodiment of the present application, after the target video segment of each operation is determined, it may be judged whether the total duration of the video segments determined so far already meets the requirement. If the required video duration has been reached, the target video may be generated based on the determined video segments; if not, the selection of target video segments may continue.
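A minimal sketch of this duration-based stopping condition follows; the helper name select_next_segment and its return value are assumptions for illustration.

```python
# Keep screening target segments until their total duration reaches the
# requested video duration.
def screen_until_duration(select_next_segment, requested_duration):
    """select_next_segment() is assumed to return (segment, duration_in_seconds)."""
    targets, total = [], 0.0
    while total < requested_duration:
        seg, dur = select_next_segment()   # one screening operation
        targets.append(seg)
        total += dur
    return targets  # total duration is now >= requested_duration
```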
In an optional embodiment of the present application, if each target video segment is a target video segment screened by each operation, generating a target video based on each target video segment includes:
and fusing the target video clips based on the screening sequence of the target video clips to obtain the target video.
After the target video segments of all operations are obtained, the video segments can be fused based on the order in which they were screened out, yielding a target video whose segment contents have a good sequential relationship, which better matches the actual situation and meets the user's requirements.
In an optional embodiment of the present application, generating a target video based on each target video segment may specifically include:
screening each target video clip based on the correlation degree between the video clips in each target video clip;
and generating a target video based on the screened target video clips.
In practical application, after each target video segment is obtained by screening from the candidate video segments, each target video segment can be directly adopted to generate a target video, and each target video segment can also be screened again to better meet the requirements of users.
When the target video segments are screened based on the correlation between the video segments, different screening methods can be adopted for different application requirements. For example, if the correlation between the video segments included in the target video to be generated should be as high as possible, some video segments with higher correlation can be screened out based on the correlation between the target video segments; if instead the content of the target video should be diverse and as rich as possible, some video segments with lower correlation can be screened out based on the correlation between the target video segments.
It should be noted that, in the foregoing, when the target video segments are screened based on the video segment features and the weights of the to-be-processed video segments, if the to-be-processed video segments are the video segments screened out based on the correlation between the intention feature and the segment feature of each candidate video segment, the processing of this scheme may be further applied to them, that is, they may be screened again based on the correlation between the to-be-processed video segments, and the target video segments are then screened from the segments obtained by this further screening. That is, in practical applications, the alternatives provided by the embodiments of the present application may be combined with each other.
In an optional embodiment of the present application, the screening of each target video segment based on the correlation between the video segments in each target video segment includes:
determining a reference video clip in each target video clip;
taking the reference video clip as an initial screened target video clip, and repeatedly executing the following operations on all video clips to be screened except the screened target video clip in all target video clips until all screened target video clips are determined:
and respectively determining the correlation degrees of the screened target video segments and the video segments to be screened, and determining the video segment to be screened corresponding to the minimum correlation degree smaller than a set value in the correlation degrees as a new screened video segment.
That is, in the video segment screening, one or more (including two) reference video segments may be selected first; for example, based on the correlation between the video segment feature of each target video segment and the intention feature, the target video segment with the highest correlation may be used as the reference video segment. The selected reference video segment is used as a screened segment for generating the target video. Then, based on the correlations between the screened segments and the other target video segments, the target video segment corresponding to the minimum correlation that is smaller than the set value is used as the next screened video segment, and this process may be repeated until none of the remaining target video segments satisfies the condition (i.e., has a lowest correlation smaller than the set value).
Based on the scheme, some video clips with low relevance can be screened out from all target video clips, so that the target videos with rich and diversified contents can be generated based on the screened videos. Of course, as can be seen from the foregoing description, according to different application scenarios, some video segments with higher correlation may also be screened out, so that the contents of the video segments included in the target video are more similar, and the continuity of the video contents is likely to be relatively higher. Of course, the two target videos can be provided to the user at the same time, and the user can select one or both of the target videos according to the actual needs.
It should be noted that, in this scheme, if there are multiple screened target video segments, different manners may be configured for determining the correlation between the screened target video segments and a given video segment to be screened. For example, the correlation between the video segment to be screened and each screened target video segment may be calculated respectively, and the maximum, the minimum, or the average of these correlations may be used as the correlation corresponding to the video segment to be screened; alternatively, one of the screened target video segments (for example, the most recently screened segment or the reference video segment) may be selected, and its correlation with the video segment to be screened used as the correlation corresponding to that video segment.
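The reference-segment screening can be sketched as below. The correlation measure (cosine similarity) and the aggregation over the screened set (maximum correlation) are two of the optional choices named above, picked here only for illustration.

```python
# Start from a reference segment, then repeatedly add the to-be-screened segment
# whose correlation with the already-screened segments is the smallest, provided
# that correlation is below the set value.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diversity_screen(seg_feats, reference_idx, set_value=0.7):
    screened = [reference_idx]
    remaining = [i for i in range(len(seg_feats)) if i != reference_idx]
    while remaining:
        # correlation of each remaining segment with the screened set (max over screened)
        corrs = {i: max(cosine(seg_feats[i], seg_feats[j]) for j in screened)
                 for i in remaining}
        best = min(corrs, key=corrs.get)        # lowest-correlation candidate
        if corrs[best] >= set_value:            # no segment satisfies the condition
            break
        screened.append(best)
        remaining.remove(best)
    return screened
```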
In an optional embodiment of the present application, generating a target video based on each of the target video segments may specifically include:
determining the action type of the action behaviors contained in each target video clip;
and generating a target video based on each target video clip and the action type corresponding to each target video clip so as to show the action type corresponding to each target video clip to the user.
That is to say, when the target video is generated, the action types corresponding to the video segments included in the video may be provided to the user, so that the user knows which action video segments the finally generated target video contains. For example, for each video segment included in the target video, the identifier of its action type may be carried on each corresponding video frame, or on the first one or several video frames of each segment, or on the last frame of the previous segment, so that while watching the target video the user can learn from the video frames which action corresponds to the segment about to be played. As another example, the action types may not be carried on the video frames of the generated target video; instead, the target video and the action types corresponding to its segments may be displayed to the user separately, for example, the action types may be displayed according to the playing order of the segments in the target video, or description information of the target video may be generated which includes the action types corresponding to the target video.
In order to better explain the video generation scheme provided by the embodiment of the present application, some schemes involved in various optional embodiments of the present application are further described in detail below with reference to some specific examples, in which an object in a frame image of a video is described by taking a person as an example.
The scheme provided by the embodiment of the application can comprise the following two stages:
Stage 1, aggregation of video segments: the multiple videos (corresponding to the candidate videos in the foregoing) are segmented according to different content semantics (persons and action behaviors), person and action semantic information is extracted from the segmented segments (corresponding to the candidate video segments in the foregoing), and segments irrelevant to the user intention are filtered out, that is, the candidate video segments are screened based on the intention feature to obtain the target video segments.
Stage 2, generation of the target video: the segments obtained in the previous stage, i.e., the candidate video segments, are combined with the user intention to generate a highlight video of the given duration.
Embodiments of each stage will be described in detail below with reference to specific examples.
Fig. 2 is a schematic flow chart of a video generation method provided in an example of the present application, and as shown in fig. 2, the implementation of the method mainly includes the following aspects:
● Input: multiple videos of the user and the user's video generation request; Output: the finally generated video, i.e., the target video;
specifically, the video generation request may include a time length of a video that is desired to be generated (i.e., a time length of the video in the foregoing text, an input video time length shown in the figure, such as 15 seconds) and a user intention (e.g., "i want to play basketball" shown in the figure).
In practical applications, various manners in which a user can initiate a video generation request may be configured. For example, the user may initiate the request through a voice instruction or through text input. A more detailed function may also be provided: after the user initiates a video generation request, a video personalization configuration interface may be presented, through which the user can input information about the video to be generated, such as one or more of the video duration, which persons the video should be about, and which scenes it should cover; the user may make the relevant settings as needed or leave them at their defaults. Specifically, for example, if the user wants to generate a video about a certain person, the gender, age, or other attribute information of that person may be configured through the interface. The user may also directly provide at least one image of the person; when the target video is generated, face recognition may be performed on the image provided by the user, only videos containing that person may be used as candidate videos, and when the first candidate video segments containing the object are determined, only the video segments containing that person in the candidate videos may be used as the first candidate video segments.
● Preprocessing: extracting the video features of each candidate video and the intention feature of the video generation request, i.e., the user intention;
in this example, the video features include visual and optical flow features. Alternatively, as shown in fig. 2, the 3D-CNN may be used to extract visual features (i.e., feature maps), and the output tensor of the 3D-CNN is denoted by L × H × W × C, where L denotes the frame number of the input video, H and W denote the height and width of the output feature map (which may be referred to as a visual feature map), respectively, and C denotes the number of output channels, i.e., the number of feature maps. Optical flow features can be extracted using Flownet, the output tensor of Flownet is denoted by L × H × W × 2, similarly, L, H and W represent the number of frames of the input video and the height and width of the output feature map (which may be referred to as an optical flow feature map), respectively, 2 represents the number of output channels, and since optical flow features are information for reflecting the change in the position of a pixel point in two adjacent frames of images, the number of output feature maps of optical flow features is 2, and the value of the same point in 2 output feature maps reflects the pixel displacement of the point.
The optical flow feature map and the visual feature map may have the same or different sizes. If they are different, then in order to fuse the two kinds of feature maps (specifically, by a concatenation (Concatenate) operation as shown in the figure) and use the fused feature (which may be referred to as the dual-stream feature) as the video feature in subsequent processing, at least one of the optical flow feature map and the visual feature map may be processed (e.g., by upsampling and/or downsampling) so that the two feature maps have the same size.
In practical application, as an optional implementation manner, the heights and widths of the visual feature map and the optical flow feature map may specifically be the height and width of a frame image of an input video, respectively, and a value of a same point in the two optical flow feature maps reflects a pixel displacement of the pixel point in the frame image of the video.
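The fusion described above can be sketched as follows; the choice of nearest-neighbour interpolation for resizing the flow map, and the concatenation along the channel axis, are illustrative assumptions.

```python
# A minimal sketch of fusing the visual feature map (L x H x W x C) with the
# optical flow feature map (L x H2 x W2 x 2) into the dual-stream feature.
import numpy as np
import torch
import torch.nn.functional as F

def fuse_dual_stream(visual, flow):
    """visual: (L, H, W, C) array; flow: (L, H2, W2, 2) array -> (L, H, W, C + 2)."""
    if visual.shape[1:3] != flow.shape[1:3]:
        t = torch.from_numpy(flow).permute(0, 3, 1, 2)              # (L, 2, H2, W2)
        t = F.interpolate(t, size=visual.shape[1:3], mode="nearest")
        flow = t.permute(0, 2, 3, 1).contiguous().numpy()            # (L, H, W, 2)
    return np.concatenate([visual, flow], axis=-1)                   # dual-stream feature
```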
For the intention feature, optionally, as shown in the figure, the intention feature may be extracted through an RNN (Recurrent neural network), specifically, the video generation request may be input into the RNN, and a corresponding intention feature is obtained based on an output of the RNN, where the intention feature may specifically be a feature vector with a set dimension (which may be configured according to an application requirement), for example, may be a feature vector with 256 dimensions.
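As a sketch of the RNN-based intention feature extraction, assuming a tokenized request and a GRU encoder (the embedding size and vocabulary handling are illustrative assumptions, while the 256-dimensional output follows the example in the text):

```python
# Encode the video generation request (e.g. "I want to play basketball")
# into a fixed-size intention feature vector with an RNN.
import torch
import torch.nn as nn

class IntentEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, intent_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, intent_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids of the request text
        _, h = self.rnn(self.embed(token_ids))
        return h[-1]            # (batch, intent_dim) intention feature vectors
```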
● video segment aggregation and video generation
In the examples of the present application, the module implementing the video segment aggregation function may be referred to as the video segment aggregation module, and the module for the final target video generation may be referred to as the video generation module. As shown in fig. 2, the video segment aggregation module is specifically configured to generate the candidate video segments (corresponding to the first candidate video segments and/or the second candidate video segments in the foregoing), and the video generation module generates the target video based on these candidate video segments. Of course, when the target video is generated based on the target video segments, background music can also be added; the music may be selected by the user, pre-configured, or screened from candidate background music based on the user intention or other relevance, for example, matching background music may be selected based on the content of the target video segments used to generate the target video.
The video segment aggregation section and the video generation section are described in detail below, respectively.
1. Video segment aggregation based on video understanding
In practical applications, users generally shoot videos around human activities; it can be said that persons and human activities are the most central semantic information in videos. In the prior art, usually only video segments in which a person faces the camera are segmented out of a single video, multiple videos cannot be analyzed, and person semantic information is not extracted in a strict sense. In the present application, the semantic information of persons and action behaviors across multiple videos is fully considered, and a multi-level action recognition scheme (described in detail later) and person understanding processing are adopted, so that detailed person and action semantic information of the video segments can be obtained, for example, semantic information indicating that a certain person is doing something; analysis is then performed in combination with the user intention, so that the video segments the user wants can be obtained.
The video segment aggregation scheme based on video understanding is provided by the present application to address the problem that existing video generation schemes neither consider the user intention nor apply video semantic information, so that the generated videos cannot well meet the user's requirements.
As shown in fig. 2, the module implementing the video segment aggregation function may be referred to as the video segment aggregation module, and the flow of the scheme implemented by this module is shown in fig. 3. Its input may include the dual-stream feature and the intention feature obtained in the foregoing, as well as the video length, i.e., the video duration. The video duration is optional: if it is not obtained, the input need not include it; alternatively, if the video duration input by the user is not obtained, a preconfigured duration may be adopted, or the duration limit may be ignored and the target video generated based on the finally determined target video segments.
As shown in fig. 3, the scheme implemented by the video segment aggregation module may mainly include three parts. One part performs person recognition to obtain the first candidate video segments containing persons, and another part performs action behavior analysis (the multi-level action recognition shown in the figure) and obtains the second candidate video segments based on the analysis results (i.e., the time period information, that is, the action suggestion results, in the foregoing). If the candidate video segments obtained from these two branches include both first candidate video segments corresponding to persons and second candidate video segments corresponding to the action analysis results, a part may be included in which the first and second candidate video segments are fused (the segment fusion shown in the figure), and the target video segments (the best video segments shown in the figure) are determined based on the fused candidate video segments. Specifically, the fusion processing step may also be performed after the target video segments are determined based on the first and second candidate video segments: if the determined target video segments include a first candidate video segment and a second candidate video segment belonging to the same candidate video, those segments may be fused at that point, and the target video is generated based on the video segments after the fusion processing.
The details of the parts shown in fig. 3 will be described below. It should be noted that, the sequence numbers in the following description are not used to limit the implementation sequence of each portion, but are only used to facilitate describing an identifier added to the content of each portion, and in practical applications, the implementation sequence of each portion may be configured according to actual needs, or may be executed synchronously.
1.1. About person identification
As an alternative, fig. 4 shows a schematic flow chart of obtaining video segments containing persons (i.e., the first candidate video segments) based on person recognition. As shown in fig. 4, this alternative mainly includes the following 3 parts, namely steps (i), (ii) and (iii) in the figure, which are described in detail as follows:
■ Pedestrian detection and tracking: this can be implemented by a pedestrian detection and tracking method; the input is the dual-stream feature of the video, and the output is the pedestrian tracks (e.g., the position information of persons in the images);
■ Face recognition: for each pedestrian track, face feature extraction and identity confirmation can be carried out using a face recognition algorithm;
■ Face clustering: a face clustering algorithm can be used to cluster the pedestrian tracks according to the face recognition results and assign unique identity numbers to all persons who appear; through this processing, the video slices of all persons in the videos, specifically the person video slice information shown in fig. 4, can be obtained.
In the example shown in fig. 4, for a video (the video 001 shown in the figure, where 001 is the identifier of the video), 3 persons, namely person A, person B and person C, are identified in the video, and 3 person video segments are obtained: a segment of video 001 from second 0.5 to second 5 containing person A, a segment from second 2 to second 6 containing person B, and a segment from second 10 to second 20 containing person C.
1.2. Multi-level behavior recognition
The embodiment of the present application provides a recognition method that can adaptively segment the action segments of the input videos and provide action categories at different levels. When the duration of the required video input by the user, i.e., the video duration, can be obtained, this step can incorporate that user requirement on the video duration of the target video.
Fig. 5 shows a schematic flow diagram of the multi-level behavior recognition scheme. The input of the part shown in the figure mainly includes the dual-stream feature and the input video length (i.e., the video duration of the target video, which is optional), and the output is the video segments, i.e., the second candidate video segments. The scheme mainly includes three steps, namely steps (i), (ii) and (iii) shown in the figure, which are described in detail as follows:
step (i), step S1 shown in the figure: and generating a suggestion hypothesis, namely generating an action suggestion result, namely generating corresponding time interval information of each video clip possibly containing the action in the candidate video.
The purpose of this step is to generate a large number of action suggestion hypotheses, i.e., time period information (also referred to as suggestion hypotheses, action suggestions, or action suggestion results for short), based on the dual-stream feature. Each action suggestion hypothesis may specifically include the start time and the end time of an action (an action that may exist in a video segment). For example, for a suggestion P1 (seconds 1 to 14), the start time of the action suggestion hypothesis is the 1st second and the end time is the 14th second of the video, and the action suggestion result indicates that an action is likely to exist in the video segment from the 1st to the 14th second, with a segment duration of 14 seconds. The example in fig. 5 shows several action suggestion hypotheses (the action suggestions shown in the figure); in this example, each action suggestion hypothesis is directly represented by its start time and end time, such as (T1, T14) as shown in the figure, where T can be understood as time and the subscripts take the values of the start time and end time, that is, for (T1, T14) the start time of this action suggestion hypothesis is the 1st second and the end time is the 14th second of the video.
The determination of the action suggestion results may specifically be implemented by a neural network, which is trained with a large number of video samples so that it can generate the action suggestion hypotheses corresponding to video segments in a video based on features of the video (such as the dual-stream feature). Specifically, the dual-stream feature of the video obtained in the previous stage may be further processed by a feature extraction network, and a large number of action suggestion hypotheses may then be generated from the extracted action suggestion features. For example, the dual-stream feature of the video may be input into a 3D convolution network containing 2 layers of 3 × 3 convolution kernels, deeper action suggestion features may be extracted by this network, a large number of action suggestion hypotheses may then be generated according to an existing action suggestion generation manner, and the action suggestions may then be classified into different levels according to a level decision strategy based on their durations. Generally, suggestions with longer durations receive a higher level, such as level 1.
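The following sketch illustrates this step under stated assumptions: a small 3D convolutional head refines the dual-stream feature, and candidate (start, end) suggestions are produced here with simple multi-scale sliding windows. The kernel sizes, window scales and strides are illustrative assumptions, not the exact suggestion-generation manner of this application.

```python
import torch
import torch.nn as nn

class ProposalFeature(nn.Module):
    """2-layer 3D convolutional head over the dual-stream feature."""
    def __init__(self, in_channels, mid_channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, dual_stream):                 # (N, C, L, H, W), channels-first layout
        return self.conv(dual_stream)

def sliding_window_suggestions(video_seconds, window_sizes=(4, 8, 16), stride=2):
    """Enumerate (start_s, end_s) action suggestion hypotheses over the video timeline."""
    suggestions = []
    for w in window_sizes:
        start = 0
        while start + w <= video_seconds:
            suggestions.append((start, start + w))
            start += stride
    return suggestions
```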
For the level determination of the action suggestion hypotheses, the embodiment of the present application provides two alternative level decision strategies: one with higher precision and better performance, referred to as strategy J1 hereinafter, and one with fast operation speed and lower resource consumption, referred to as strategy J2 hereinafter. The level decision schemes based on the two strategies are described separately below with reference to fig. 5.
a. Strategy J1
In this example, for one video, it is assumed that 5 action suggestion hypotheses are obtained based on its dual-stream feature, denoted P1 (1–14 s), P2 (2–8 s), P3 (3–8 s), P4 (11–20 s) and P5 (21–25 s), where s is the time unit of seconds and the two times in parentheses are the start time and the end time, in the video, of the video segment corresponding to each action suggestion hypothesis. The segment durations of the 5 action suggestion hypotheses are 14 s, 6 s, 5 s, 9 s and 4 s, respectively, and these 5 action suggestion hypotheses are used as the example in the following description.
The duration threshold and duration interval corresponding to each level may be obtained empirically and by statistical analysis over a large amount of training data (i.e., training samples, specifically sample videos). Specifically, according to experience and statistics, within a certain error range the action segments of different levels in the training data (segments in which an action exists in a video) can be considered to follow Gaussian distributions over duration, and the duration interval [mean − N·std, mean + N·std] of each level can be used to determine the level of each action suggestion hypothesis, where mean is the mean of the Gaussian distribution and std is its standard deviation. For example, for level 1, the corresponding duration interval is determined based on the mean and standard deviation of the Gaussian distribution corresponding to the action segments of that level. N is a positive number preset as needed; its specific value can be configured according to requirements, an optional value being N = 3. With this value, the duration interval of each level covers 99.7% of the training samples, that is, the durations of the action suggestion results of each level fall within the duration interval of the corresponding level for 99.7% of the training samples.
In the following description, taking only two levels of training data as an example, fig. 6 shows a schematic diagram of gaussian distributions over time durations corresponding to two levels, i.e., level 1 and level 2 (the level of level 1 is higher than the level 2), in which a time period [ t1, t2] is a transition time duration interval of level 1 and level 2. For each action suggestion hypothesis, a rank of the action suggestion hypothesis may be determined based on a duration corresponding to the action suggestion hypothesis and a duration interval corresponding to the ranks.
Specifically, assuming that the duration of an action suggestion hypothesis (i.e., the duration of the video segment corresponding to it) is t, the level of the action suggestion hypothesis can be determined by comparing t with the duration intervals [mean − N·std, mean + N·std] of the two levels. As shown in fig. 6, t1 is the left boundary of the interval of level 1, and t2 is the right boundary of the interval of level 2. When t < t1, the suggestion is assumed to be level 2; when t > t2, the suggestion is level 1; when t falls within the interval [t1, t2], the suggestion belongs to both level 1 and level 2.
For the 5 action suggestions described above, namely P1 (1–14 s), P2 (2–8 s), P3 (3–8 s), P4 (11–20 s) and P5 (21–25 s), the corresponding segment durations are 14, 6, 5, 9 and 4 seconds, respectively. Applying strategy J1, action suggestions P1 and P4 can be labeled as level 1, P5 as level 2, and P2 and P3 belong to both level 1 and level 2. Similarly, for the action suggestion hypotheses shown in fig. 5, strategy J1 may be used to determine the level of each hypothesis: for example, the duration corresponding to the action suggestion (T1, T14) is 14 seconds and its level is 1, while the duration corresponding to the action suggestion (T55, T60) is 5 seconds and its level is 2.
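A minimal sketch of strategy J1 follows, assuming the levels' Gaussian parameters are fitted from training-data durations and that a suggestion is assigned to every level whose interval contains its duration (so a suggestion in a transition interval belongs to two levels); suggestions outside all intervals are simply left unassigned here.

```python
import numpy as np

def fit_level_intervals(durations_per_level, n=3.0):
    """durations_per_level: {level: [segment durations in the training data]}."""
    intervals = {}
    for level, durations in durations_per_level.items():
        mean, std = float(np.mean(durations)), float(np.std(durations))
        intervals[level] = (mean - n * std, mean + n * std)   # [mean - N*std, mean + N*std]
    return intervals

def assign_levels_j1(duration, intervals):
    """Return every level whose duration interval contains this suggestion's duration."""
    return [level for level, (lo, hi) in intervals.items() if lo <= duration <= hi]
```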
b. Strategy J2
Compared with the strategy J1, the strategy is simple in calculation, low in equipment resource consumption during calculation and suitable for scenes sensitive to operation time. For strategy J2, the average duration of segments of different levels in the training data may be first calculated (as the duration threshold corresponding to each level). In determining the level of each action suggestion hypothesis, the level with the average duration closest to the duration of the action suggestion hypothesis may be selected as the level of the action suggestion by comparing the duration of the action suggestion hypothesis with the average duration of the levels.
Still taking the two levels shown in fig. 6 as an example, for example, t1 and t2 are the durations of two different action suggestions respectively, since t1 is nearest to the average duration (mean) of level 2, the level of the action suggestion corresponding to t1 is level 2; similarly, t2 is closest to the average duration of level 1, and the action suggestion corresponding to t2 has a level of 1.
Applying strategy J2, if the average duration of level 1 is 10 seconds and the average duration of level 2 is 5 seconds, then for the above action suggestions with durations of 14, 6, 5, 9 and 4 seconds, action suggestions P1 and P4 (whose durations are closer to 10 seconds than to 5 seconds) may be labeled level 1, and action suggestions P2, P3 and P5 (whose durations are closer to 5 seconds than to 10 seconds) may be labeled level 2.
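Strategy J2 reduces to a nearest-average lookup; a minimal sketch under the averages used in the example above:

```python
def assign_level_j2(duration, avg_duration_per_level):
    """avg_duration_per_level: {level: average duration in seconds}, e.g. {1: 10, 2: 5}."""
    return min(avg_duration_per_level,
               key=lambda lv: abs(avg_duration_per_level[lv] - duration))

# e.g. assign_level_j2(14, {1: 10, 2: 5}) -> 1 ; assign_level_j2(6, {1: 10, 2: 5}) -> 2
```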
After the grade of each action suggestion hypothesis is determined, the following step (II) is carried out.
Step (ii), corresponding to step S2 in the figure: suggestion selection, i.e., determining the second candidate video segments.
In order to reduce the amount of computation and improve accuracy, only a part of the suggestions may be selected for subsequent processing. As shown in fig. 7, this step is used to select the suggestions that meet the requirement (the satisfactory suggestions shown in the figure), that is, to determine which level's suggestion hypotheses are selected before entering the next step; in other words, a target level is determined, and the video segments corresponding to the action suggestion hypotheses belonging to the target level are determined as the second candidate video segments.
Specifically, the target level may be adaptively determined according to the duration of the video that the user ultimately wishes to generate (i.e., the input video length shown in fig. 5, the input video duration shown in fig. 7). If the user does not give the input video duration, the duration can be regarded as infinite in the actual processing and this step omitted, or the step can be performed with a preconfigured duration instead of the user-input duration. In practical applications, if a preconfigured duration is used, different selectable durations may be configured for different application scenarios; for example, the corresponding preconfigured duration may be selected according to the scene of the video the user wants to generate.
Corresponding to the previous step, the embodiments of the present application also propose two suggestion selection strategies, which may be referred to as strategies Z1 and Z2 respectively and are described below.
a. Strategy Z1
This strategy Z1 is the selection strategy corresponding to strategy J1. The five action suggestions described above, P1 (1–14 s), P2 (2–8 s), P3 (3–8 s), P4 (11–20 s) and P5 (21–25 s), are used as the example. The following suggestion levels have been derived based on strategy J1:
P1 (1–14 s) is level 1, P2 (2–8 s) is level 1 and level 2, P3 (3–8 s) is level 1 and level 2, P4 (11–20 s) is level 1, and P5 (21–25 s) is level 2.
Assuming that the input video duration (the length of the video that the user ultimately wishes to generate) is T', by introducing an adjustment factor n the length limit of each video segment, i.e., the video limit duration T = T'/n, can be obtained. n is an integer not less than 1; in practical applications, empirical values of 2 to 5 may be used according to the length of the input video. For convenience of description, n = 2 is taken as the example. In this example, it is assumed that the average duration of level 1, i.e., its duration threshold, is 10 seconds, and the average duration of level 2 is 5 seconds.
When the target level is determined, T is compared with the average duration of each level, starting from level 1. If the video duration input by the user is 30 seconds, the video length limit (i.e., the video limit duration) T is 15 seconds, which is greater than 10 seconds, so level 1 is selected as the target level, the suggestion hypotheses corresponding to level 1 are sent to the next step, and this step is exited. If the user input video duration is 18 seconds, the video length limit T is 9 seconds, which is less than 10 seconds but not less than the 5-second average duration of level 2. In this case, as shown in fig. 8, N1 and N2 may be compared, where N1 represents the number of suggestions belonging to level 1 with duration less than T (the shaded portion in fig. 8) and N2 represents the number of suggestions belonging to level 2 with duration less than T. If N1 > N2, level 1 is selected, indicating that even a portion of level 1 can provide a sufficient number of suggestions meeting the duration requirement; otherwise, level 2 is selected, i.e., the suggestions in level 2 shorter than T are selected. If there are more than two levels, the method can be extended in the same way: for example, if the average duration of level 1 is greater than T and the number of level-1 suggestions with duration less than T is less than the number of level-2 suggestions with duration less than T, the processing moves on to level 2 and level 3, and whether level 2 or level 3 is determined as the target level is decided based on the number of level-2 suggestions with duration less than T and the number of level-3 suggestions with duration less than T.
Once a certain level is selected, a suggestion in the transition duration interval associated with that level that meets the condition (i.e., whose duration is less than the video limit duration) uses that level as its level in subsequent processing. That is, if the duration of an action suggestion falls within a transition duration interval and one of the two levels corresponding to that interval is determined as the target level, then provided the duration of the action suggestion is less than the video limit duration, the action suggestion result is also treated as an action suggestion result of the target level, and the video segment corresponding to it is also used as a candidate video segment for subsequent processing. For example, when level 1 is selected as the target level, the action suggestions whose durations lie in the transition duration interval [t1, t2] in fig. 8 belong to both level 1 and level 2, but are ultimately processed as level-1 suggestions.
Regarding the corresponding strategies J1 and Z1 described above, this embodiment uses Gaussian distributions to fit the training data of the different levels, in which the decision interval of a level, i.e., the duration interval [mean − N·std, mean + N·std], plays a key role. In practice, when a smaller N (e.g., 0.5) is selected, the suggestions of each level tend to converge, i.e., the duration interval is narrower, so that only suggestions very close to the average duration of a level are assigned to that level. Processing in this way makes the level division stricter and reduces the number of suggestions per level, which reduces the amount of data to be processed subsequently, facilitates fast subsequent processing, and can also improve the classification precision. On the other hand, if the diversity of the segments needs to be considered, a larger N (e.g., 5) may be used; in that case more suggestion hypotheses fall in the transition duration interval and may belong to multiple levels, more suggestions are selected for the next classification processing, and the content of the obtained video segments is richer. In practical applications, the value of N can be determined according to the application requirements.
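The Z1 target-level selection described above can be sketched as follows, under the assumptions of the example (levels ordered from the longest average duration downwards, T = T'/n, and the N1/N2 comparison for adjacent levels); the subsequent restriction to suggestions shorter than T is omitted here.

```python
def select_target_level_z1(requested_duration, n, avg_durations, suggestions_by_level):
    """avg_durations: {level: avg duration}; suggestions_by_level: {level: [durations]}."""
    T = requested_duration / n                       # video limit duration
    levels = sorted(avg_durations, key=avg_durations.get, reverse=True)  # e.g. [1, 2]
    for i, level in enumerate(levels):
        if i == len(levels) - 1:                     # last (shortest) level: take it
            return level
        if T >= avg_durations[level]:                # limit covers this level's average
            return level
        nxt = levels[i + 1]
        n_cur = sum(d < T for d in suggestions_by_level[level])
        n_next = sum(d < T for d in suggestions_by_level[nxt])
        if n_cur > n_next:                           # enough short-enough suggestions here
            return level
        # otherwise fall through and repeat the comparison from the next level

# e.g. with avg_durations = {1: 10, 2: 5}, requested_duration = 18 and n = 2 -> T = 9,
# the choice between level 1 and level 2 depends on the counts N1 and N2.
```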
b. Strategy Z2
This strategy is the selection strategy corresponding to strategy J2. For the above suggestions P1–P5, based on strategy J2, suggestions P1 and P4 are level 1 and P2, P3 and P5 are level 2. A schematic flow chart of target level selection based on strategy Z2 is shown in fig. 9, still using the two levels, level 1 and level 2, as the example; as shown in fig. 9, the average duration of each level is compared with the video length limit, starting from level 1. The specific steps are as follows:
if the average duration of the current level is not greater than the video length limit, the suggestion for that level is selected. For example, the average duration of level 1 is 10 seconds, and the video length is limited to 15 seconds, then level 1 satisfies the condition, as the target level, the action suggestions P1 and P4 belonging to the level are selected and the step is exited.
If the average duration of the current level is greater than the video length limit, the average duration of the next level needs to be compared until the average duration of a certain level is less than the video length limit. For example, if the video length limit is 7 seconds, and the average duration of level 1 does not satisfy the condition, the video length limit is compared with the average duration of level 2, i.e., 5 seconds, and level 2 satisfies the condition, i.e., the average duration of level 2 is less than the video length limit, the level 2 action suggestions P2, P3, and P5 are selected and the step is exited.
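A minimal sketch of strategy Z2; falling back to the shortest level when no level qualifies is an assumption added for completeness, not stated in the text.

```python
def select_target_level_z2(video_length_limit, avg_durations):
    """Walk the levels from the longest average duration downwards and pick the first
    level whose average duration does not exceed the video length limit."""
    levels = sorted(avg_durations, key=avg_durations.get, reverse=True)
    for level in levels:
        if avg_durations[level] <= video_length_limit:
            return level
    return levels[-1]   # fallback assumption: shortest level

# e.g. select_target_level_z2(15, {1: 10, 2: 5}) -> 1
#      select_target_level_z2(7,  {1: 10, 2: 5}) -> 2
```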
Step (iii), corresponding to step S3 shown in fig. 5: action type classification.
The purpose of this step is to determine the action type of each action suggestion hypothesis screened out in step S2. Specifically, for a given suggestion, the suggestion feature of the action suggestion (i.e., the feature of the video segment corresponding to the action suggestion) is fed into an FC layer (fully connected layer); an output feature vector is obtained based on the network parameters (weight and bias) of the FC layer, which specifically computes weight × suggestion feature + bias. The feature vector output by the FC layer is then classified by a Softmax layer to obtain the action classification result corresponding to the action suggestion result. The output of the Softmax layer may be the label of a specific action classification result, i.e., the label of a specific action type, or it may be a classification result vector; this vector is a one-dimensional column vector whose number of elements equals the number of action type categories, and the value of each element may be the probability that the action corresponding to the action suggestion belongs to the respective type. For an action suggestion, the features of the corresponding time period in the dual-stream feature are clipped out according to the start time and end time of the action suggestion, so as to obtain the segment feature of the corresponding video segment (the suggestion feature shown in fig. 5); the feature tensor of the segment feature may be represented as P × H × W × C, where P, H, W and C are, respectively, the number of frames of the video segment corresponding to the action suggestion, the height of the feature map, the width of the feature map, and the number of feature maps. Of course, the corresponding video segment may instead be clipped from the video according to the start time and end time of the action, and its features then extracted to obtain the segment feature of that video segment.
For level 1 and level 2 described above, assuming that the level-1 action suggestions are selected, this step uses the segment features of the video segments corresponding to the level-1 action suggestion hypotheses to classify those suggestions by action type; for example, P1 is classified as "basketball shooting" and P4 as "swimming". Similarly, if the level-2 action suggestions are selected, the segment features corresponding to the level-2 action suggestion hypotheses are used for action type classification; for example, P2 is classified as "floating", P3 as "dribbling", and P5 as "shooting". It should be understood that only the selected suggestions are classified in this step: if level 1 is selected in S2, the level-1 suggestions P1 and P4 are classified, while P2, P3 and P5 are ignored.
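The FC + Softmax classification step can be sketched as below; the average pooling of the P × H × W × C suggestion feature to a one-dimensional vector and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, feat_dim, num_action_types):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_action_types)   # FC layer: weight * x + bias
        self.softmax = nn.Softmax(dim=-1)                  # Softmax layer

    def forward(self, suggestion_feature):
        # suggestion_feature: (P, H, W, C) tensor clipped from the dual-stream feature;
        # average-pool it to a one-dimensional vector of length C (= feat_dim) first.
        vec = suggestion_feature.reshape(-1, suggestion_feature.shape[-1]).mean(dim=0)
        probs = self.softmax(self.fc(vec))                 # classification result vector
        return int(torch.argmax(probs)), probs             # action type label, probabilities
```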
Based on the multi-level behavior recognition scheme provided by the embodiment of the present application, action segments of different levels and categories can be obtained by segmentation according to the duration of the input video, the video segments that enter the subsequent processing steps can be determined based on the corresponding processing strategy, and the action types corresponding to the video segments can also be obtained. As an example, for a video 001, the action segment information of each screened-out video segment may include the information shown in fig. 10, including but not limited to the time information of the segment; the time information of the first video segment shown in the figure includes a start time (second 1.5) and an end time (second 5), and the information may further include the type of action in the video segment, namely "smile".
1.3. Segment selection
The purpose of this part is to screen out the segments related to the user's intention, that is, to screen the candidate video segments obtained in the foregoing sections 1.1 and 1.2 based on the intention feature, so as to obtain the target video segments. In the flow chart shown in fig. 3, segment selection is applied to the action candidate video segments; it should be noted that, in practical applications, segment selection may be applied to the person candidate video segments and/or the action candidate video segments described above. The following description takes the action candidate video segments as an example.
The degree of correlation (also simply called correlation) between two vectors can be used to measure their similarity: the greater the correlation, the more similar the two vectors. Therefore, when selecting segments based on the intention feature, one alternative is to screen segments by calculating the correlation between the user's intention feature and the segment feature of each candidate video segment.
Fig. 11 shows a schematic diagram of segment selection. As shown in the figure, based on the intention feature and the segment features (the segment feature vectors shown in the figure) of the action candidate video segments corresponding to the action suggestion hypotheses of the target level obtained by the action recognition described above, the correlation between the segment feature of each action candidate video segment and the intention feature is calculated, and the action candidate video segments with higher correlation are taken as the screened video segments, that is, the target video segments.
It should be noted that, in practical applications, when calculating the correlation between the segment feature of a candidate video segment and the intention feature, the segment feature may be: the feature of the corresponding time period extracted from the dual-stream features of the candidate video based on the start time and end time of the action suggestion hypothesis (referred to as the first feature); a segment feature obtained by extracting features from the candidate video segment again (referred to as the second feature); or a feature obtained by further feature conversion of the first feature and/or the second feature, for example the feature vector output by the FC layer during action type classification, or the classification result vector output by the Softmax layer. When calculating the correlation between the segment feature and the intention feature, if the segment feature and/or the intention feature is a multi-dimensional feature vector, it needs to be converted into a one-dimensional feature vector before the correlation is calculated. The screening scheme for the target video segments is explained in detail below.
a. Correlation calculation
Whether a segment is relevant to the user's intention can generally be determined by calculating the similarity between the segment feature vector (i.e., the form the segment feature takes) and the intention feature vector (the form the intention feature takes). Note that, in the following description of the correlation, both the segment feature vector and the intention feature vector are one-dimensional feature vectors.
As an alternative, the correlation between the intention feature and the segment feature may be computed based on an attention-based correlation calculation model, specifically by the following expression:
c(f_v, f_intention) = v^T · tanh(W·f_v + V·f_intention + b)
where f_v denotes the segment feature of the candidate video segment, f_intention is the intention feature, and c(f_v, f_intention) denotes the correlation between f_v and f_intention; W is the parameter matrix of f_v, V is the parameter matrix of f_intention, b is a bias vector, and v^T is a weight matrix of the features. tanh is the activation function, used here to normalize the feature vector to (-1, 1). Specifically, W, V, b and v^T are network parameters of the attention-based correlation calculation model and can be obtained through learning during model training; W and V are used to convert the segment feature and the intention feature into the same feature space, respectively, and v^T is used to convert the dimension of the normalized feature vector to the specified dimension. The calculation of the correlation is explained in detail below with reference to this expression:
as an example, fig. 12a is a schematic diagram illustrating a process for determining characteristics and intention characteristics of a video segment according to an embodiment of the present application, and as shown in fig. 12a, the process mainly includes the following steps:
1. Convert the feature vector of the video segment feature and the feature vector of the intention feature into the same feature space.
Since the intention feature lies in the intention feature space while the video segment feature lies in the visual feature space, vectors in different spaces cannot be correlated directly. To calculate the correlation, the video segment feature and the intention feature should first be converted into the same feature space; for example, both can be passed through the same feature extraction network again to obtain feature vectors in the same feature space, and subsequent processing is then performed based on the converted video segment feature vector and the converted intention feature vector.
As shown in fig. 12a, the feature vector of the video segment feature is assumed to be an n × 1 feature vector, i.e., an n-dimensional column vector, corresponding to f_v (n × 1) in the figure, and the feature vector of the intention feature is a d × 1 feature vector, i.e., a d-dimensional column vector, corresponding to f_intention (d × 1) in the figure; the arrows in the figure represent the direction vectors of the corresponding feature vectors. Through the feature conversion parameter matrices, f_v (n × 1) and f_intention (d × 1) are converted into the same feature space A, yielding the feature vector W·f_v (m × 1) corresponding to f_v (n × 1) in feature space A and the feature vector V·f_intention (m × 1) corresponding to f_intention (d × 1) in feature space A; that is, the converted feature vectors are m-dimensional column vectors. Specifically, W is an m × n matrix, i.e., a parameter matrix with m rows and n columns, and f_v (n × 1) is a matrix with n rows and 1 column, so their product is a new matrix with m rows and 1 column, namely W·f_v (m × 1); likewise, V is an m × d matrix, and multiplying it with f_intention (d × 1) gives a matrix with m rows and 1 column, namely V·f_intention (m × 1).
After the conversion, the initial correlation of the two feature vectors can be calculated in the same feature space A. For two feature vectors, the greater their correlation, the more similar their directions should be in that feature space; conversely, the greater the difference in their directions, the smaller their correlation should be. Thus, the correlation of the two vectors can be characterized by their sum. As shown in fig. 12B and 12C, the sum of vector A and vector B is vector C, and the sum of vector D and vector E is vector F; the directions of vector D and vector E are the same, while the directions of vector A and vector B differ greatly, so the magnitude of vector F is larger than that of vector C, and the correlation between vector D and vector E is larger than that between vector A and vector B.
As an example, assuming that the user's intention is to generate a video regarding swimming, wherein one video segment 1 is a video segment regarding freestyle swimming and the other video segment 2 is a segment regarding basketball, the length of the sum vector of the feature vector of the segment feature of the video segment 1 and the feature vector of the user's intention feature mapped to the same feature space is greater than the length of the sum vector of the feature vector of the segment feature of the video segment 2 and the feature vector of the intention feature.
Thus, for the feature vectors W·f_v (m × 1) and V·f_intention (m × 1), their correlation can be determined through W·f_v + V·f_intention + b, where the feature vector b, i.e., the bias, is a feature offset that makes the correlation calculation formula more robust in extreme cases such as W·f_v + V·f_intention being 0. W·f_v + V·f_intention + b means adding the feature values at the corresponding positions of the two feature vectors W·f_v (m × 1) and V·f_intention (m × 1) and then adding the feature vector b, giving a new m-dimensional column vector, which can be understood as the correlation vector of the segment feature and the intention feature.
Further, since the sum of the feature vectors varies over a wide range, the network model is difficult to learn; it can therefore be normalized to (-1, 1), for example by the activation function tanh mentioned above. Then, the normalized m-dimensional column vector tanh(W·f_v + V·f_intention + b) can be converted by the weight matrix v^T into a T-dimensional column vector; specifically, v^T is a T × m matrix, i.e., T rows and m columns, and the matrix multiplication v^T · tanh(W·f_v + V·f_intention + b) gives the correlation result c(f_v, f_intention), a T × 1 matrix, i.e., a T-dimensional column vector. Optionally, for convenience of calculation, T may be 1, in which case c(f_v, f_intention) is a single value; if T is an integer greater than 1, c(f_v, f_intention) is a column vector.
Of course, as another alternative, the result of tanh(W·f_v + V·f_intention + b) may be used directly as c(f_v, f_intention).
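A minimal Python sketch of the correlation calculation c(f_v, f_intention) = v^T · tanh(W·f_v + V·f_intention + b), assuming the learned parameters are already given and taking the output dimension T = 1; the dimensions n, d and m below are example values, not values fixed by the embodiment.

import numpy as np

def correlation(f_v, f_intention, W, V, b, v):
    """c(f_v, f_intention) = v^T * tanh(W*f_v + V*f_intention + b)."""
    fused = W @ f_v + V @ f_intention + b   # both features mapped into the same m-dimensional space, plus bias b
    return v @ np.tanh(fused)               # tanh normalizes to (-1, 1); v has shape (T, m), here T = 1

# Assumed dimensions: n = 512 (segment feature), d = 128 (intention feature), m = 256, T = 1.
rng = np.random.default_rng(1)
n, d, m = 512, 128, 256
f_v, f_int = rng.standard_normal(n), rng.standard_normal(d)
W = rng.standard_normal((m, n)) * 0.01
V = rng.standard_normal((m, d)) * 0.01
b, v = np.zeros(m), rng.standard_normal((1, m)) * 0.01
print(correlation(f_v, f_int, W, V, b, v).item())   # a single correlation value since T = 1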
b. Selection of video segments related to user intent
After the correlation corresponding to each candidate video segment is obtained in step a, the video segments whose correlation is greater than threshold 1 may be selected as the screened video segments. As an optional example, fig. 13 shows a schematic flow of screening each candidate video segment based on the correlation (the correlation value shown in the figure): for each candidate video segment, a correlation value is calculated based on its segment feature vector and the intention feature vector; if the correlation value is greater than threshold 1, the corresponding video segment is regarded as a screened video segment, and otherwise it is filtered out. The specific value of threshold 1 may be determined according to empirical and/or experimental values; for example, it may be set to the average correlation over the training data, that is, the average of the correlations between the segment features of the candidate video segments in the training data and the user intention features.
It should be noted that, when the correlation result is a numerical value, threshold 1 may be a specific numerical value; when the correlation result is a column vector, threshold 1 may be a column vector of the same dimension, i.e., a reference vector, and whether the corresponding candidate video segment can be used as a target video segment may be further determined by calculating the similarity between the two column vectors. For example, when the distance between the two column vectors is smaller than a set distance, the corresponding candidate video segment may be determined as a target video segment.
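For the scalar case (T = 1), the threshold-based screening of step b can then be sketched as follows, reusing the correlation function from the sketch above; threshold_1 and the dictionary-style segment records are assumptions made for the example.

def select_by_threshold(candidates, f_intention, params, threshold_1):
    """Keep the candidate segments whose correlation with the intention feature exceeds threshold 1."""
    selected = []
    for seg in candidates:                                   # each seg carries a one-dimensional segment feature
        score = correlation(seg["feature"], f_intention, *params).item()
        if score > threshold_1:                              # threshold 1 may be the average correlation on training data
            selected.append({**seg, "score": score})         # keep the segment together with its correlation value
    return selected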
The segment diversity processing shown in fig. 13 is optional. It may be skipped, in which case the video segments screened based on the correlation are used as the target video segments; or it may be performed after the correlation-based screening, in which case the video segments obtained from this step are used as the target video segments. This segment diversity processing step is described below.
c. Optional segment diversity processing
Assume that the user wants to see a video about "basketball". After the correlation calculation in the previous step, most of the screened segments may be about shooting, because the video segments with high correlation to the intention tend to be similar to each other; if the target video is generated directly from these segments, the user will feel that the video content is rather monotonous. Considering that the user may want to see various material about basketball, the embodiment of the present application designs a segment diversity processing method, whose principle is to compare, by correlation calculation, each candidate video segment to be processed with the segments that have already been screened out.
The core idea of the segment diversity processing method is to calculate the correlation between pairs of segments: the smaller the correlation value, the more the segment differs from the other segments. By searching for the segment with the minimum correlation, segments with contents as different as possible can be found. Taking basketball as an example, segments belonging to different sub-categories of the basketball category, such as shooting, dribbling and capping, can finally be screened out. A detailed flow diagram of the method is shown in fig. 14; as shown in the figure, the detailed process may include the following steps:
1) From the candidate video segments obtained in step b above (the video segment candidates shown in the figure), select the segment with the largest correlation value as the initial screened segment, forming screened set I; that is, the candidate video segment whose segment feature has the largest correlation with the intention feature is used as the initial screened video segment.
2) Calculate the correlation between each remaining candidate video segment and the segments that have already been screened. If the correlation is not less than the set threshold 2, the segment is regarded as similar to a screened segment and can be ignored; otherwise, among all segments whose correlation is less than threshold 2, the segment with the minimum correlation is selected and added to screened set II.
3) Repeat step 2), i.e., calculate the correlation between the remaining candidate video segments and the segments already screened (all segments in set I and the segments selected in step 2), i.e., set II), and select the segment whose correlation is less than threshold 2 and is the minimum, continuing until all candidate video segments have been processed. The finally desired video segments are the segments in set I and set II.
Threshold 2 may be set in the same way as threshold 1, that is, based on the training data.
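Assuming a function pair_correlation(a, b) that measures the correlation between the segment features of two segments (for example, reusing the attention-based correlation model above), the diversity processing of steps 1) to 3) can be sketched as the following greedy loop; taking the maximum correlation against the already screened segments in step 2) is one possible reading of the description, not the only one.

def diversity_select(selected, pair_correlation, threshold_2):
    """Greedy segment-diversity selection over the segments kept in step b.

    Each element of `selected` carries a "feature" (segment feature) and a "score"
    (its correlation with the intention feature).
    """
    remaining = sorted(selected, key=lambda s: s["score"], reverse=True)
    screened = [remaining.pop(0)]                      # 1) segment with the largest correlation -> set I
    while remaining:
        best_idx, best_corr = None, None
        for i, seg in enumerate(remaining):            # 2) compare against every already screened segment
            corr = max(pair_correlation(seg["feature"], s["feature"]) for s in screened)
            if corr < threshold_2 and (best_corr is None or corr < best_corr):
                best_idx, best_corr = i, corr          # keep the least similar segment below threshold 2
        if best_idx is None:                           # the remaining segments are all too similar -> ignored
            break
        screened.append(remaining.pop(best_idx))       # 3) add it to set II and repeat
    return screened                                    # set I plus set II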
1.4. Segment fusion
By combining the information of the person video slices (i.e., the person video segments) and the action video slices (i.e., the action video segments), the person-action segment information, i.e., the video segments in which a certain person performs a certain action, can be obtained by intersecting the two slice sets. As an alternative fusion method, for slices belonging to the same candidate video, fusion can be performed based on the common time period of a person video segment and an action video segment. For example, in the person slice shown in fig. 15, person A appears in video 001 from 0.5 s to 5 s, and from 1.5 s to 5 s of video 001 the action is "smile"; fusing the two slices gives the information of the optimal video segment (the fused video segment): person A - video 001 - 1.5 to 5 s - smile, i.e., person A smiles in the segment from the 1.5th to the 5th second of video 001. Similarly, all person slices and action slices can be processed to obtain the optimal video segments, i.e., the video segments containing a certain action of a certain person.
It can be understood that, in practical applications, the segment fusion rule can be configured according to actual requirements; the fusion manner described above is only one option. For example, a duration threshold may also be configured: for a person segment a and an action segment b belonging to the same video, if their common duration in that video is greater than the duration threshold, action segment b can be regarded as an action segment of the person in person segment a, and the parts of segment a and segment b corresponding to the common duration can be fused to obtain a video segment of a certain person doing something. Furthermore, under this manner, the same action segment may correspond to one or more person segments; for example, for a video of a gathering of multiple people, there are likely to be video segments of multiple people doing the same action in the same time period, e.g., several people smiling at the same time, so that the smiling action segment may correspond to each of those people.
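A sketch of the intersection-based fusion described above, using the example of person A and the "smile" action in video 001; the dictionary keys and the optional min_overlap threshold are illustrative assumptions.

def fuse_slices(person_slices, action_slices, min_overlap=0.0):
    """Fuse person slices and action slices of the same video by their common time period."""
    fused = []
    for p in person_slices:
        for a in action_slices:
            if p["video_id"] != a["video_id"]:
                continue
            start = max(p["start"], a["start"])          # common time period of the two slices
            end = min(p["end"], a["end"])
            if end - start > min_overlap:                # long enough to count as "this person doing this action"
                fused.append({"video_id": p["video_id"], "person": p["person"],
                              "action": a["action"], "start": start, "end": end})
    return fused

# Example from the description: person A appears in video 001 from 0.5 s to 5 s,
# and the action "smile" occupies 1.5 s to 5 s of the same video.
person_slices = [{"video_id": "001", "person": "A", "start": 0.5, "end": 5.0}]
action_slices = [{"video_id": "001", "action": "smile", "start": 1.5, "end": 5.0}]
print(fuse_slices(person_slices, action_slices))         # person A - video 001 - 1.5 to 5 s - smile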
In summary, the video generation method based on multi-level motion recognition provided by the embodiment of the present application has at least the following advantages:
The multi-level action recognition can adaptively determine the segments and levels of actions according to the video duration input by the user. When the user needs a longer generated video, the scheme can quickly segment out longer video segments and obtain coarse-grained action categories; for a stricter duration constraint, the scheme can cut out fine-grained action video segments. This adaptive segmentation has two advantages. On the one hand, classification can be performed quickly and accurately on the action suggestions of a single level only, so action segments can be classified quickly and accurately, the amount of subsequent data processing can be effectively reduced, and video generation efficiency is improved; in contrast, existing action detection techniques classify action suggestions of all granularities, which is slow and, because of the large number of suggestions to be classified, seriously affects classification accuracy. On the other hand, for a shorter video generation requirement, fine-grained action type classification can be selected, so that the categories can be fine and diverse, and the generated video content is ultimately more varied and attractive.
A segment selection process is used to screen the video segments relevant to the user's intention. Compared with the prior art, the scheme provided by the embodiment of the application can obtain the video segments the user wants, better meets user requirements, has great practical value, and can greatly improve user experience. In addition, the application also takes the diversity of video content into account. In the video segment aggregation processing, key semantic information about the people and actions in the video can be obtained, i.e., segments of someone doing something can be cut out, which is the core content of video understanding; therefore, a target video that better meets the user's needs can be generated, which existing schemes cannot do.
Based on the above video segment aggregation and video generation scheme provided by the present application, at least the following technical effects can be obtained:
-obtaining a segmented piece of someone doing something
-collecting the video segments required by the user
- the segmented segments may have diverse video content. For example, where a user wishes to see a "basketball" video, the segmented video segments may include various fine-grained basketball actions, such as "shooting", "capping" and "dribbling", rather than only shooting videos.
It should be noted that the multi-level action recognition, segment selection, person recognition and other processing methods mentioned in the above embodiments of the present application may be used independently or in combination. The video generation scheme provided by the embodiment of the application is applicable to scenarios and functions such as video editing, video story generation and video retrieval; it has broad application prospects in hardware products with video playing functions, such as smartphones, tablet computers and computers, and can also be applied to background servers, where the server generates various videos with rich content according to user requirements. Based on the scheme of the embodiment of the application, the user's video experience can be greatly improved.
For example, based on the scheme provided by the embodiment of the application, the user's intention can be understood quickly and accurately, the video the user wants can be found based on that intention, and accurate video retrieval is realized; related videos can be cut into short video pieces with a certain inherent continuity, enabling fast video editing; special video effects can also be added intelligently according to the generated video content to realize intelligent creation, such as adding special effects to the generated video or replacing certain content in the video picture. Specifically, for a generated video of a specified person riding at the seaside, the user can further create on top of it, for example editing it into a video of riding in a rural scene. Therefore, based on the scheme provided by the embodiment of the application, users can obtain videos with rich and diverse content according to their respective needs, every moment of the generated video is interesting and memorable, and valuable memory videos are provided for the user.
The following describes an application of the video generation scheme provided in the embodiment of the present application with reference to a specific application scenario:
a. Application scenario 1 - multi-level action retrieval of videos in a smartphone
Assuming that a user wishes to retrieve a video related to basketball in an album of a mobile phone, based on the video generation scheme provided by the embodiment of the present application, the user can obtain the following experience:
- the user inputs the content to be retrieved, by voice or by direct input, e.g., the term "basketball" as shown in fig. 16;
- the mobile phone program performs correlation calculation on the videos stored in the phone album to find the needed videos, that is, it pre-screens the videos in the phone based on the user requirement to obtain candidate videos, and then applies the multi-level action recognition algorithm to these videos, as in the flow shown in fig. 16. As a result of the multi-level action recognition, the following video segments can be obtained:
the photo album has 30 segments of long video related to basketball shooting, wherein the shooting segment is 11 segments, the dribbling segment is 16 segments, and the cap segment is 3 segments.
This application scenario is very convenient and practical for a user who likes basketball: he can quickly look through his previous shooting-related videos, select the highlight moments he likes and share them with friends, or look back at the scene and recall his athletic performance at that time.
b. Application scenario 2 - person video retrieval in a smartphone
Assuming that the user wishes to retrieve videos of his child in the photo album of the mobile phone, the user may get the following experience:
-the user specifies information (such as an avatar, etc.) of the person to be retrieved;
- the mobile phone program performs person recognition on the videos in the album to find the videos containing that person, and then applies the multi-level action recognition algorithm to these videos. As shown in fig. 17, based on the person recognition and multi-level action recognition processing, the following video segments of the user-specified person can be returned:
a 30-second basketball segment in the album (not shown), a 15-second riding segment, a 22-second segment of playing in the sand, etc.
This function is very convenient and practical for the user: he can find the people he is interested in, look through the related highlight moments, and then share them with friends or revisit sweet memories. It is particularly suitable for picking out video segments of loved ones and family from a large amount of disorganized videos.
As an example, the video segment aggregation module provided in this embodiment of the present application may be deployed on a smartphone. After receiving a video generation request from the user, a smartphone with this module deployed may perform person recognition and multi-level action recognition on the multiple user videos stored on the phone, obtain video segments, and display them on the terminal interface. For example, each video segment may show a segmentation result of a certain person doing something; specifically, the person information appearing in the user's videos (such as person avatars) may be displayed on the terminal screen, the user may tap a person's avatar to enter the corresponding personal page showing the related video segments of that person, and the video segments may be classified by action (scene) category, with the different categories of video segments shown on the terminal screen.
Therefore, based on the scheme provided by the embodiment of the application, the people and segments appearing in the videos of the terminal device's album can be displayed, the action classification labels of the segments can be viewed, a highlight video of a given duration can be generated automatically, and the generated interesting video can be further shared.
The scheme of the embodiment of the application can also be applied to an application program with a video generation function, and related information of the video segments used to generate the target video can be displayed through the user interface of the application program, such as the people appearing in the video, highlight video segments of a person, and all action video segments of a specific person.
The attention-based video generation method provided in the embodiments of the present application is described in detail below with reference to an example.
As can be seen from the foregoing description, existing ranking-based video generation methods generally do not consider the semantic information of the video or the topic of the video content, and rely only on simple rules. This existing approach has two main problems. First, different rules need to be summarized to deal with different application scenarios, and for some special or complex scenarios it is very difficult to extract and summarize video generation rules from complicated and tedious data. Second, the user's videos, i.e., the candidate videos, are often rich in topics and content, possibly covering travel, family gatherings, daily life and so on; the existing scheme needs to assign a rule to each video topic, which results in a huge workload, and extracting the rules may occupy a large amount of the device's memory. It is therefore difficult for the existing simple-rule-based video generation approach to generate a video that represents the user's numerous video contents, and the generated video may not meet the user's needs.
To solve this problem, the embodiment of the present application provides an attention-based video generation method, which can automatically edit video segments based on the video content, the user's intention and the spatio-temporal characteristics of the video segments, so as to generate the video the user wants.
As an example, fig. 18 shows a flowchart of an attention-based video generation method, which may be implemented by a video generation network with an attention mechanism (the attention module shown in the figure). As shown in the figure, the video generation network in this example (the attention-based LSTM decoding network shown in the figure) includes an attention module, an LSTM (long short-term memory) network and a generator. The input of the video generation network includes the segment features of the video segments to be processed (which may be the candidate video segments, the video segments selected from the candidate video segments based on the correlation between the intention feature and the candidate video segments, or the video segments after segment diversity processing), the user intention feature, and the length of the video the user wants to generate (i.e., the video duration in the foregoing, the input video duration); the output is the video the user wants, i.e., the target video. The attention module combines the user intention (specifically, the intention feature) with the video content (specifically, the segment features of the candidate video segments, which may also be called segment feature vectors), and the LSTM network finds the optimal segment arrangement order from rich training data while taking the internal connections between the segments into account.
The segment features of the video segments are obtained in the video aggregation module described above. Assuming that T video segments are obtained, their segment features are h_1, ..., h_T, where h_j (1 ≤ j ≤ T) denotes the segment feature of the j-th video segment; the user intention feature is denoted f_intention, and the length of the video the user wants, i.e., the video duration, is L. The specific steps of the scheme are as follows:
a. This step is optional: each video segment is filtered by a person-based filter (Person Filter). For example, if the user wants to generate a video containing specified people (e.g., the user provides images of the target people to be included in the generated video), the filter can be used to filter the video segments; otherwise the filter may be skipped.
b. The desired video is generated by an attention-based LSTM decoding network.
Specifically, the attention module may calculate the attention weight of each video segment using the following formula:
e_{t,j} = v^T · tanh(W·s_{t-1} + V·h_j + K·f_intention + b)
where e_{t,j} denotes the attention weight of the j-th video segment at time t of the LSTM network; s_{t-1} denotes the hidden state variable (which may also be called the hidden state feature, hidden state feature vector or hidden representation) of the LSTM network at time t-1, e.g., s_1, s_2 and s_{n-1} shown in fig. 18 denote the hidden state variables at the initial time, the second time and the (n-1)-th time, respectively; W, V and K denote the parameter matrices of s_{t-1}, h_j and f_intention, respectively; b is the bias of the attention module; tanh is the activation function, used here to normalize the feature vector to (-1, 1); and v^T is the weight matrix of the attention module, used to convert the feature dimension of the normalized feature vector to the specified dimension. W, V, K and v^T are network parameters of the attention module and can be obtained through training.
It should be noted that, in the description of the LSTM in this example, the described time or moment (e.g., time t) is a relative concept referring to each time step of the neural network processing, corresponding to each operation executed in sequence as described above; for example, s_{t-1} can be understood as the hidden state of the (t-1)-th operation, and e_{t,j} as the attention weight of the j-th video segment at the t-th operation.
As an example, fig. 19 is a schematic flowchart illustrating a method for calculating attention weight based on user intention characteristics provided in an embodiment of the present application, and as shown in fig. 19, the flowchart mainly includes:
first, in order to calculate the attention weight, it is necessary to convert the features of different feature spaces into the same feature space to calculate their correlation, and then convert the feature vector into T × 1 to obtain the attention weight of each input video segment.
Specifically, as shown in fig. 19, taking the j-th video segment as an example, the feature vector of the segment feature of the video segment is an l × 1 feature vector in the video feature space, corresponding to h_j (l × 1) in the figure; the feature vector of the intention feature is a d × 1 feature vector in the user intention feature space, i.e., a d-dimensional column vector, corresponding to f_intention (d × 1) in the figure; and s_{t-1} is a state feature vector in the state feature space with feature dimension n × 1. Through the feature space conversion parameter matrices W, V and K, s_{t-1}, h_j and f_intention can be converted into the same feature space (such as feature space A shown in the figure), giving converted feature vectors of dimension m × 1, namely W·s_{t-1} (m × 1), V·h_j (m × 1) and K·f_intention (m × 1) shown in the figure; the arrows in the figure represent the direction vectors of the corresponding feature vectors.
After converting the feature vectors into the same feature space, the correlation among the three can be calculated through W·s_{t-1} + V·h_j + K·f_intention + b. As can be seen from the earlier description of feature correlation, if the correlation among multiple feature vectors is large, the feature vectors should have relatively close directions in the same feature space and their sum vector should be relatively large; the bias vector (the feature vector b shown in the figure) is an offset that also makes the expression more robust in extreme cases such as W·s_{t-1} + V·h_j + K·f_intention being 0.
Since the sum of the feature vectors varies over a wide range, it is more difficult for the network to learn the parameters of the attention module during training; therefore, the sum calculated above can be normalized to (-1, 1) through the activation function tanh(), i.e., tanh(W·s_{t-1} + V·h_j + K·f_intention + b) is calculated to obtain an m × 1 feature vector whose element values lie in (-1, 1). Then, the dimension of the feature vector corresponding to the correlation is further converted from m × 1 to T × 1 through the feature dimension conversion matrix v^T.
After determining the attention weight of each video segment at the current time t, the attention weight may be normalized to [0, 1], where the normalization formula is:
α_{t,j} = exp(e_{t,j}) / Σ_{k=1}^{T} exp(e_{t,k}), where α_{t,j} is the normalized attention weight of the j-th video segment at time t.
at this time, the feature vector input to the LSTM network, i.e., the feature vector output by the attention module, is a weighted sum of the segment feature vectors, and the weighted video segment feature input to the LSTM network at time t is:
g_t = Σ_{j=1}^{T} α_{t,j} · h_j
In the example shown in fig. 18, g_1, g_2 and g_n denote the weighted video segment features at the initial time, the second time and the n-th time, respectively, i.e., the weighted video segment features corresponding to the first, second and n-th operations.
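A minimal Python sketch of one attention step, computing e_{t,j}, the normalized weights and the weighted feature g_t; here v maps the m-dimensional normalized vector to a scalar per segment so that stacking over j gives the T weights, and a softmax is used as the [0, 1] normalization, both of which are assumptions of this sketch rather than details fixed by the embodiment.

import numpy as np

def attention_step(s_prev, H, f_intention, W, V, K, b, v):
    """One attention step: weights over the T segment features in H and the weighted feature g_t."""
    e = np.array([v @ np.tanh(W @ s_prev + V @ h_j + K @ f_intention + b) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                 # normalized attention weights in [0, 1], summing to 1
    g_t = (alpha[:, None] * H).sum(axis=0)      # weighted sum of the segment features
    return alpha, g_t

# Assumed dimensions: T = 10 segments with l = 512-dimensional features, state dimension n = 256,
# intention dimension d = 128, common feature space m = 256.
rng = np.random.default_rng(2)
T, l, n, d, m = 10, 512, 256, 128, 256
H = rng.standard_normal((T, l))
s_prev, f_int = rng.standard_normal(n), rng.standard_normal(d)
W = rng.standard_normal((m, n)) * 0.01
V = rng.standard_normal((m, l)) * 0.01
K = rng.standard_normal((m, d)) * 0.01
b, v = np.zeros(m), rng.standard_normal(m) * 0.01
alpha, g_t = attention_step(s_prev, H, f_int, W, V, K, b, v)
print(alpha.shape, g_t.shape)                   # (10,) and (512,)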
Through the LSTM network and the generator, the index of the video segment selected at time t can be obtained by the following calculation:
s_t = LSTM(y_{t-1}, g_t, s_{t-1})
y_t = Generate(s_t, g_t)
where, at time t, the input information of the LSTM network includes y_{t-1}, s_{t-1} and g_t, and its output is s_t, i.e., the hidden state feature at time t, representing the information recorded by the LSTM network up to that moment; y_{t-1} denotes the index (i.e., identifier) of the video segment output by the generator at the previous moment, i.e., at time t-1, which is the index of the target video segment determined by the previous operation. The input information of the generator (Generate) includes s_t and g_t, and its output is the index of the video segment selected at this moment, i.e., y_t. For example, if the 3rd input video segment is selected at time t, then y_t is 3. In the example shown in fig. 18, y_1, y_2 and y_n denote the indices of the selected video segments at the initial time, the second time and the n-th time, respectively.
c. If at time t the length of the generated video is greater than or equal to the video length L desired by the user, the generation network stops; at this point, the video formed by arranging the selected video segments in time order is output as the generated video, i.e., the target video.
In the video generation scheme provided by the embodiment of the application, the main input of the attention-based LSTM decoding network is the sliced video segments, and the output is the video composed of the selected video segments. Assume there are 10 input video segments, all related to "basketball", where the 1st is a "shooting" segment, the 3rd is a "dribbling" segment, and the 5th is a "capping" segment, and the goal of the network is to generate a video containing "dribbling" (t-2) -> "shooting" (t-1) -> "capping" (t); that is, the three video segments of the target video in time order are the dribbling, shooting and capping segments. Then the output of the LSTM network should be 3 (t-2) -> 1 (t-1) -> 5 (t), i.e., the index 3 of the "dribbling" segment is output at time t-2, the index 1 of the "shooting" segment at time t-1, and the index 5 of the "capping" segment at time t. Specifically, taking time t as an example, the input g_t of the LSTM network is the weighted feature of the 10 input video segments, where the weights are calculated from the input video features, the user intention feature and the hidden state variable of the LSTM at time t-1; y_{t-1} is the index of the "shooting" segment selected at the previous moment, which should be 1; and s_{t-1} is the information recorded by the LSTM from the start up to time t-1, which can be simply understood as remembering that a "shooting" segment was just selected. Then at time t the output of the LSTM network is the index of the "capping" video segment, which here should be 5.
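Putting the pieces together, the decoding loop of steps b and c can be sketched as follows, reusing attention_step from the sketch above; lstm_step and generate stand in for the LSTM network and the generator, whose internals are not specified here, and durations is an assumed array of per-segment lengths used for the stopping condition.

import numpy as np

def generate_video(segments, durations, f_intention, L, params, lstm_step, generate):
    """Attention-based decoding: pick segment indices until the requested video length L is reached.

    segments is the (T, l) array of segment features h_1..h_T; lstm_step(y_prev, g_t, s_prev) -> s_t
    and generate(s_t, g_t) -> index play the roles of the LSTM network and the generator.
    """
    s_prev = np.zeros(params["state_dim"])      # initial hidden state
    y_prev = -1                                 # no segment has been selected before the first operation
    chosen, total = [], 0.0
    while total < L:
        _, g_t = attention_step(s_prev, segments, f_intention,
                                params["W"], params["V"], params["K"],
                                params["b"], params["v"])
        s_t = lstm_step(y_prev, g_t, s_prev)    # s_t = LSTM(y_{t-1}, g_t, s_{t-1})
        y_t = generate(s_t, g_t)                # y_t = Generate(s_t, g_t)
        chosen.append(y_t)                      # record the index of the selected segment
        total += durations[y_t]                 # accumulate the generated video length
        s_prev, y_prev = s_t, y_t
    return chosen                               # e.g. [3, 1, 5]: dribbling -> shooting -> capping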
As can be seen from the above description, the attention-based video generation method proposed in the embodiment of the present application mainly includes two parts: an attention mechanism and an LSTM network. The scheme adds the user intention to the calculation of the attention weights, whereas existing video generation methods do not include this feature in the attention weight calculation; in addition, the attention-based LSTM network is used to generate the video the user wants, which achieves a better video generation effect than rule-based methods.
Specifically, suppose, for example, that 4 videos were taken at the same place at different times (spring 2016, autumn 2017, winter 2018, summer 2019). With the existing time-ordering method, only a video arranged in the fixed order "spring 2016 -> autumn 2017 -> winter 2018 -> summer 2019" can be obtained, whereas the attention-based LSTM network can learn richer arrangement orders from training data. In addition, the LSTM network may learn many movie-level cutting and splicing techniques, such as switching between close-range and long-range shots, or the more complicated montage technique of inserting a segment depicting a mood scene into the main video storyline, and so on. Therefore, based on the attention-based video generation manner (incorporating the user intention feature) provided by the embodiment of the application, a video with better effect that better matches the user's expectation can be obtained.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application also provides a video generation apparatus, as shown in fig. 20, the video generation apparatus 100 includes an intention extraction module 110 and a video generation module 120. Wherein: an intention extraction module 110 for extracting an intention feature of the video generation request;
and a video generation module 120, configured to generate a target video based on the intention features and the candidate videos.
Optionally, the intent feature includes an action intent feature.
Optionally, the video generating module 120 is specifically configured to:
respectively extracting the video characteristics of each candidate video;
determining candidate video clips in each candidate video based on the video characteristics of each candidate video;
screening each candidate video clip based on the intention characteristics to obtain each target video clip;
generating a target video based on each target video segment.
Optionally, for a candidate video, when determining a candidate video segment in the candidate video based on the video feature of the candidate video, the video generation module 120 may be specifically configured to perform at least one of the following:
performing object recognition on the candidate video based on the video characteristics of the candidate video to obtain each first candidate video segment containing a target object;
and based on the video characteristics of the candidate video, performing action behavior identification on the candidate video to obtain corresponding time period information of each video segment containing the action in the candidate video, and based on each time period information, obtaining each second candidate video segment.
Optionally, if the target video segments include a first candidate video segment and a second candidate video segment that belong to the same candidate video, the video generation module 120, when generating the target video based on the target video segments, may specifically be configured to:
fusing a first candidate video clip and a second candidate video clip which belong to the same candidate video in each target video clip;
and generating a target video based on the fused video clips.
Optionally, when obtaining each second candidate video segment based on each time period information, the video generating module 120 is configured to:
determining the segment duration of the video segment corresponding to each time interval information;
determining the grade of each time period information based on the segment duration corresponding to each time period information;
and determining a second candidate video clip from the video clips corresponding to the time interval information based on the grade of the time interval information.
Optionally, when determining the second candidate video segment from the video segments corresponding to the time interval information based on the level of the time interval information, the video generating module 120 is configured to:
determining a target grade in each grade;
and determining each video clip corresponding to each time interval information belonging to the target level as a second candidate video clip.
Optionally, if the video generation request includes the video duration of the target video, the video generation module 120 is configured to, when determining a target level of the levels:
and determining a target grade in each grade based on the video time length and the time length threshold value corresponding to each grade.
Optionally, for a piece of time period information, when determining the level of the time period information based on the segment duration corresponding to the time period information, the video generating module 120 may be specifically configured to perform at least one of the following:
determining a time interval to which the segment time length corresponding to the time interval information belongs, and determining a grade corresponding to the time interval to be the grade of the time interval information, wherein each grade corresponds to a respective time interval;
and determining the grade corresponding to the time length threshold value closest to the segment time length corresponding to the time interval information as the grade of the time interval information.
Optionally, two adjacent levels correspond to a common transition duration interval; for a piece of time period information, when determining the duration interval to which the segment duration corresponding to the time period information belongs and determining the level corresponding to the duration interval as the level of the time period information, the video generation module 120 is specifically configured to:
and determining two adjacent levels corresponding to the transition duration interval as the level of the time interval information when the segment duration corresponding to the time interval information belongs to the transition duration interval of the two adjacent levels.
Optionally, when the video generation module 120 determines the target level in each level based on the video duration and the duration threshold corresponding to each level, the video generation module is specifically configured to:
determining a video limit duration based on the video duration;
and determining the target grade in each grade according to the video limiting time length and the time length threshold value corresponding to each grade.
Optionally, the levels are in a sequence from high to low, and the time length threshold corresponding to the current level is not less than the time length threshold corresponding to the next level; when determining the target level in each level according to the video limit duration and the duration threshold corresponding to each level, the video generation module 120 is specifically configured to:
and sequentially comparing the video limiting time length with the time length threshold corresponding to the current grade according to the sequence of the grades in each grade from high to low until the video limiting time length is not less than the time length threshold corresponding to the current grade, and determining the current grade as the target grade.
Optionally, the levels are in a sequence from high to low, and the time length threshold corresponding to the current level is not less than the time length threshold corresponding to the next level; when determining the target level in each level according to the video limit duration and the duration threshold corresponding to each level, the video generation module 120 is specifically configured to:
sequentially performing the following processing according to the sequence of the levels from high to low in each level until a target level is determined:
if the video limiting time length is not less than the time length threshold value corresponding to the current level, determining the current level as a target level;
and if the video limitation duration is less than the duration threshold corresponding to the current level, determining the target level or entering the next level for processing according to the first number of the time period information of which the corresponding segment duration in each time period information belonging to the current level is not greater than the video limitation duration and the second number of the time period information of which the corresponding segment duration in each time period information belonging to the next level is not greater than the video limitation duration.
Optionally, when determining each video segment corresponding to each time period information belonging to the target level as the second candidate video segment, the video generating module 120 is specifically configured to:
and determining each video clip corresponding to each time interval information of which the clip duration is less than the video limit duration in each time interval information belonging to the target level as a second candidate video clip.
Optionally, when the video generating module 120 determines the target level or enters the processing of the next level according to the first number and the second number, the video generating module is specifically configured to:
if the first quantity is not less than the second quantity, determining the current level as a target level;
if the first number is smaller than the second number and the next level is the last level, determining the next level as the target level;
if the first number is less than the second number and the next level is not the last level, processing of the next level is entered.
Optionally, the video generating module 120 is configured to, when screening each candidate video segment based on the intention features to obtain each target video segment:
acquiring video clip characteristics of each candidate video clip;
respectively determining the relevance of the intention characteristic and the video segment characteristic of each candidate video segment;
and screening the candidate video clips based on the correlation corresponding to each candidate video clip to obtain each target video clip.
Optionally, the video generating module 120 is configured to, when filtering each candidate video segment based on the intention features to obtain each target video segment:
Acquiring video clip characteristics of each candidate video clip;
determining the weight of each video clip to be processed based on the intention characteristics and the video clip characteristics of each video clip to be processed; each video clip to be processed is a candidate video clip, or each video clip obtained by screening each candidate video clip based on the relevance of the intention characteristic and the video clip characteristic of each candidate video clip;
and screening out each target video clip based on the video clip characteristics and the weight of each video clip to be processed.
Optionally, the video generating module 120 is specifically configured to, when determining the weight of each to-be-processed video segment based on the intention feature, and screening out each target video segment based on the video segment feature of each to-be-processed video segment and the weight of each to-be-processed video segment:
based on the intention characteristics and the video clip characteristics of each video clip to be processed, the following operations are sequentially executed to obtain a target video clip corresponding to each operation:
determining the weight of each video clip to be processed of the current operation based on the intention characteristics, the video clip characteristics of each video clip to be processed and the weight of each video clip to be processed determined by at least one operation before the current operation;
and screening out the target video clip corresponding to the current operation based on the video clip characteristics of each video clip to be processed, the weight of each video clip to be processed of the current operation and the target video clip determined by at least one operation before the current operation.
Optionally, the video generating module 120 is specifically configured to, when determining the weight of each to-be-processed video segment based on the intention feature and the video segment feature of each to-be-processed video segment, and screening out the target video segment corresponding to each time based on the video segment feature and the weight of each to-be-processed video segment:
for the first operation, determining the weight of each to-be-processed video clip of the first operation based on the intention characteristics and the clip characteristics of each to-be-processed video clip; obtaining weighted video segment characteristics of the first operation based on the weight of each to-be-processed video segment of the first operation; obtaining a hidden state characteristic of the first operation based on the weighted video segment characteristic of the first operation; obtaining a target video clip of the first operation based on the hidden state feature and the weighted video clip feature of the first operation;
for other operations except the first operation, determining the weight of each to-be-processed video clip of the current operation based on the hidden state feature and the intention feature of the last operation and the video clip feature of each to-be-processed video clip; obtaining the weighted video segment characteristics of the current operation based on the weight of each to-be-processed video segment of the current operation; obtaining the hidden state feature of the current operation based on the hidden state feature of the previous operation, the weighted video clip feature of the current operation and the target video clip of the previous operation; and obtaining a target video clip of the current operation based on the hidden state feature of the current operation and the weighted video clip feature of the current operation.
Optionally, if the video generation request includes the video duration of the target video, the video generation module 120 is further configured to, after screening out the target video segment at the current time:
determining the total time length of each screened target video clip;
if the total duration is less than the video duration, performing the screening operation of the target video clip at the next moment;
and if the total time length is not less than the video time length, finishing the screening operation of the target video clip.
Optionally, if each target video segment is a target video segment screened by each operation, the video generation module 120 is specifically configured to:
and fusing the target video clips based on the screening sequence of the target video clips to obtain the target video.
Optionally, when the video generation module 120 generates the target video based on each target video segment, it is specifically configured to:
screening each target video clip based on the correlation degree between the video clips in each target video clip;
and generating a target video based on the screened target video clips.
Optionally, when the video generation module 120 filters each target video segment based on the correlation between the video segments in each target video segment, the video generation module is specifically configured to:
determining a reference video clip in each target video clip;
taking the reference video clip as an initial screened target video clip, and repeatedly executing the following operations on all video clips to be screened except the screened target video clip in all target video clips until all screened target video clips are determined:
and respectively determining the correlation degrees of the screened target video segments and the video segments to be screened, and determining the video segment to be screened corresponding to the minimum correlation degree smaller than a set value in the correlation degrees as a new screened video segment.
Optionally, when generating the target video based on the target video segments, the video generation module 120 is specifically configured to:
determining the action type of the action behavior contained in each target video segment;
and generating the target video based on the target video segments and the action types corresponding to the target video segments, so as to show the user the action type corresponding to each target video segment.
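As a loose illustration of this last point, the sketch below simply pairs each selected segment with an action-type label so that a later rendering step could display it; `classify_action` and the segment dictionaries are hypothetical stand-ins and not part of the present application.

```python
from typing import Callable, Dict, List

def label_segments(segments: List[Dict], classify_action: Callable[[Dict], str]) -> List[Dict]:
    """Attach the action type contained in every target segment so the generated video
    can show the label (e.g. as an overlay or chapter title) to the user."""
    return [dict(seg, action_type=classify_action(seg)) for seg in segments]

# Hypothetical usage with a trivial classifier keyed on precomputed tags.
clips = [{"start": 0.0, "end": 4.2, "tag": "jump"}, {"start": 9.0, "end": 12.5, "tag": "spin"}]
print(label_segments(clips, classify_action=lambda seg: seg["tag"]))
```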
Based on the same principle as the video generation method provided by the embodiments of the present application, an embodiment of the present application further provides an electronic device, which includes a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to perform, when executing the computer program, the method provided in any optional embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, implements the method provided in any optional embodiment of the present application.
Fig. 21 is a schematic structural diagram of an electronic device to which the method provided in the embodiments of the present application is applicable. As shown in Fig. 21, the electronic device 4000 includes a processor 4001 and a memory 4003, where the processor 4001 is connected to the memory 4003, for example, via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 4002 may include a path for transferring information between the above components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 21, but this does not mean that there is only one bus or only one type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 4003 is used for storing application code for executing the solution of the present application, and the execution is controlled by the processor 4001. The processor 4001 is configured to execute the application code stored in the memory 4003 to implement the contents of any of the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (20)

1. A video generation method, comprising the steps of:
extracting an intention feature of a video generation request;
and generating a target video based on the intention feature and candidate videos.
2. The method of claim 1, wherein the intention feature comprises an action behavior intention feature.
3. The method of claim 1 or 2, wherein the generating a target video based on the intention feature and candidate videos comprises:
respectively extracting video features of each candidate video;
determining candidate video segments in each candidate video based on the video features of each candidate video;
screening the candidate video segments based on the intention feature to obtain target video segments;
and generating the target video based on the target video segments.
4. The method of claim 3, wherein, for one of the candidate videos, determining candidate video segments in the candidate video based on the video features of the candidate video comprises at least one of:
performing object recognition on the candidate video based on the video features of the candidate video to obtain first candidate video segments each containing a target object;
and performing action behavior recognition on the candidate video based on the video features of the candidate video to obtain time period information corresponding to each video segment in the candidate video that contains an action behavior, and obtaining second candidate video segments based on the pieces of time period information.
5. The method of claim 4, wherein, if the target video segments include a first candidate video segment and a second candidate video segment belonging to the same candidate video, the generating the target video based on the target video segments comprises:
fusing the first candidate video segment and the second candidate video segment belonging to the same candidate video among the target video segments;
and generating the target video based on the fused video segments.
6. The method according to claim 4 or 5, wherein the obtaining second candidate video segments based on the pieces of time period information comprises:
determining a segment duration of the video segment corresponding to each piece of time period information;
determining a level of each piece of time period information based on the segment duration corresponding to the piece of time period information;
and determining the second candidate video segments from the video segments corresponding to the pieces of time period information based on the levels of the pieces of time period information.
7. The method of claim 6, wherein the determining the second candidate video segments from the video segments corresponding to the pieces of time period information based on the levels of the pieces of time period information comprises:
determining a target level among the levels;
and determining the video segments corresponding to the pieces of time period information belonging to the target level as the second candidate video segments.
8. The method of claim 7, wherein, if the video generation request includes a video duration of the target video, the determining a target level among the levels comprises:
determining the target level among the levels based on the video duration and a duration threshold corresponding to each level.
9. The method according to any one of claims 6 to 8, wherein, for one piece of time period information, determining the level of the piece of time period information based on the segment duration corresponding to the piece of time period information comprises at least one of:
determining a duration interval to which the segment duration corresponding to the piece of time period information belongs, and determining the level corresponding to the duration interval as the level of the piece of time period information, wherein each level corresponds to a respective duration interval;
and determining the level corresponding to the duration threshold closest to the segment duration corresponding to the piece of time period information as the level of the piece of time period information.
10. The method according to claim 9, wherein two adjacent levels correspond to a common transition duration interval, and, for one piece of time period information, the determining a duration interval to which the segment duration corresponding to the piece of time period information belongs and determining the level corresponding to the duration interval as the level of the piece of time period information comprises:
if the segment duration corresponding to the piece of time period information falls within the transition duration interval of two adjacent levels, determining the two adjacent levels corresponding to the transition duration interval as the levels of the piece of time period information.
11. The method according to any one of claims 8 to 10, wherein the determining the target level among the levels based on the video duration and the duration threshold corresponding to each level comprises:
determining a video limit duration based on the video duration;
and determining the target level among the levels according to the video limit duration and the duration threshold corresponding to each level.
12. The method according to claim 11, wherein the levels are ordered from high to low, and the duration threshold corresponding to a current level is not less than the duration threshold corresponding to a next level;
the determining the target level among the levels according to the video limit duration and the duration threshold corresponding to each level comprises at least one of the following:
sequentially comparing, in order of the levels from high to low, the video limit duration with the duration threshold corresponding to the current level until the video limit duration is not less than the duration threshold corresponding to the current level, and determining the current level as the target level;
sequentially performing the following processing, in order of the levels from high to low, until the target level is determined:
if the video limit duration is not less than the duration threshold corresponding to the current level, determining the current level as the target level; and if the video limit duration is less than the duration threshold corresponding to the current level, determining the target level or entering the processing of the next level according to a first number of pieces of time period information belonging to the current level whose corresponding segment durations are not greater than the video limit duration and a second number of pieces of time period information belonging to the next level whose corresponding segment durations are not greater than the video limit duration.
13. The method of claim 12, wherein the determining the target level or entering the processing of the next level according to the first number and the second number comprises:
if the first number is not less than the second number, determining the current level as the target level;
if the first number is smaller than the second number and the next level is the last level, determining the next level as the target level;
and if the first number is smaller than the second number and the next level is not the last level, entering the processing of the next level.
14. The method according to any one of claims 2 to 13, wherein the screening the candidate video segments based on the intention feature to obtain the target video segments comprises:
acquiring a video segment feature of each candidate video segment;
respectively determining a relevance between the intention feature and the video segment feature of each candidate video segment;
and screening the candidate video segments based on the relevances corresponding to the candidate video segments to obtain the target video segments.
15. The method according to any one of claims 2 to 13, wherein the screening the candidate video segments based on the intention feature to obtain the target video segments comprises:
acquiring a video segment feature of each candidate video segment;
determining a weight of each to-be-processed video segment based on the intention feature and the video segment feature of each to-be-processed video segment, wherein the to-be-processed video segments are the candidate video segments, or are video segments obtained by screening the candidate video segments based on the relevance between the intention feature and the video segment feature of each candidate video segment;
and screening out the target video segments based on the video segment features and the weights of the to-be-processed video segments.
16. The method according to claim 15, wherein the determining a weight of each to-be-processed video segment based on the intention feature and the video segment feature of each to-be-processed video segment, and screening out the target video segments based on the video segment features and the weights of the to-be-processed video segments, comprises:
sequentially performing the following operations based on the intention feature and the video segment features of the to-be-processed video segments, to obtain a target video segment corresponding to each operation:
determining the weight of each to-be-processed video segment of a current operation based on the intention feature, the video segment feature of each to-be-processed video segment, and the weights of the to-be-processed video segments determined by at least one operation before the current operation;
and screening out the target video segment corresponding to the current operation based on the video segment features of the to-be-processed video segments, the weights of the to-be-processed video segments of the current operation, and the target video segment determined by at least one operation before the current operation.
17. The method according to any one of claims 2 to 16, wherein the generating the target video based on the target video segments comprises:
screening the target video segments based on the correlation between the video segments among the target video segments;
and generating the target video based on the screened target video segments.
18. A video generation apparatus, comprising:
an intention extraction module configured to extract an intention feature of a video generation request;
and a video generation module configured to generate a target video based on the intention feature and candidate videos.
19. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor, when executing the computer program, is configured to perform the method of any of claims 1 to 17.
20. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 17.
CN202010006953.4A 2020-01-03 2020-01-03 Video generation method and device, electronic equipment and computer readable storage medium Pending CN113079420A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202010006953.4A CN113079420A (en) 2020-01-03 2020-01-03 Video generation method and device, electronic equipment and computer readable storage medium
KR1020200058449A KR20210087861A (en) 2020-01-03 2020-05-15 Video generating apparatus and method for generating video thereby
PCT/KR2021/000010 WO2021137671A1 (en) 2020-01-03 2021-01-04 Video generation apparatus and video generation method performed by the video generation apparatus
US17/140,732 US20210210119A1 (en) 2020-01-03 2021-01-04 Video generation apparatus and video generation method performed by the video generation apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010006953.4A CN113079420A (en) 2020-01-03 2020-01-03 Video generation method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113079420A true CN113079420A (en) 2021-07-06

Family

ID=76608816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010006953.4A Pending CN113079420A (en) 2020-01-03 2020-01-03 Video generation method and device, electronic equipment and computer readable storage medium

Country Status (2)

Country Link
KR (1) KR20210087861A (en)
CN (1) CN113079420A (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006217045A (en) * 2005-02-01 2006-08-17 Olympus Corp Index image generator and generation program
US8208067B1 (en) * 2007-07-11 2012-06-26 Adobe Systems Incorporated Avoiding jitter in motion estimated video
JP2012019305A (en) * 2010-07-07 2012-01-26 Nippon Telegr & Teleph Corp <Ntt> Video summarization device, video summarization method and video summarization program
US20120219064A1 (en) * 2011-02-24 2012-08-30 Qualcomm Incorporated Hierarchy of motion prediction video blocks
CN102543136A (en) * 2012-02-17 2012-07-04 广州盈可视电子科技有限公司 Method and device for clipping video
CN102890700A (en) * 2012-07-04 2013-01-23 北京航空航天大学 Method for retrieving similar video clips based on sports competition videos
WO2014176470A1 (en) * 2013-04-26 2014-10-30 Microsoft Corporation Video service with automated video timeline curation
CA2924065A1 (en) * 2013-09-13 2015-03-19 Arris Enterprises, Inc. Content based video content segmentation
CN105630833A (en) * 2014-11-08 2016-06-01 李福霞 Video information slice query method
CN108509465A (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 A kind of the recommendation method, apparatus and server of video data
US10509966B1 (en) * 2017-08-16 2019-12-17 Gopro, Inc. Systems and methods for creating video summaries
CN108664931A (en) * 2018-05-11 2018-10-16 中国科学技术大学 A kind of multistage video actions detection method
CN108830208A (en) * 2018-06-08 2018-11-16 Oppo广东移动通信有限公司 Method for processing video frequency and device, electronic equipment, computer readable storage medium
CN109195011A (en) * 2018-10-25 2019-01-11 腾讯科技(深圳)有限公司 A kind of method for processing video frequency, device, equipment and storage medium
CN110166827A (en) * 2018-11-27 2019-08-23 深圳市腾讯信息技术有限公司 Determination method, apparatus, storage medium and the electronic device of video clip
CN109740499A (en) * 2018-12-28 2019-05-10 北京旷视科技有限公司 Methods of video segmentation, video actions recognition methods, device, equipment and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HRISHIKESH BHAUMIK: "Real-Time Storyboard Generation in Videos Using a Probability Distribution Based Threshold", 2015 Fifth International Conference on Communication Systems and Network Technologies, 4 April 2015 (2015-04-04) *
JIAO Yifan (焦一凡): "Video Highlight Detection Based on Deep Learning", China Master's Theses Full-text Database, 15 June 2019 (2019-06-15) *
WANG Han (王晗): "Extracting Video Highlights Targeted at User Interests", Journal of Image and Graphics, vol. 23, no. 5, 15 June 2018 (2018-06-15) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113556577A (en) * 2021-07-21 2021-10-26 北京字节跳动网络技术有限公司 Video generation method and device
CN113556577B (en) * 2021-07-21 2022-09-09 北京字节跳动网络技术有限公司 Video generation method and device
WO2024031882A1 (en) * 2022-08-08 2024-02-15 珠海普罗米修斯视觉技术有限公司 Video processing method and apparatus, and computer readable storage medium
CN116389853A (en) * 2023-03-29 2023-07-04 阿里巴巴(中国)有限公司 Video generation method
CN116389853B (en) * 2023-03-29 2024-02-06 阿里巴巴(中国)有限公司 Video generation method

Also Published As

Publication number Publication date
KR20210087861A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
Deng et al. Image aesthetic assessment: An experimental survey
CN108986186B (en) Method and system for converting text into video
US10459975B1 (en) Method and system for creating an automatic video summary
Xu et al. Geolocalized modeling for dish recognition
US10410679B2 (en) Producing video bits for space time video summary
CN113569088B (en) Music recommendation method and device and readable storage medium
AU2021231754A1 (en) Systems and methods for automating video editing
US20170337271A1 (en) Visual search and retrieval using semantic information
CN113079420A (en) Video generation method and device, electronic equipment and computer readable storage medium
Sreeja et al. Towards genre-specific frameworks for video summarisation: A survey
JP2011215963A (en) Electronic apparatus, image processing method, and program
US20210103615A1 (en) Adaptive search results for multimedia search queries
Li et al. Fast a3rl: Aesthetics-aware adversarial reinforcement learning for image cropping
Tu et al. An intelligent personalized fashion recommendation system
Li et al. A deep reinforcement learning framework for Identifying funny scenes in movies
CN112016406A (en) Video key frame extraction method based on full convolution network
US20210210119A1 (en) Video generation apparatus and video generation method performed by the video generation apparatus
Bose et al. Movieclip: Visual scene recognition in movies
Meena et al. A review on video summarization techniques
Narwal et al. A comprehensive survey and mathematical insights towards video summarization
Dimitrova Context and memory in multimedia content analysis
CN114078223A (en) Video semantic recognition method and device
Mundnich et al. Audiovisual highlight detection in videos
Yu et al. A multi-modal deep learning model for video thumbnail selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination