WO2021190078A1 - Method and apparatus for generating short video, and related device and medium - Google Patents


Info

Publication number
WO2021190078A1
WO2021190078A1 · PCT/CN2021/070391 · CN2021070391W
Authority
WO
WIPO (PCT)
Prior art keywords
video
category
probability
semantic
segment
Prior art date
Application number
PCT/CN2021/070391
Other languages
French (fr)
Chinese (zh)
Inventor
亢治
胡康康
李超
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2021190078A1 publication Critical patent/WO2021190078A1/en


Classifications

    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/23424: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • G06F 18/00: Pattern recognition

Definitions

  • this application provides a short video generation method.
  • the device for generating a short video obtains a target video, where the target video includes multiple frames of video images; determines at least one video segment in the target video through semantic analysis; and obtains the start and end times of the at least one video segment and the probability of the semantic category to which it belongs. A video segment includes consecutive frames of video images, the number of frames in a video segment can be equal to or less than the number of frames in the target video, and a video segment belongs to one or more semantic categories, that is, the consecutive frames of video images it includes belong to one or more semantic categories. Then, according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs, a segment for short video generation is selected from the at least one video segment, and the short video is synthesized.
  • in this scenario, the scene category probability output by the video semantic analysis model can be the scene category probability of each frame of video image within the start and end times, or the scene category probability of every frame of the target video.
  • the recognition paths for scene categories and behavior categories are separated. Scene categories, whether relatively static or dynamic in appearance, are recognized with the conventional single-frame image recognition method, that is, through CNN 10 and the second fully connected layer 50 alone, while FPN 20, SPN 30 and the first fully connected layer 40 focus on recognizing dynamic behavior categories. In this way, each network handles the processing direction it is suited to: static scene categories are still added to the output result, while computation time is saved and recognition accuracy is improved.
  • S103 According to the start and end time of the at least one video segment and the probability of the semantic category to which it belongs, generate a short video corresponding to the target video from the at least one video segment.
  • the short video generating device may also first obtain the topic keyword entered by the user or found in the historical records, match the semantic category of the at least one video clip against the topic keyword, determine the video segments whose matching degree meets a threshold as topic video segments, and then generate a short video corresponding to the target video from the at least one topic video segment (a minimal sketch follows).
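As a rough illustration of this keyword matching, the following Python sketch selects topic video segments whose semantic category matches a topic keyword. It is not taken from the patent; the clip structure, the string-similarity measure, and the 0.6 threshold are all illustrative assumptions.

```python
# Hypothetical sketch of topic-keyword matching; the clip structure,
# similarity measure, and 0.6 threshold are illustrative assumptions.
from difflib import SequenceMatcher

def select_topic_clips(clips, topic_keyword, threshold=0.6):
    """clips: list of dicts like {"start": 3.0, "end": 8.5, "categories": ["swimming"]}."""
    topic_clips = []
    for clip in clips:
        # Matching degree = best string similarity over the clip's semantic categories.
        degree = max(SequenceMatcher(None, cat, topic_keyword).ratio()
                     for cat in clip["categories"])
        if degree >= threshold:
            topic_clips.append(clip)
    return topic_clips
```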
  • the target video may be subjected to Kernel Temporal Segmentation (KTS).
  • KTS is a change point detection algorithm based on the kernel method. It detects the jump point in the signal by focusing on the consistency of one-dimensional signal characteristics, and can distinguish whether the signal jump is caused by noise or content change.
  • KTS can perform statistical analysis on the feature data of each frame of the input target video to detect signal transition points, thereby dividing the video into clips of different content: the target video is divided into several non-overlapping segments, and the start and end times of at least one segment are obtained. Then, combined with the start and end times of the at least one video segment, at least one overlapping segment between each video segment and each divided segment is determined.
  • a summary video segment can be determined from at least one overlapping segment to generate a short video corresponding to the target video.
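The intersection step described above can be pictured with a short sketch. This is not the patent's implementation; KTS itself is assumed to exist elsewhere, and `kts_boundaries` simply stands for its detected change points (in seconds).

```python
# Illustrative sketch: intersect semantically detected segments with the
# non-overlapping KTS segments to obtain the overlapping segments.
def overlapping_segments(video_segments, kts_boundaries):
    """video_segments: list of (start, end) pairs; kts_boundaries: sorted change points."""
    kts_segments = list(zip(kts_boundaries[:-1], kts_boundaries[1:]))
    overlaps = []
    for vs_start, vs_end in video_segments:
        for ks_start, ks_end in kts_segments:
            start, end = max(vs_start, ks_start), min(vs_end, ks_end)
            if start < end:  # keep non-empty intersections only
                overlaps.append((start, end))
    return overlaps

# overlapping_segments([(2.0, 9.0)], [0.0, 5.0, 12.0]) -> [(2.0, 5.0), (5.0, 9.0)]
```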
  • the probability of belonging to a semantic category includes the probability of belonging to a behavior category and the probability of belonging to a scene category. Because the behavior category probability applies to a whole video clip, while the scene category probability applies to each frame of video image in a video segment, the two probabilities can be integrated before the summary video segment is selected. In other words, according to the start and end times of each video segment and its behavior category probability, together with the scene category probability of each frame of video image in each video segment, the average category probability of the at least one video segment can be determined first; then, according to the average category probability of the at least one video segment, a short video corresponding to the target video is generated from the at least one video segment.
  • the short video generation device can determine the multi-frame video images and the frame count corresponding to a video segment according to the segment's start and end times, and take the behavior category probability of the video segment as the behavior category probability of each frame of video image in those multi-frame video images; that is, the behavior category probability of every frame in the video clip is consistent with the behavior category probability of the entire video clip.
  • the scene category probability of each frame of video image in the multi-frame video images output by the video semantic analysis model is obtained; for each frame, the behavior category probability is added to the corresponding scene category probability, and the sum over all frames is divided by the frame count to obtain the average category probability of the video segment. Following this method, the average category probability of the at least one video segment is finally determined (a sketch follows).
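A minimal sketch of this averaging, under the assumption that the clip-level behavior probability is simply broadcast to every frame as described above:

```python
import numpy as np

def average_category_probability(behavior_prob, scene_probs):
    """behavior_prob: clip-level behavior category probability (float);
    scene_probs: per-frame scene category probabilities for the clip."""
    scene_probs = np.asarray(scene_probs, dtype=float)
    num_frames = len(scene_probs)
    # Each frame contributes (behavior probability + its scene probability);
    # the total is divided by the frame count.
    return float((behavior_prob * num_frames + scene_probs.sum()) / num_frames)

# average_category_probability(0.9, [0.6, 0.8, 0.7]) -> 0.9 + 0.7 = 1.6
```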
  • the short video generating device can sort the segments by average category probability and either automatically determine the summary video segment or take a user-specified summary video segment, and then synthesize a short video based on the summary video clip.
  • the specific details are similar to the two implementations in the first scenario; reference may be made to the above description, and details are not repeated here.
  • subsequent operations can also be performed based on the above-mentioned overlapping segments after the KTS segmentation, which will not be repeated here.
  • the embodiment of the present application uses the video semantic analysis model to identify video clips with one or more semantic categories in the target video, so as to directly extract the continuous video clips that best reflect the target video content and synthesize them into a short video. This not only takes into account the continuity of content between frames in the target video and improves the presentation effect of the short video, so that the short video content better meets users' actual needs, but also improves the generation efficiency of the short video.
  • FIG. 11 is a schematic flowchart of another short video generation method provided by an embodiment of the present application. The method includes but is not limited to the following steps:
  • S202 Obtain the start and end time, the semantic category and the probability of the semantic category of at least one video segment in the target video through semantic analysis.
  • for the specific implementation of S201-S202, refer to the description of S101-S102. The difference is that S102 may output only the probability of the semantic category, while S202 outputs both the semantic category and the probability of the semantic category; details are not repeated here.
  • S203 Determine the interest category probability of at least one video clip according to the probability of the semantic category to which each video clip belongs and the category weight corresponding to the semantic category to which it belongs.
  • the category weights can be used to characterize the user's interest in the respective semantic categories.
  • a semantic category with a high occurrence count in the local database indicates that the user has stored many images or videos of that category, that is, the user is more interested in it, so a higher category weight can be set; similarly, the more images or videos of a semantic category are viewed in the historical operation records, the more attention the user pays to that category, and a higher category weight can likewise be set.
  • the corresponding category weights can be determined for various semantic categories in advance, and then the category weights corresponding to the semantic categories of each video clip can be directly called.
  • the category weights corresponding to various semantic categories can be determined through the following steps:
  • Step 1 Obtain the media data information in the local database and historical operation records.
  • the local database may be a storage space for storing or processing various types of data, or a dedicated database dedicated to storing media data (pictures, videos, etc.), such as a gallery.
  • Historical operation records refer to records generated by users of various operations (browsing, moving, editing, etc.) of data, such as local log files.
  • media data information refers to various kinds of information about images and videos; it can include the images and videos themselves, the feature information of images and videos, the operation information of images and videos, and statistics of the various categories of images and videos, etc.
  • Step 2 Determine the category weights corresponding to various semantic categories of the media data according to the media data information.
  • for example, for the "playing" category, the statistics may show a browsing frequency of 2 times/day, an editing frequency of 1 time/day, a sharing frequency of 0.5 times/day, a browsing duration of 20 hours and an editing duration of 40 hours; the overall operating frequency of the "playing" category is then 3.5 times/day and the operating duration is 60 hours.
  • the category weight corresponding to each semantic category is calculated according to the preset weight formula, combined with the number of occurrences, operation duration, and operation frequency of each semantic category.
  • the preset weight formula can reflect that the greater the number of occurrences, the operation duration, and the operation frequency, the higher the category weight of the semantic category to which it belongs.
  • count_freq_i, view_freq_i, view_time_i, share_freq_i and edit_freq_i are, respectively, the number of occurrences, browsing frequency, browsing duration, sharing frequency and editing frequency of semantic category i in the local database and historical operation records; the corresponding totals are taken over all h semantic categories identified in the local database and historical operation records.
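The exact preset weight formula is not reproduced in this text; one plausible form consistent with the variables just listed, in which each statistic of category i is normalized by its total over all h identified categories, would be:

```latex
w_i = \frac{count\_freq_i}{\sum_{j=1}^{h} count\_freq_j}
    + \frac{view\_freq_i}{\sum_{j=1}^{h} view\_freq_j}
    + \frac{view\_time_i}{\sum_{j=1}^{h} view\_time_j}
    + \frac{share\_freq_i}{\sum_{j=1}^{h} share\_freq_j}
    + \frac{edit\_freq_i}{\sum_{j=1}^{h} edit\_freq_j}
```

Any formula of this shape satisfies the stated property that larger occurrence counts, operation durations and operation frequencies yield a higher category weight.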
  • for each video segment, there can be one or more semantic categories.
  • when a video segment belongs to a single semantic category, the category weight of that semantic category can be determined, and the product of the category weight and the probability of belonging to that semantic category is used as the interest category probability of the video segment.
  • when a video segment belongs to multiple semantic categories, the category weight of each semantic category can be determined separately, and the products of each category weight and the corresponding probability can be calculated and summed to obtain the interest category probability of the video clip.
  • for example, if the semantic categories of video clip A include category 1 and category 2, the probability of category 1 is P_1, the probability of category 2 is P_2, and the category weights corresponding to category 1 and category 2 are w_1 and w_2, then the interest category probability of video clip A is P_w = P_1*w_1 + P_2*w_2.
  • the semantic categories can include many categories; as mentioned above, multiple categories can also be grouped into several major categories, so weights for the major categories can also be set. For example, the smile, cry and angry categories can all be regarded as expression (face) categories, while the swimming, running and playing categories can all be regarded as behavior categories.
  • specifically, different weights can be set for the face and behavior major categories.
  • the specific setting method can be adjusted by the user, or the weight of the major categories can be further determined according to the above-mentioned local database and historical operation records. Since the principle of the method is similar, it will not be repeated here.
  • the short video generation device may first determine, for each frame of video image in each video segment, the category weights corresponding to the scene category probability and the behavior category probability according to the above method, sum the products of each probability and its category weight to obtain the weighted probability of each frame, and then divide the sum of the frames' weighted probabilities by the frame count to obtain the interest category probability of the video segment (see the sketch below).
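The following sketch (illustrative only; the dictionary shapes are assumptions) shows both computations: the clip-level weighted sum P_w = Σ P_k·w_k, and the per-frame variant averaged over the frame count:

```python
import numpy as np

def clip_interest_probability(category_probs, category_weights):
    """category_probs: {semantic category: probability} for one clip;
    category_weights: {semantic category: weight}."""
    # P_w = sum over the clip's semantic categories of probability * weight.
    return sum(p * category_weights[c] for c, p in category_probs.items())

def frame_averaged_interest_probability(per_frame_probs, category_weights):
    """per_frame_probs: one {category: probability} dict per frame of the clip."""
    weighted = [clip_interest_probability(fp, category_weights) for fp in per_frame_probs]
    return float(np.mean(weighted))  # summed weighted probabilities / frame count

# clip_interest_probability({"cat1": 0.8, "cat2": 0.5}, {"cat1": 0.6, "cat2": 0.2}) -> 0.58
```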
  • S204 Determine a short video corresponding to the target video from the at least one video segment according to the start and end time of the at least one video segment and the interest category probability.
  • S204 is similar to the two implementations of the first possible implementation scenario in S103; the difference is that S103 is based on the probability of the semantic category to which a segment belongs, while S204 is based on the interest category probability. For the specific implementation, refer to S103; details are not repeated here.
  • subsequent operations can also be performed based on the above-mentioned overlapping segments after the KTS segmentation, which will not be repeated here.
  • the interest category probability in S204 comprehensively describes two dimensions of a video clip: its importance and the user's interest in it. Therefore, after sorting, summary video clips can be further selected so as to present video clips that are as important and as interesting to the user as possible.
  • the embodiments of this application further analyze user preferences based on the local database and historical operation records, so that the video clips selected for synthesizing short videos are more targeted and better match user interests, yielding short videos personalized to each user.
  • FIG. 12 shows a schematic structural diagram of a terminal device 100 as the device for generating a short video.
  • the terminal device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the terminal device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the terminal device 100.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching instructions and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the system is improved.
  • the processor 110 may include one or more interfaces.
  • the interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the terminal device 100.
  • the terminal device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charge management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
  • the wireless communication function of the terminal device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • the terminal device 100 implements a display function through a GPU, a display screen 194, and an application processor.
  • the GPU is an image processing microprocessor, which is connected to the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like.
  • the terminal device 100 may include one or N display screens 194, and N is a positive integer greater than one.
  • the terminal device 100 can implement a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
  • the ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transmits the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
  • ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the camera 193 includes a camera that collects images required for face recognition, such as an infrared camera or other cameras.
  • the camera that collects the image required for face recognition is generally located on the front of the terminal device, for example, above the touch screen, and may also be located at other positions, which is not limited in the embodiment of the present invention.
  • the terminal device 100 may include other cameras.
  • the terminal device may also include a dot matrix transmitter (not shown in the figure) for emitting light.
  • the camera collects the light reflected by the face to obtain a face image, and the processor processes and analyzes the face image, and compares it with the stored face image information for verification.
  • the digital signal processor is used to process digital signals. In addition to digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • through the NPU, applications such as intelligent cognition of the terminal device 100 can be implemented, for example image recognition, face recognition, speech recognition, text understanding, and so on.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the terminal device 100 by running instructions stored in the internal memory 121.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application required for a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.) and so on.
  • the storage data area can store data created during the use of the terminal device 100 (such as face information template data, fingerprint information template, etc.) and the like.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the terminal device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the speaker 170A also called “speaker” is used to convert audio electrical signals into sound signals.
  • the receiver 170B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • the microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • the earphone interface 170D is used to connect wired earphones.
  • the earphone interface 170D may be a USB interface 130, or a 3.5mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association (cellular telecommunications industry association of the USA, CTIA) standard interface.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A may be provided on the display screen 194.
  • the gyro sensor 180B may be used to determine the movement posture of the terminal device 100.
  • the angular velocity of the terminal device 100 around three axes (i.e., the x, y, and z axes) can be determined by the gyro sensor 180B.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode.
  • the light emitting diode may be an infrared light emitting diode.
  • the ambient light sensor 180L is used to sense the brightness of the ambient light.
  • the terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived brightness of the ambient light.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the terminal device 100 can use the collected fingerprint characteristics to realize fingerprint unlocking, access application lock, fingerprint photography, fingerprint answering calls, and so on.
  • the fingerprint sensor 180H can be arranged under the touch screen; the terminal device 100 can receive a user's touch operation on the touch screen in the area corresponding to the fingerprint sensor, and in response to the touch operation can collect the fingerprint information of the user's finger, so as to realize the fingerprint recognition involved in the embodiments of this application: opening a hidden album after fingerprint verification passes, opening a hidden application after fingerprint verification passes, logging in to an account after fingerprint verification passes, completing a payment after fingerprint verification passes, and so on.
  • the temperature sensor 180J is used to detect temperature.
  • the terminal device 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K may be disposed on the display screen 194, and the touch screen is composed of the touch sensor 180K and the display screen 194, which is also called a “touch screen”.
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the terminal device 100, which is different from the position of the display screen 194.
  • the button 190 includes a power-on button, a volume button, and so on.
  • the button 190 may be a mechanical button. It can also be a touch button.
  • the terminal device 100 may receive key input, and generate key signal input related to user settings and function control of the terminal device 100.
  • the indicator 192 may be an indicator light, which may be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 195 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the terminal device 100.
  • the terminal device 100 adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the terminal device 100 and cannot be separated from the terminal device 100.
  • the software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present invention takes an Android system with a layered architecture as an example to illustrate the software structure of the terminal device 100 by way of example.
  • FIG. 13 is a block diagram of the software structure of the terminal device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application packages may include applications (also called apps) such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, and a notification manager.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include videos, images, audios, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the terminal device 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • the notification manager can also be a notification that appears in the status bar at the top of the system in the form of a chart or a scroll bar text, such as a notification of an application running in the background, or a notification that appears on the screen in the form of a dialogue interface.
  • for example, text messages are prompted in the status bar, a prompt tone is sounded, the terminal device vibrates, or the indicator light flashes.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the set of functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support multiple audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • FIG. 14 shows a schematic structural diagram of the server 200 as the device for generating short videos.
  • server 200 may have more or fewer components than shown in the figure, may combine two or more components, or may have different component configurations.
  • the various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the server 200 may include a processor 210 and a memory 220, and the processor 210 may be connected to the memory 220 through a bus.
  • the processor 210 may include one or more processing units.
  • the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the server 200.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 210 for storing instructions and data.
  • the memory in the processor 210 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 210. If the processor 210 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 210 is reduced, and the efficiency of the system is improved.
  • the processor 210 may include one or more interfaces.
  • the interface may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, and a universal asynchronous transmitter/receiver (universal asynchronous) interface. receiver/transmitter, UART) interface, and/or universal serial bus (universal serial bus, USB) interface, etc.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the server 200.
  • the server 200 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the server 200 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the server 200 may support one or more video codecs. In this way, the server 200 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network (NN) computing processor.
  • through the NPU, applications such as intelligent cognition of the server 200 can be realized, for example image recognition, face recognition, speech recognition, text understanding, and so on.
  • the memory 220 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 210 executes various functional applications and data processing of the server 200 by running instructions stored in the memory 220.
  • the memory 220 may include a program storage area and a data storage area.
  • the storage program area can store an operating system, at least one application required for a function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, etc.) and so on.
  • the storage data area can store data created during the use of the server 200 (such as face information template data, fingerprint information template, etc.) and the like.
  • the memory 220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the aforementioned server 200 may also be a virtualized server, that is, multiple virtualized logical servers run on the server 200, and each logical server can rely on the software, hardware and other components of the server 200 to implement the same data storage and processing functions.
  • FIG. 15 is a schematic structural diagram of a short video generating apparatus 300 in an embodiment of the application.
  • the short video generating apparatus 300 may be applied to the aforementioned terminal device 100 or server 200.
  • the device 300 for generating a short video may include:
  • the video acquisition module 310 is used to acquire the target video
  • the video analysis module 320 is configured to obtain, through semantic analysis, the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs, where each video segment belongs to one or more semantic categories;
  • the short video generation module 330 is configured to generate a short video corresponding to the target video from the at least one video segment according to the start and end time of the at least one video segment and the probability of the semantic category to which it belongs.
  • the target video includes m frames of video images, and the m is a positive integer; the video analysis module 320 is specifically configured to:
  • the probability of the semantic category includes the probability of the behavior category and the probability of the scene category;
  • the target video includes m frames of video images, and the m is a positive integer;
  • the video analysis module 320 is specifically used for:
  • the short video generation module 330 is specifically configured to:
  • Determining the probability of the behavior category of the video clip as the probability of the behavior category of each frame of video image in the video clip
  • the sum of the probability of the behavior category and the probability of the scene category of each frame of the video image in the multi-frame video image is divided by the number of frames to obtain the average category probability of the video segment.
  • a short video corresponding to the target video is generated from the at least one video segment.
  • the weight of the category corresponding to each semantic category is calculated.
  • the program can be stored in a computer-readable storage medium, and when the program is executed, the procedures of the above-mentioned method embodiments may be included.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Abstract

Provided are a method and apparatus for generating a short video, and a related device and medium. The method comprises: acquiring a target video, and obtaining, by means of semantic analysis, the start and end times of at least one video clip in the target video and the probability of the semantic category to which the video clip belongs, wherein each video clip belongs to one or more semantic categories; and then, according to the start and end times of the at least one video clip and the probability of the semantic category to which the clip belongs, generating, from the at least one video clip, a short video corresponding to the target video. Video clips belonging to one or more semantic categories in a target video are identified by means of semantic analysis, so as to directly extract video clips that best reflect the content of the target video and are continuous, and compose them into a short video; in this way, not only is the continuity of content between frames in the target video considered, but the efficiency of generating the short video is also improved.

Description

Method and apparatus for generating a short video, and related device and medium
Cross-reference to related applications
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 26, 2020, with application number 202010223607.1 and entitled "Method and apparatus for generating a short video, and related device and medium", the entire content of which is incorporated in this application by reference.
Technical field
This application relates to video processing technology, and in particular to a method and apparatus for generating a short video, and a related device and medium.
Background
With the continuous optimization of the camera effects of terminal devices, the continuous development of new-media social platforms, and the increasing speed of mobile networks, more and more people like to share their daily lives through short videos. Unlike traditional videos, which are long, short videos generally last only a few seconds to a few minutes; they therefore feature low production cost, fast spread, and strong social attributes, and are loved by a large number of users. At the same time, precisely because the duration is limited, the content of a short video must present its key points within a very short time. Therefore, people usually perform operations such as selecting and editing on long videos to generate a short video with highlighted key points.
At present, some professional video editing software can excerpt and splice videos according to user operations, and some applications can directly cut a video clip of a specified duration from a video, for example the first 10 seconds of a 1-minute video, or a 10-second clip arbitrarily selected by the user. However, of these two approaches, one is too cumbersome, requiring users to learn the software and edit by themselves, while the other is too simple and cannot extract all the highlights of the video. Therefore, a smarter way to automatically extract the key segments of a video and generate a short video is needed.
In some prior-art solutions, the importance of each frame of video image in a video is determined by identifying its feature information, and a subset of video images is then selected according to their importance to generate a short video. Although this method generates short videos intelligently, because it recognizes single frames and ignores the association between frames, the content of the short video easily becomes too fragmented and incoherent to express the thread of the video, making it difficult to meet users' actual needs for short video content. On the other hand, a target video contains a large number of redundant video images; identifying every frame one by one and then comparing them to select the important video images for synthesis leads to excessive computation time and reduces the efficiency of short video generation.
Summary
This application provides a method and apparatus for generating a short video, and a related device and medium. The method can be implemented by a short video generation device, such as a smart terminal or a server. A video semantic analysis model identifies video clips with one or more semantic categories in a target video, so as to directly extract continuous video clips that reflect the target video content for synthesizing a short video. This not only takes into account the continuity of the content between frames in the target video and improves the presentation effect of the short video, so that the short video content better meets users' actual needs, but also improves the generation efficiency of the short video.
The following introduces this application from multiple aspects; it is easy to understand that the implementations of these aspects can be cross-referenced with one another.
In a first aspect, this application provides a method for generating a short video. A short video generation device obtains a target video, where the target video includes multiple frames of video images; determines at least one video segment in the target video through semantic analysis; and obtains the start and end times of the at least one video segment and the probability of the semantic category to which it belongs. A video segment includes consecutive frames of video images, the number of frames in a video segment can be equal to or less than the number of frames in the target video, and a video segment belongs to one or more semantic categories, that is, the consecutive frames of video images it includes belong to one or more semantic categories. Then, according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs, a segment for short video generation is selected from the at least one video segment, and the short video is synthesized.
In this technical solution, semantic analysis is used to identify video segments belonging to one or more semantic categories in the target video, so that the continuous segments that best reflect the content of the target video are extracted directly to synthesize a short video; the short video can serve as a video summary, or condensed version, of the target video. This application not only takes into account the coherence of content between frames of the target video, improving the presentation of the short video so that its content better meets users' actual needs, but also improves the efficiency of short video generation.
In a possible implementation of the first aspect, the target video includes m frames of video images, where m is a positive integer. During the semantic analysis, the short video generation apparatus may specifically extract n-dimensional feature data of each frame of video image in the target video, generate an m*n video feature matrix based on the temporal order of the m frames of video images, convert the video feature matrix into multi-layer feature maps, generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature maps, determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start-end time of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs, where n is a positive integer.
In this technical solution, feature extraction converts the target video, which spans the two dimensions of time and space, into a spatial-dimension feature map that can be presented within a single video feature matrix, laying the foundation for the subsequent segmentation and selection of segments of the target video. When candidate boxes are selected, the video feature matrix takes the place of the original image, so the candidate box generation method originally used for image recognition in the spatial domain is applied to the spatio-temporal domain: instead of delineating an object region in an image, a candidate box delineates a continuous semantic feature sequence in the video feature matrix. This achieves the goal of directly identifying the video segments of the target video that contain semantic categories, without frame-by-frame recognition and screening. Compared with existing recurrent network models that chain each frame of video image in time for sequence modeling, this technical solution is simpler, computes faster, and reduces computation time and resource usage.
In a possible implementation of the first aspect, the probability of the semantic category to which a segment belongs includes a probability of a behavior category and a probability of a scene category. The target video includes m frames of video images, where m is a positive integer. During the semantic analysis, the short video generation apparatus obtains the probability of the behavior category and the probability of the scene category in two separate ways. For the probability of the behavior category, it may specifically extract n-dimensional feature data of each frame of video image in the target video, generate an m*n video feature matrix based on the temporal order of the m frames of video images, convert the video feature matrix into multi-layer feature maps, generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature maps, determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start-end time of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs, where n is a positive integer. For the probability of the scene category, the probability of the scene category of each frame of video image in the target video may be identified and output according to the n-dimensional feature data of that frame.
In this technical solution, the recognition paths for the scene category and the behavior category are separated. The probability of the scene category is obtained by conventional single-frame image recognition, which both adds scene categories to the output and lets the model focus on recognizing dynamic behavior categories, exploiting the processing strengths of the different recognition methods to save computation time and improve recognition accuracy.
In a possible implementation of the first aspect, the width of the at least one candidate box generated on the video feature matrix is fixed.
In this technical solution, because the width of the candidate boxes remains fixed, there is no need to continually adjust the search over spatial ranges of different lengths and widths; searching is required only along the length dimension, which saves search time and thus further reduces the model's computation time and resource usage.
In a possible implementation of the first aspect, the short video generation apparatus determines an average category probability of the at least one video segment according to the start-end time and the probability of the behavior category of each video segment, together with the probability of the scene category of each frame of video image in each video segment; and then generates, according to the average category probability of the at least one video segment, a short video corresponding to the target video from the at least one video segment.
In a possible implementation of the first aspect, the short video generation apparatus may calculate the average category probability for each video segment. Specifically, it may determine, according to the start-end time of the video segment, the multiple frames of video images corresponding to the segment and their frame count; take the probability of the behavior category of the video segment as the probability of the behavior category of each frame of video image in the segment; obtain the probability of the scene category of each frame of video image among the multiple frames; and divide the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the frame count to obtain the average category probability of the video segment.
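For illustration only, the following Python sketch shows one way the averaging described above could be computed; the names (segment, frame_rate, scene_probs) are hypothetical and not part of this application.

```python
def average_category_probability(segment, frame_rate, scene_probs):
    """Average category probability of one video segment (a sketch).

    segment: dict with 'start' and 'end' times in seconds and
             'behavior_prob', the segment-level behavior-category probability.
    scene_probs: per-frame scene-category probabilities for the whole
                 target video, indexed by frame number.
    """
    start_frame = int(segment['start'] * frame_rate)
    end_frame = int(segment['end'] * frame_rate)      # exclusive
    num_frames = end_frame - start_frame              # the segment's frame count

    total = 0.0
    for f in range(start_frame, end_frame):
        # each frame inherits the segment's behavior-category probability
        # and contributes its own scene-category probability
        total += segment['behavior_prob'] + scene_probs[f]
    return total / num_frames
```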
In a possible implementation of the first aspect, the short video generation apparatus determines at least one summary video segment from the at least one video segment in descending order of the probability of the semantic category and according to the start-end times, and then obtains the at least one summary video segment and synthesizes the short video corresponding to the target video.
In this technical solution, the probability of the semantic category of a video segment indicates its importance; therefore, screening the at least one video segment based on this probability allows the more important segments to be presented, as far as possible, within the preset duration of the short video.
In a possible implementation of the first aspect, the short video generation apparatus cuts out each video segment from the target video according to its start-end time; sorts and displays the segments in descending order of the probability of the semantic category; when a selection instruction for any one or more video segments is received, determines the selected segments to be summary video segments; and synthesizes, according to the at least one summary video segment, the short video corresponding to the target video.
In this technical solution, by interacting with the user, the segmented video clips are presented to the user in the order of importance reflected by the probability of the semantic category to which they belong; after the user makes a selection based on his or her own interests or needs, the corresponding short video is generated, so that the short video better meets the user's needs.
In a possible implementation of the first aspect, the short video generation apparatus may determine an interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category, and generate, according to the start-end time and the interest category probability of the at least one video segment, a short video corresponding to the target video from the at least one video segment.
In this technical solution, on the basis of ensuring the coherence of the short video content and the efficiency of short video generation, the category weight corresponding to the semantic category is further taken into account, so that the selection of video segments for synthesizing the short video can be more targeted, for example by picking out video segments of one or more specified semantic categories, satisfying more flexible and diverse user needs.
In a possible implementation of the first aspect, the short video generation apparatus may determine, from media data information in a local database and historical operation records, the category weights respectively corresponding to the various semantic categories of the media data.
In this technical solution, user preferences are analyzed from the local database and historical operation records to determine the category weight of each semantic category, so that the selection of video segments for synthesizing the short video better matches the user's interests, producing short videos personalized to each user.
In a possible implementation of the first aspect, when determining the category weight corresponding to each semantic category, the short video generation apparatus may specifically first determine the semantic categories of the videos and images in the local database and count the number of occurrences of each semantic category; then determine the semantic categories of the videos and images the user has operated on in the historical operation records and count the operation duration and operation frequency of each semantic category; and finally calculate the category weight corresponding to each semantic category from its number of occurrences, operation duration, and operation frequency.
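A minimal sketch of such a weight computation is given below, assuming the three statistics are simply normalized and averaged with equal mixing coefficients (the application does not fix a formula, so the coefficients are an assumption). The resulting weight could then multiply a segment's semantic category probability to obtain its interest category probability, for example `interest = seg_prob * weights[category]`.

```python
from collections import Counter, defaultdict

def category_weights(gallery_categories, operation_log):
    """Sketch of per-category weights from a local gallery and usage history.

    gallery_categories: list of semantic-category labels, one per stored
                        video/image (occurrence statistics).
    operation_log: iterable of (category, duration_seconds) records for media
                   the user has operated on; frequency is the record count.
    """
    occurrences = Counter(gallery_categories)
    duration = defaultdict(float)
    frequency = Counter()
    for category, seconds in operation_log:
        duration[category] += seconds
        frequency[category] += 1

    def normalize(stats):
        total = sum(stats.values()) or 1.0
        return {c: v / total for c, v in stats.items()}

    occ, dur, freq = normalize(occurrences), normalize(duration), normalize(frequency)
    categories = set(occ) | set(dur) | set(freq)
    # equal mixing of the three normalized statistics (an assumption)
    return {c: (occ.get(c, 0) + dur.get(c, 0) + freq.get(c, 0)) / 3
            for c in categories}
```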
In a possible implementation of the first aspect, the short video generation apparatus determines at least one summary video segment from the at least one video segment in descending order of interest category probability and according to the start-end times, and then obtains the at least one summary video segment and synthesizes the short video corresponding to the target video.
In this technical solution, the interest category probability of a video segment indicates both its importance and the user's degree of interest in it; therefore, screening the at least one video segment based on the interest category probability allows the segments that are more important and better match the user's interests to be presented, as far as possible, within the preset duration of the short video.
In a possible implementation of the first aspect, the sum of the segment durations of the at least one summary video segment is not greater than a preset short video duration.
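A simple way to realize the selection in the two preceding implementations is the greedy sketch below, where 'score' stands for either the semantic category probability or the interest category probability; the greedy strategy itself is an assumption made for illustration.

```python
def select_summary_segments(segments, max_duration):
    """Greedy pick of summary segments under a total-duration budget (a sketch).

    segments: list of dicts with 'start', 'end' and 'score'. Segments are
    taken in descending score order while their combined length stays within
    max_duration, then re-sorted by start time so the synthesized short
    video plays in chronological order.
    """
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s['score'], reverse=True):
        length = seg['end'] - seg['start']
        if used + length <= max_duration:
            chosen.append(seg)
            used += length
    return sorted(chosen, key=lambda s: s['start'])
```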
In a possible implementation of the first aspect, the short video generation apparatus cuts out each video segment from the target video according to its start-end time; sorts and displays the segments in descending order of interest category probability; when a selection instruction for any one or more video segments is received, determines the selected segments to be summary video segments; and synthesizes, according to the at least one summary video segment, the short video corresponding to the target video.
In this technical solution, by interacting with the user, the segmented video clips are presented to the user in the combined order of importance and interest reflected by the interest category probability; after the user makes a further selection based on his or her current interests or needs, the corresponding short video is generated, so that the short video better meets the user's immediate needs.
In a possible implementation of the first aspect, the short video generation apparatus may further perform time-domain segmentation on the target video to obtain a start-end time of at least one divided segment; determine at least one overlapping segment between the video segments and the divided segments according to the start-end times of the at least one video segment and of the at least one divided segment; and generate the short video corresponding to the target video from the at least one overlapping segment.
In this technical solution, the divided segments obtained by KTS segmentation have high internal content consistency, while the video segments identified by the video semantic analysis model are segments with semantic categories, which indicate their importance. The overlapping segments obtained by combining the two segmentation methods are therefore high in both content consistency and importance, and the combination also corrects the results of the video semantic analysis model, so that the generated short video is more coherent and better meets user needs.
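As a sketch of how the two segmentations could be combined, the following hypothetical helper intersects the semantically labeled segments with the KTS divided segments; only the interval arithmetic is shown, not the KTS algorithm itself.

```python
def overlapping_segments(semantic_segments, kts_segments):
    """Intersect semantic segments with KTS divided segments (a sketch).

    Both inputs are lists of (start, end) times in seconds; the output is
    every non-empty intersection, which inherits content consistency from
    the KTS side and importance from the semantic-analysis side.
    """
    overlaps = []
    for s_start, s_end in semantic_segments:
        for k_start, k_end in kts_segments:
            start, end = max(s_start, k_start), min(s_end, k_end)
            if start < end:                      # keep non-empty overlaps only
                overlaps.append((start, end))
    return overlaps
```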
According to a second aspect, this application provides a short video generation apparatus. The apparatus may include a video acquisition module, a video analysis module, and a short video generation module. In some implementations, the apparatus may further include an information acquisition module and a category weight determination module. Through these modules, the short video generation apparatus implements some or all of the methods provided by any implementation of the first aspect.
According to a third aspect, this application provides a terminal device. The terminal device includes a memory and a processor, where the memory is configured to store computer-readable instructions (also referred to as a computer program), and the processor is configured to read the computer-readable instructions to implement the method provided by any implementation of the first aspect.
According to a fourth aspect, this application provides a server. The server includes a memory and a processor, where the memory is configured to store computer-readable instructions (also referred to as a computer program), and the processor is configured to read the computer-readable instructions to implement the method provided by any implementation of the first aspect.
According to a fifth aspect, this application provides a computer storage medium, which may be non-volatile. The computer storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the method provided by any implementation of the first aspect is implemented.
According to a sixth aspect, this application provides a computer program product. The computer program product contains computer-readable instructions, and when the computer-readable instructions are executed by a processor, the method provided by any implementation of the first aspect is implemented.
Description of the drawings
FIG. 1 is a schematic diagram of an application scenario of a short video generation method according to an embodiment of this application;
FIG. 2 is a schematic diagram of an application environment of a short video generation method according to an embodiment of this application;
FIG. 3 is a schematic diagram of an application environment of another short video generation method according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a short video generation method according to an embodiment of this application;
FIG. 5 is a schematic diagram of a video feature matrix according to an embodiment of this application;
FIG. 6 is a schematic diagram of a model architecture of a video semantic analysis model according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a feature pyramid according to an embodiment of this application;
FIG. 8 is a schematic diagram of the principle of ResNet50 according to an embodiment of this application;
FIG. 9 is a schematic diagram of the principle of a region proposal network according to an embodiment of this application;
FIG. 10 is a schematic diagram of a model architecture of another video semantic analysis model according to an embodiment of this application;
FIG. 11 is a schematic flowchart of another short video generation method according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of a terminal device according to an embodiment of this application;
FIG. 13 is a schematic diagram of a software architecture of a terminal device according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of a server according to an embodiment of this application;
FIG. 15 is a schematic structural diagram of a short video generation apparatus according to an embodiment of this application.
Detailed description of embodiments
To facilitate understanding of the technical solutions of the embodiments of this application, the application scenarios to which the related technologies of this application are applicable are first introduced.
FIG. 1 is a schematic diagram of an application scenario of the short video generation method provided by the embodiments of this application. The technical solution of this application is applicable to scenarios in which short videos are generated from one or more videos and sent to various application platforms for sharing or storage. The relationship between videos and short videos may be one-to-one, many-to-one, one-to-many, or many-to-many; that is, one or more short videos may be generated from one video, or from multiple videos. In fact, the short video generation method is the same in all of these cases, so the embodiments of this application are described using the example of generating one or more short videos corresponding to a single target video.
The application scenario of the embodiments of this application can give rise to various specific business scenarios for different services. For example, in the video sharing scenario of a social application or short video platform, a user may shoot a video, decide to generate a short video from it, and then share the generated short video with friends on the social application or publish it on the platform. In a driving recorder scenario, a short video can be generated from a recorded driving video and uploaded to a traffic police platform. In a storage cleanup scenario, corresponding short videos can be generated for all videos in a storage space and saved in the album, after which the original videos in the storage space are deleted, compressed, or migrated to save storage space. As another example, for video content such as movies, TV series, and documentaries, a user may want to browse the content through a video summary of a few minutes and select the videos of interest to watch; the technical solution of this application is likewise applicable to generating video summaries, or condensed videos, of such content for convenient browsing.
The short video generation method in the embodiments of this application may be implemented by a short video generation apparatus. The short video generation apparatus in the embodiments of this application may be a terminal device or a server.
When implemented by a terminal device, the terminal device should have the functional modules or chips (for example, a video semantic analysis module and a video playback module) that implement the technical solution to generate short videos; an application installed on the terminal device may also call the terminal device's local functional modules or chips to generate short videos.
When implemented by a server, the server should have the functional modules or chips (for example, a video semantic analysis module) that implement the technical solution to generate short videos. The server may be a storage server for storing data, which can use the technical solution of the embodiments of this application to generate short videos from its stored videos as a kind of video summary, and on that basis organize, classify, retrieve, compress, and migrate video data, improving storage space utilization and data retrieval efficiency. The server may also be a server corresponding to a client or web page that has a short video generation function, where the client may be an application installed on a terminal device or a mini program carried on an application, and the web page may be a page running in a browser. In the scenario shown in FIG. 2, after the terminal device obtains a short video generation instruction triggered by the user, it sends the target video to the server corresponding to the client; the server generates the short video and returns it to the terminal device, and the terminal device shares and stores the short video. For example, user A taps a short video generation instruction on the short video client; the terminal device transmits the target video to the background server for short video generation, and after the server generates the short video and returns it, user A can share the short video with user B or store it in the drafts box, the gallery, or another storage space. In the scenario shown in FIG. 3, the user of terminal device A may trigger a short video sharing instruction carrying a target user identifier; besides returning the short video to the terminal device for sharing and storage, the server may also share it directly with terminal device B corresponding to the target user identifier. For example, user A taps a short video sharing instruction carrying the identifier of target user B on the short video client; the client transmits the target video and the identifier of target user B to the server, and after generating the short video, the server may send it directly to terminal device B corresponding to the identifier of target user B, and may also return it to terminal device A. Further, the terminal device may interact with the server in more detail during short video generation; for example, the server may send the segmented video clips to the terminal device, the terminal device sends the video clips or video clip identifiers selected by the user to the server, and the server then generates the short video according to the user's selection. It can therefore be understood that the foregoing implementation scenarios only show, by way of example, some of the scenarios to which the technical solution of this application is applicable.
Based on the foregoing example scenarios, the terminal device in the embodiments of this application may specifically be a mobile phone, a tablet computer, a laptop computer, a vehicle-mounted device, a wearable device, or the like, and the server may specifically be a physical server, a cloud server, or the like.
In this application scenario, generating the short video corresponding to a video involves three stages: video segmentation, video segment selection, and video segment synthesis. Specifically, the terminal device divides the video into multiple meaningful video segments, then selects from them the important segments that can be used to generate a short video, and finally synthesizes the selected segments to obtain the short video corresponding to the video. The technical solution of the embodiments of this application optimizes these three stages.
It should be understood that the application scenarios described in the embodiments of this application are intended to illustrate the technical solutions of the embodiments more clearly and do not constitute a limitation on them; a person of ordinary skill in the art will know that, as new application scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
Refer to FIG. 4, which is a schematic flowchart of a short video generation method provided by an embodiment of this application. The method includes, but is not limited to, the following steps:
S101: Obtain a target video.
In this embodiment of this application, the target video includes multiple frames of video images and is the video used to generate the short video; it can also be understood as the material from which the short video is generated. For ease of subsequent description, m may denote the number of frames of the target video, that is, the target video includes m frames of video images, where m is a positive integer greater than or equal to 1.
Based on the description of the foregoing application scenarios, the target video may be a video just shot by the terminal device, for example, a video shot after the user turns on the shooting function of a social application or short video platform. The target video may also be a historical video stored in a storage space, for example, a video in the media database of a terminal device or server. The target video may further be a video received from another device, for example, the video carried in a short video generation indication message received by a server from a terminal device.
S102: Obtain, through semantic analysis, the start-end time of at least one video segment in the target video and the probability of the semantic category to which it belongs.
In this embodiment of this application, the semantic analysis may be implemented by a machine learning model, referred to in this application as a video semantic analysis model. The video semantic analysis model can implement the video segmentation stage of the three stages in FIG. 1 and provide probability data to support the video segment selection stage. Video segmentation in this embodiment can be understood as video segmentation based on video semantic analysis, whose purpose is to determine the video segments in the target video that belong to one or more semantic categories, where a video segment refers to k consecutive frames of video images and k is a positive integer less than or equal to m. It can be seen that, unlike video segments formed in the prior art by screening and recombining single frames after per-frame recognition, this embodiment directly segments out the video segments with continuous semantics in the target video, which prevents the final short video from being too jumpy, saves synthesis time, and improves the efficiency of short video generation.
Specifically, the video semantic analysis model may have an image feature extraction function that extracts n-dimensional feature data of each frame of video image in the target video, where n is a positive integer. The n-dimensional feature data can reflect the spatial features of a frame of video image; in this embodiment, the specific feature extraction method is not limited, and each dimension of feature data need not correspond to a specific attribute feature. The extraction may target attribute feature dimensions such as RGB parameters, or yield abstract feature data obtained by fusing multiple features extracted by, for example, a neural network. The video semantic analysis model can then generate an m*n video feature matrix based on the temporal order of the m frames of video images included in the target video. The video feature matrix here can be understood as a kind of spatio-temporal feature map, which reflects both the spatial features of each frame of video image and the temporal order of the frames. FIG. 5 shows an exemplary video feature matrix, in which each row represents the n-dimensional feature data of one frame of video image and the rows are arranged in the temporal order of the target video.
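By way of a non-limiting example, the sketch below builds such an m*n matrix with a standard ResNet-50 backbone (its classification head removed) standing in for the unspecified feature extractor, so n is 2048 here; any other extractor would do equally well.

```python
import torch
from torchvision import models

def video_feature_matrix(frames):
    """Stack per-frame pooled CNN features into an m*n matrix (a sketch).

    frames: float tensor of shape (m, 3, H, W), one normalized image per
    video frame, already ordered by time.
    """
    backbone = models.resnet50(weights=None)   # random weights here; a trained
                                               # extractor would be used in practice
    extractor = torch.nn.Sequential(*list(backbone.children())[:-1])  # drop the fc head
    extractor.eval()
    with torch.no_grad():
        feats = extractor(frames)              # (m, 2048, 1, 1) after global pooling
    return feats.flatten(1)                    # video feature matrix of shape (m, n=2048)
```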
By performing feature extraction on the target video, the target video, which spans the two dimensions of time and space, can be converted into a spatial-dimension feature map presentable within a single video feature matrix, laying the foundation for the subsequent segmentation and selection of segments of the target video. Compared with existing recurrent network models that chain each frame of video image in time for sequence modeling, the video semantic analysis model of this embodiment can thus be designed more simply, so that it computes faster and reduces computation time and resource usage.
The video semantic analysis model can identify at least one corresponding continuous semantic feature sequence from the video feature matrix. A continuous semantic feature sequence is a continuous feature sequence predicted by the video semantic analysis model to belong to one or more semantic categories and may include the feature data of one frame or of multiple consecutive frames. Taking FIG. 5 as an example again, the feature data enclosed by the first box and the second box correspond to continuous semantic feature sequence a and continuous semantic feature sequence b, respectively. A semantic category may be a broad category such as a behavior category, an expression category, an identity category, or a scene category, or a subordinate category within a broad category, for example a ball-playing category or a handshake category within the behavior category. Understandably, semantic categories can be defined according to actual business needs.
Understandably, each continuous semantic feature sequence can correspond to one video segment. For example, continuous semantic feature sequence a in FIG. 5 corresponds to the consecutive video images of the first and second frames of the target video. It can be seen that the implementation scenario of this embodiment mainly concerns content in the time domain, so one output of the video semantic analysis model is the start-end time of the video segment corresponding to a continuous semantic feature sequence. For example, the start-end time of the video segment corresponding to continuous semantic feature sequence a is the start time t1 of the first frame and the end time t2 of the second frame, output as (t1, t2). In addition, when the video semantic analysis model predicts the semantic category of a continuous semantic feature sequence, it actually predicts the probability that the features of the sequence match each semantic category and determines the best-matching category as the semantic category to which the sequence belongs; that category is also associated with a prediction probability. The video semantic analysis model of this embodiment can output this probability, so the model can determine, for each continuous semantic feature sequence, the start-end time of the corresponding video segment together with the probability of the semantic category to which it belongs.
In a possible implementation scenario, the video semantic analysis model may have the model architecture shown in FIG. 6, specifically including a convolutional neural network (CNN) 10, a feature pyramid network (FPN) 20, a sequence proposal network (SPN) 30, and a first fully connected layer 40. S102 is described in detail below with respect to this model architecture.
First, the obtained target video is input into the CNN. A CNN is a common classification network, generally including an input layer, convolutional layers, pooling layers, and fully connected layers. The function of the convolutional layers is to extract features from the input data; after feature extraction by the convolutional layers, the output feature maps are passed to the pooling layers for feature selection and information filtering, and the retained information consists of scale-invariant features, which best express the image. This embodiment uses the feature extraction functions of these two kinds of layers: the output of the pooling layer serves as the n-dimensional feature data of each frame of video image in the target video, and the m*n video feature matrix is generated based on the temporal order of the m frames of video images included in the target video. It should be noted that this embodiment does not limit the specific model structure of the CNN; classic image classification networks such as ResNet, GoogleNet, and MobileNet are all applicable to the technical solution of this embodiment.
Then, the m*n video feature matrix is passed to the FPN. Generally, when a network is used to detect objects, the shallow layers have high resolution and learn the detailed features of the image, while the deep layers have low resolution and learn more semantic features; for this reason, most object detection algorithms use only the top-level features for prediction. However, because one feature point of the deepest feature map maps to a relatively large region of the original image, small objects cannot be detected, resulting in low detection performance. The detailed features of the shallow layers then become especially important. As shown in FIG. 7, an FPN is a network that fuses features across layers: through top-down lateral connections, it combines the high-level features, which have low resolution and rich semantic information, with the low-level features, which have high resolution and weak semantic information, generating feature maps at multiple scales where every level carries rich semantic information, so recognition becomes more accurate. Because the feature maps become smaller toward the top, the structure takes the shape of a pyramid. In this embodiment, the video feature matrix is converted into such multi-layer feature maps.
The implementation principle of the FPN is illustrated with a 50-layer deep residual network (ResNet50) as an example. As shown in FIG. 8, the network first propagates forward from the bottom up, applying 2x-downsampling convolutions to the lower-layer features in turn to obtain four feature maps C2, C3, C4, and C5. Each feature map then undergoes a 1*1 convolution, followed by top-down lateral connections: starting from M5, 2x upsampling is performed and the result is summed with the 1*1 convolution result of C4 to obtain M4; M4 and C3 are fused in the same way, and so on. Finally, M2, M3, M4, and M5 each undergo a 3*3 convolution to obtain P2, P3, P4, and P5, and M5 is downsampled by a factor of 2 to obtain P6. P2, P3, P4, P5, and P6 form the five-level feature pyramid.
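The following PyTorch sketch mirrors this construction (1*1 lateral convolutions, top-down 2x upsampling and summation, 3*3 output convolutions, and P6 obtained by 2x downsampling of M5); the channel counts (256, 512, 1024, 2048) for C2-C5 of ResNet50 are standard but assumed here for illustration.

```python
import torch.nn.functional as F
from torch import nn

class TinyFPN(nn.Module):
    """Top-down feature pyramid over backbone maps C2..C5 (a sketch)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # one 1*1 lateral conv and one 3*3 output conv per backbone level
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        m5 = self.lateral[3](c5)
        m4 = self.lateral[2](c4) + F.interpolate(m5, scale_factor=2)  # 2x up + sum
        m3 = self.lateral[1](c3) + F.interpolate(m4, scale_factor=2)
        m2 = self.lateral[0](c2) + F.interpolate(m3, scale_factor=2)
        p2, p3, p4, p5 = (conv(m) for conv, m in zip(self.output, (m2, m3, m4, m5)))
        p6 = F.max_pool2d(m5, kernel_size=1, stride=2)   # 2x downsampling of M5
        return p2, p3, p4, p5, p6
```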
The feature pyramid is then passed to the SPN. The SPN can generate corresponding candidate boxes for each level's feature map of the feature pyramid, which are used to determine the continuous semantic feature sequences. For a clearer understanding of the SPN, the region proposal network (RPN) is introduced first.
An RPN is a region proposal network, generally used for object detection in images (object detection, face detection, and the like) to determine the specific region of an object in the image. As shown in FIG. 9, feature extraction on an image yields a feature map, which can be understood as a matrix of feature data characterizing the image's features, with one feature point in the feature map representing one piece of feature data. The feature points in the feature map have a one-to-one mapping relationship with the original image; as in FIG. 9, one feature point maps to a small box in the original image, whose specific size depends on the scale ratio between the original image and the feature map. Using the center point of the small box as an anchor point, a set of anchor boxes can be generated; the number of anchor boxes and the aspect ratio of each anchor box can be preset. For example, the three large boxes shown in FIG. 9 are a set of anchor boxes generated according to a preset number and aspect ratios. Understandably, every feature point in the feature map corresponds to such a set of anchor boxes mapped onto the original image, so p*s anchor boxes are mapped onto the original image, where p is the number of feature points in the feature map and s is the number of anchor boxes in a preset set. While determining the anchor boxes, the RPN also performs foreground/background judgment on the image inside each anchor box, obtaining a foreground score and a background score; the anchor boxes ranked highest by foreground score can be screened out as the actual anchor boxes, and the number selected can be set as needed. In this way, useless background content is filtered out and the anchor boxes are concentrated on the regions with more foreground content, facilitating subsequent category recognition. When training the RPN, the training samples are the center positions and the length-width scales of the ground-truth boxes; training drives the gap between the predicted candidate box and the anchor box to approximate, as closely as possible, the gap between the anchor box and the ground-truth box, so that the candidate boxes output by the model become more accurate. Because training is referenced to the gap between the candidate box and the anchor box, when the RPN is applied to extract candidate boxes, it outputs the offsets of the predicted candidate box relative to the anchor box, namely the translation of the center position (t_x, t_y) and the change in the length-width scales (t_w, t_h).
The SPN in this embodiment is basically similar in principle to the RPN. The main difference is that the feature points in each level's feature map of the feature pyramid are mapped not onto an original image but onto the video feature matrix; the candidate boxes are therefore also generated on the video feature matrix, so that a candidate box extracts a feature sequence rather than a region. In addition, a candidate box generated on the video feature matrix carries both temporal and spatial information, and, as noted above, this embodiment mainly concerns content in the time domain. On the video feature matrix, the length represents the time dimension and the width represents the space dimension; only the length of a candidate box matters, not its width. Therefore, when the length and width of the candidate boxes are preset in this embodiment, the width can remain fixed. The SPN thus does not need to continually adjust the search over spatial ranges of different lengths and widths as an RPN does; it searches only along the length dimension, saving search time and further reducing the model's computation time and resource usage. Specifically, the width can be kept equal to the n dimensions of the video feature matrix so that a candidate box encloses all the features, and what is extracted is full-dimensional feature data over various time periods.
For example, if the P2-level feature map is 256*256 and its stride relative to the video feature matrix is 4, a feature point on P2 corresponds to a 4*4 small box generated on the video feature matrix as its anchor point. If four base sequence-length values {50, 100, 200, 400} are set, then, centered on the anchor point, each feature point generates four anchor boxes with lengths {4*50, 4*100, 4*200, 4*400}, and the width of each anchor box is n, so as to enclose the n-dimensional data.
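A sketch of this anchor generation for one pyramid level follows; since the width is fixed to all n feature dimensions, each anchor is represented only by its start and end positions along the time axis. The helper name and defaults are illustrative.

```python
def generate_anchors(num_points, stride, base_lengths=(50, 100, 200, 400)):
    """Anchors on the video feature matrix for one pyramid level (a sketch).

    Each feature point maps to a stride-sized cell on the time axis; around
    its center we place one anchor per base length, scaled by the stride,
    exactly as in the P2 example above (stride 4 -> lengths 4*50 .. 4*400).
    """
    anchors = []
    for i in range(num_points):
        center = (i + 0.5) * stride            # anchor point on the time axis
        for base in base_lengths:
            length = base * stride
            anchors.append((center - length / 2, center + length / 2))
    return anchors
```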
That is to say, in this embodiment, a change in the center position of a candidate box is only an offset along the length direction, and a change in its scale is only an increase or decrease in length. Therefore, the training samples of the SPN can be feature sequences of multiple semantic categories together with the coordinates, along the length dimension, of the center positions of the labeled ground-truth boxes and their length values. Correspondingly, when the SPN is applied to extract candidate boxes, it outputs the offsets of the predicted candidate box relative to the anchor box along the length direction, namely the translation of the center position along the length direction (t_y) and the change in length (t_h). The candidate box is determined from these offsets, thereby selecting, in the video feature matrix, a continuous sequence containing an object, that is, a continuous semantic feature sequence. It should be noted that, apart from the width coordinate not needing to be considered, the training method of the SPN, including the loss function, classification error, and regression error, is similar to that of the RPN and is not described again here.
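The decoding step could look like the sketch below, which borrows the usual RPN-style box-regression parameterization restricted to the length axis (center shifted by t_y times the anchor length, length rescaled by exp(t_h)); the exact parameterization is an assumption, since the text only names the two offsets.

```python
import math

def decode_segment(anchor, t_y, t_h):
    """Apply the predicted offsets (t_y, t_h) to one anchor (a sketch).

    anchor: (t_start, t_end) along the time axis of the video feature matrix.
    The width dimension is left untouched, matching the fixed-width design.
    """
    a_start, a_end = anchor
    a_len = a_end - a_start
    a_center = (a_start + a_end) / 2
    center = a_center + t_y * a_len        # shift along the length direction
    length = a_len * math.exp(t_h)         # rescale the length
    return center - length / 2, center + length / 2
```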
Understandably, every feature point on every level's feature map of the feature pyramid maps to multiple candidate boxes of preset sizes in the video feature matrix. Such a large number of candidate boxes may overlap one another, which would cause many duplicate sequences to be cut out in the end. Therefore, after the candidate boxes are generated, non-maximum suppression (NMS) can further be applied to filter out overlapping redundant candidate boxes, retaining only the most informative ones. NMS screens according to the intersection-over-union (IoU) between overlapping candidate boxes; since NMS is already a common filtering method for candidate or detection boxes, it is not described in detail here.
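Restricted to the time axis, NMS reduces to the one-dimensional sketch below (the 0.7 threshold is an arbitrary illustrative value, not one specified by the application).

```python
def nms_1d(segments, scores, iou_threshold=0.7):
    """Non-maximum suppression over time segments (a sketch).

    segments: list of (start, end); scores: matching confidence per segment.
    Keeps the highest-scoring segment, drops any remaining segment whose
    temporal IoU with it exceeds the threshold, and repeats.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        kept.append(best)
        survivors = []
        for i in order:
            inter = max(0.0, min(segments[best][1], segments[i][1])
                        - max(segments[best][0], segments[i][0]))
            union = ((segments[best][1] - segments[best][0])
                     + (segments[i][1] - segments[i][0]) - inter)
            if inter / union <= iou_threshold:
                survivors.append(i)
        order = survivors
    return [segments[i] for i in kept]
```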
Further, because each level's feature map in the feature pyramid has a different size ratio relative to the video feature matrix, the continuous semantic feature sequences cropped by the candidate boxes can also differ greatly in size, and it would be difficult for the subsequent fully connected layer to resize them all to the same fixed size before classifying them. Therefore, each continuous semantic feature sequence can be mapped, according to its length, to a particular level of the feature pyramid, so that the sizes of the multiple continuous semantic feature sequences are as close as possible. In this embodiment, the larger the continuous semantic feature sequence, the higher the level of the feature map selected for mapping; the smaller the sequence, the lower the level. Specifically, the level d of the feature map onto which a continuous semantic feature sequence is mapped can be calculated by the following formula:
d = [d_0 + log_2(wh/244)]
where d_0 is the initial level; in the embodiment shown in FIG. 8, P2 is the initial level, so d_0 is 2, and w and h are the width and length, respectively, of the continuous semantic feature sequence in the video feature matrix. Understandably, w remains fixed, and the larger h is, the larger d is, so the level of the feature map used for mapping is higher.
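Numerically, the level assignment is a one-liner; the sketch below also clamps the result to the levels P2-P5, which is an assumption not stated in the formula itself.

```python
import math

def pyramid_level(h, w, d0=2, d_max=5):
    """Pyramid level for a sequence of length h and fixed width w (a sketch).

    Implements d = floor(d0 + log2(w*h / 244)) from the description, with
    the result clamped to the available levels (clamping is an assumption).
    """
    d = int(math.floor(d0 + math.log2(w * h / 244)))
    return max(d0, min(d, d_max))
```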
Afterwards, the continuous semantic feature sequences cropped after mapping onto the feature maps of the corresponding layers can be resized and input into the first fully connected layer 40. The first fully connected layer 40 performs semantic classification on each continuous semantic feature sequence and outputs the probability that the video segment corresponding to the sequence belongs to a semantic category; at the same time, it can also output, according to the center and length offsets of the sequence, the start-stop times of the video segment, that is, the start time and the end time.

From the above description it can be seen that, in this embodiment of the present application, the SPN substitutes the video feature matrix for the original image, applying the candidate-box generation method originally used for image recognition in the spatial domain to the spatio-temporal domain, so that the candidate box changes from enclosing an object region in an image to enclosing a time range in a video. The purpose of directly identifying the video segments containing semantic categories in the target video is thus achieved, without the need to recognize and screen frame by frame.

From the above description it can also be seen that the model architecture of FIG. 6 can be used to recognize dynamic continuous semantics, such as dynamic behaviors, expressions, and scenes. For categories such as static scenes, however, there is little difference between frames; if the model architecture of FIG. 6 were still used, computing time would be wasted and the recognition would not be accurate. Two implementation scenarios can therefore be distinguished.

In the first possible implementation scenario, the model of FIG. 6 is used to identify the semantic category of each video segment, where the semantic category may include any one or more of at least one action category, at least one expression category, at least one identity category, at least one dynamic scene, and the like. It can be seen that in this scenario mainly dynamic semantics such as actions, expressions, faces, and dynamic scenes are recognized, so the video semantic analysis model of FIG. 4 can directly yield the probability of the behavior category of at least one video segment. Specifically, a video segment may belong to a single semantic category; for example, the probability that the video segment with start-stop times t1-t2 belongs to the ball-kicking category is 90%. It may also belong to multiple semantic categories; for example, the probability that the video segment with start-stop times t3-t4 belongs to the ball-kicking category is 90%, to the laughing category 80%, and to a certain face 85%; in this case, the probability of the behavior category of the t3-t4 video segment may be the sum of these three probabilities. In this implementation scenario, the video semantic analysis model mainly recognizes dynamic semantic categories, and this model can be used when dynamic semantic categories already match the user's perception of the importance of video segments.
In the second possible implementation scenario, the probability of the semantic category may include the probability of the behavior category and the probability of the scene category. In this scenario, as shown in FIG. 10, another, second fully connected layer 50 can be introduced after the CNN in the video semantic analysis model to identify, from the n-dimensional feature data of each frame of video image in the target video, the probability of the scene category of that frame. The video semantic analysis model can then output the start-stop times of at least one video segment, the probability of its behavior category, and the probability of its scene category. It can be understood that the scene-category probability output by the model in this scenario may be the scene-category probability of each frame corresponding to the start-stop times, or the scene-category probability of every frame of the target video. In this implementation scenario, the recognition paths of the scene category and the behavior category are separated: the scene category, whether static or dynamic, is recognized in the conventional single-frame manner, that is, solely through the CNN 10 and the second fully connected layer 50, while the FPN 20, the SPN 30, and the first fully connected layer 40 concentrate on recognizing dynamic behavior categories. In this way, the processing strengths of each network are exploited; the categories of static scenes are added to the output while computing time is saved and recognition accuracy is improved.
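By way of a non-limiting illustration, the division of labor in this second scenario can be sketched as follows, where action_path stands in for the trained CNN + FPN + SPN + first-fully-connected path and scene_head for the CNN + second-fully-connected path; both names are placeholders rather than actual implementations.

```python
def analyze_video(frame_features, action_path, scene_head):
    """Route the shared per-frame CNN features to the two heads:
    the action path returns segments with start/end times and
    behavior-category probabilities; the scene head returns a
    scene-category probability for every frame."""
    segments = action_path(frame_features)                  # [(start, end, {category: prob}), ...]
    scene_probs = [scene_head(f) for f in frame_features]   # one {scene: prob} dict per frame
    return segments, scene_probs
```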
S103: Generate, according to the start-stop times of the at least one video segment and the probability of its semantic category, a short video corresponding to the target video from the at least one video segment.

According to the start-stop times of the at least one video segment, the short-video generation apparatus can determine the video segments with semantic categories in the target video, then screen out the segments that meet the requirements according to the probability of the semantic category in combination with set screening rules, and finally generate the short video corresponding to the target video. The screening rule may be a preset short-video duration or number of frames, or the user's interest in various semantic categories, among others.

It can be understood that the magnitude of the probability of a video segment's semantic category can represent the diversity and accuracy of the semantic components in the segment. Therefore, in this embodiment of the present application, the probability of the semantic category is used as an indicator of the importance of a video segment, so as to screen out, from the at least one video segment, the segments used to generate the short video. Specifically, the short video is generated in different ways in the different scenarios mentioned above.

In the first implementation scenario described above, the short video can be generated in two ways.

In the first way of the first possible implementation scenario, the short-video generation apparatus may determine at least one summary video segment from the at least one video segment in sequence, according to the order of magnitude of the probabilities of the semantic categories and the start-stop times, then obtain the at least one summary video segment and synthesize the short video corresponding to the target video.
It can be understood that a short video is characterized by its short duration, and there are certain requirements on its length, so the at least one video segment needs to be screened in combination with the short-video duration. In the first way, the short-video generation apparatus can sort the at least one video segment by the magnitude of the probability of its semantic category and then, combining the start-stop times of each segment with the short-video duration, select at least one summary video segment in sequence, such that the sum of the durations of the selected summary segments is not greater than the preset short-video duration. For example, suppose the video semantic analysis model splits out three video segments, sorted by probability as segment C, 135%; segment B, 120%; and segment A, 90%, where segment A is 10 s long, segment B is 5 s, and segment C is 2.5 s. If the preset short-video duration is 10 s, segment C is selected first, then segment B; when segment A is considered last, the sum of the durations would exceed 10 s, so segment A is not selected. Only segments C and B are selected, and the short video is generated from them. Further, transition effects and the like can be added between multiple summary video segments to fill the remaining time of the short-video duration.
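By way of a non-limiting illustration, the selection logic of the example above can be sketched as follows; the (start, end, probability) tuples and the greedy skipping of segments that no longer fit are assumptions consistent with the example.

```python
def pick_summary_segments(segments, max_duration):
    """Walk the segments in descending semantic-category probability and
    keep each one whose duration still fits the remaining budget.

    segments: list of (start, end, prob); a segment's duration is end - start.
    """
    chosen, used = [], 0.0
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        duration = seg[1] - seg[0]
        if used + duration <= max_duration:
            chosen.append(seg)
            used += duration
    return chosen

# Segments A (10 s), B (5 s), C (2.5 s) with probabilities 0.90 / 1.20 / 1.35:
clips = [(0.0, 10.0, 0.90), (20.0, 25.0, 1.20), (30.0, 32.5, 1.35)]
print(pick_summary_segments(clips, max_duration=10.0))  # keeps C and B, skips A
```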
On the other hand, if the difference between the sum of the durations of the at least one summary video segment and the preset short-video duration does not exceed a preset threshold, the summary video segments may also be trimmed to meet the duration requirement. For example, the short-video generation apparatus may trim the last summary segment in the order, or trim part of every summary segment, finally generating a short video that satisfies the short-video duration. For instance, if segment A in the above example were 3 s long, the last 0.5 s of segment A could be cut, or 0.2 s could be cut from each of the three segments, among other options, to generate a short video within 10 s. Similarly, if segment C in the above example were 11 s long, segment C would also need to be trimmed to meet the short-video duration.

Further, when the short video is generated, the short-video generation apparatus may, according to the start-stop times of the at least one summary video segment, cut the corresponding summary segments out of the target video and then splice them into the short video. Specifically, the at least one summary video segment can be spliced in descending order of the probability of its semantic category, so that the important summary segments are presented in the front part of the short video, highlighting the key points and attracting the user's interest. The segments can also be spliced in their chronological order in the target video, so that the short video follows the real timeline of the target video and restores its original temporal thread.
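By way of a non-limiting illustration, the two splicing orders can be expressed as a single sort, as sketched below with illustrative (start, end, probability) tuples.

```python
def order_for_splicing(summary_segments, by="probability"):
    """Arrange summary segments either in descending probability
    (important clips first) or by start time (the video's real timeline)."""
    if by == "probability":
        return sorted(summary_segments, key=lambda s: s[2], reverse=True)
    return sorted(summary_segments, key=lambda s: s[0])

segs = [(20.0, 25.0, 1.20), (30.0, 32.5, 1.35)]
print(order_for_splicing(segs))                 # by importance
print(order_for_splicing(segs, by="timeline"))  # by original timeline
```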
Besides the above ways, there are other ways of trimming and splicing the summary video segments and adding special effects; the audio and the images of the target video can also be synthesized separately, and subtitle information can be screened according to the start-stop times of the summary video segments and added to the corresponding segments, among other options. Since a variety of prior techniques already exist for these video-editing methods, they are not described in detail in this application.

Based on the above description, it can be seen that the probability of a video segment's semantic category can indicate its importance. Therefore, screening the at least one video segment based on this probability makes it possible to present the more important video segments, as far as possible, within the preset short-video duration.

In the second way of the first implementation scenario, the short-video generation apparatus may cut the video segments out of the target video according to the start-stop times of each segment and display them sorted by the magnitude of the probabilities of their semantic categories. When a selection instruction for any one or more video segments is received, the selected segments are determined to be summary video segments, and the short video corresponding to the target video is synthesized from the at least one summary video segment.

In the second way, the short-video generation apparatus first cuts the video segments out of the target video according to their start-stop times and then presents them to the user sorted by the magnitude of the probabilities of their semantic categories. The user can thus view and select these segments according to his or her interests or preferences and, through selection instructions such as touches or clicks, choose one or more of them as summary video segments, from which the short video is then generated. The method of generating the short video from the summary segments is similar to the first way and is not repeated here. It can be seen that the second way interacts with the user, presenting the split video segments in order of importance; after the user makes a selection based on his or her own interests or needs, the corresponding short video is generated, so that the short video better meets the user's needs.

Optionally, when generating the short video corresponding to the target video from the at least one video segment, the short-video generation apparatus may first obtain topic keywords entered by the user or taken from history records, match the semantic category of the at least one video segment against the topic keywords, determine the video segments whose degree of matching satisfies a threshold as topic video segments, and then generate the short video corresponding to the target video from the at least one topic video segment.

Further optionally, when generating the short video corresponding to the target video from the at least one video segment, the short-video generation apparatus may first perform temporal segmentation on the target video to obtain the start-stop times of at least one split segment, then determine at least one overlapping segment between the video segments and the split segments according to the start-stop times of the at least one video segment and of the at least one split segment, and then generate the short video corresponding to the target video from the at least one overlapping segment.
Specifically, kernel temporal segmentation (KTS) can be performed on the target video. KTS is a kernel-based change-point detection algorithm that detects jump points in a signal by focusing on the consistency of one-dimensional signal features, and it can distinguish whether a signal jump is caused by noise or by a change of content. In this embodiment of the present application, KTS can statistically analyze the feature data of each frame of the input target video and detect the jump points of the signal, thereby dividing video segments of different content and splitting the target video into several non-overlapping split segments, so that the start-stop times of at least one split segment are obtained. Then, combined with the start-stop times of the at least one video segment, at least one overlapping segment between the video segments and the split segments is determined. For example, if the start-stop times of a split segment are t1-t2 and those of a video segment are t1-t3, the overlapping segment is the segment corresponding to t1-t2. Finally, with reference to the two ways of the first possible implementation scenario above, a summary video segment can be determined from the at least one overlapping segment to generate the short video corresponding to the target video.
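By way of a non-limiting illustration, the intersection of the semantically recognized segments with the KTS split segments can be computed as sketched below; the interval representation is an illustrative assumption.

```python
def overlapping_segments(semantic_segs, kts_segs):
    """Intersect each semantic segment with each KTS split segment and
    keep the non-empty overlaps, e.g. (t1, t3) and (t1, t2) overlap on (t1, t2)."""
    overlaps = []
    for s_start, s_end in semantic_segs:
        for k_start, k_end in kts_segs:
            start, end = max(s_start, k_start), min(s_end, k_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

print(overlapping_segments([(1.0, 8.0)], [(0.0, 5.0), (5.0, 12.0)]))
# -> [(1.0, 5.0), (5.0, 8.0)]
```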
It can be seen that the split segments obtained by KTS have high internal content consistency, whereas the video segments identified by the video semantic analysis model are segments with semantic categories, which indicates their importance. The overlapping segments obtained by combining the two segmentation methods are therefore high in both content consistency and importance; at the same time, the results of the video semantic analysis model can be corrected, so the generated short video is more coherent and better meets the user's needs.

In the second possible implementation scenario above, the probability of the semantic category includes the probability of the behavior category and the probability of the scene category. Since the probability of the behavior category applies to a whole video segment, whereas the probability of the scene category applies to each frame of video image within a segment, the two probabilities can first be integrated before the summary video segments are selected. That is, the average category probability of the at least one video segment can first be determined according to the start-stop times and behavior-category probability of each video segment and the scene-category probability of each frame of video image in each segment, and the short video corresponding to the target video can then be generated from the at least one video segment according to the average category probabilities.
Specifically, for each video segment, the short-video generation apparatus can determine the multiple frames of video image corresponding to the segment and their number according to the start-stop times of the segment, and take the behavior-category probability of the segment as the behavior-category probability of each of those frames; that is, the behavior-category probability of each frame of the segment is the same as that of the whole segment. Then, the scene-category probability of each of those frames output by the video semantic analysis model is obtained, and the sum, over the frames corresponding to the segment, of each frame's behavior-category probability and scene-category probability is divided by the number of frames to obtain the average category probability of the segment. In this way, the average category probability of the at least one video segment is finally determined.
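By way of a non-limiting illustration, the averaging described above can be sketched as follows, with illustrative probability values.

```python
def average_category_prob(seg_action_prob, frame_scene_probs):
    """Average category probability of one segment: every frame inherits
    the segment's behavior-category probability, each frame contributes
    its own scene-category probability, and the per-frame sums are
    averaged over the number of frames."""
    n = len(frame_scene_probs)
    total = sum(seg_action_prob + p for p in frame_scene_probs)
    return total / n

# A 4-frame segment with behavior probability 0.9 and per-frame scene probabilities:
print(average_category_prob(0.9, [0.6, 0.7, 0.8, 0.5]))  # -> 1.55
```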
When generating the short video corresponding to the target video from the at least one video segment according to the average category probabilities, the short-video generation apparatus can sort by the magnitude of the average category probability and automatically determine the summary video segments, or the user can designate the summary video segments, after which the short video is synthesized from them. The specific details are similar to the two ways of the first scenario and can be found in the description above, so they are not repeated here. Likewise, in this implementation scenario, subsequent operations can also be performed on the basis of the overlapping segments obtained after the KTS segmentation described above, which is also not repeated here.

Based on the above technical solutions, it can be seen that this embodiment of the present application uses the video semantic analysis model to identify the video segments with one or more semantic categories in the target video, so as to directly extract the continuous video segments that best reflect the content of the target video for synthesizing the short video. This not only takes into account the coherence of the content between frames of the target video, improving the presentation of the short video and making its content better satisfy the user's actual needs, but also improves the efficiency of short-video generation.

Further, in some business scenarios to which the embodiments of this application are applicable (for example, the short-video sharing scenario of social software), the short video can also be generated in combination with the user's interests, so that it better fits the user's preferences. Referring to FIG. 11, FIG. 11 is a schematic flowchart of another short-video generation method provided by an embodiment of this application. The method includes, but is not limited to, the following steps:

S201: Obtain a target video.

S202: Obtain, through semantic analysis, the start-stop times, the semantic category, and the probability of the semantic category of at least one video segment in the target video.

For the specific implementation of S201-S202, refer to the description of S101-S102. The difference is that S102 may output only the probability of the semantic category, whereas S202 outputs both the semantic category itself and its probability; this is not repeated here.

S203: Determine the interest-category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category.
In this embodiment of the present application, each semantic category has a corresponding category weight, which can be used to characterize the user's degree of interest in that category. For example, the more frequently a semantic category appears among the images or videos in the local database, the more images or videos of that category the user has stored, that is, the more interested the user is, and the higher the category weight can be set. As another example, the more often images or videos of a semantic category are viewed in the historical operation records, the more attention the user pays to that category, and a higher category weight can likewise be set. Specifically, the corresponding category weights can be determined in advance for the various semantic categories, and the category weight corresponding to the semantic category of each video segment can then be invoked directly.

In a possible implementation of this embodiment of the present application, the category weights corresponding to the various semantic categories can be determined through the following steps:

Step 1: Obtain the media data information in the local database and the historical operation records.

In this embodiment of the present application, the local database may be a storage space for storing or processing various types of data, or a dedicated database for storing media data (pictures, videos, and the like), such as a gallery. Historical operation records are records generated by the user's operations on data (browsing, moving, editing, and the like), such as local log files. Media data information refers to various kinds of information about data of types such as images and videos; it may include the images and videos themselves, feature information of the images and videos, operation information of the images and videos, and various statistics of the images and videos, among others.

Step 2: Determine, according to the media data information, the category weights corresponding to the various semantic categories of the media data.

In a possible implementation, the short-video generation apparatus can first determine the semantic categories of the videos and images in the local database and count the number of occurrences of each semantic category. It then determines the semantic categories of the videos and images the user has operated on in the local log files and counts the operation duration and operation frequency of each category. Specifically, semantic analysis can be performed on the videos and images included in the local database and on those operated on by the user in the local log files, finally obtaining the semantic category of each image and each video. In the implementation process, the video semantic analysis model mentioned in step S102 can be used to analyze videos to obtain their semantic categories, and an image-recognition model commonly used in the prior art can be used to analyze images to obtain theirs. The number of occurrences, operation duration, and operation frequency of each semantic category are then counted. For example, suppose there are 6 pictures and 4 videos in the gallery: the ball-playing category appears 5 times, the eating category once, and the smiling category twice. It should be noted that the operations here may include browsing, editing, sharing, and other operations; when counting operation duration and operation frequency, statistics can be kept separately for each operation or as totals over all operations. For example, for the ball-playing category, the browsing frequency may be counted as 2 times/day, the editing frequency as 1 time/day, the sharing frequency as 0.5 times/day, the browsing duration as 20 hours, and the editing duration as 40 hours; alternatively, the overall operation frequency of the ball-playing category may be counted as 3.5 times/day and the operation duration as 60 hours. Finally, the category weight corresponding to each semantic category is calculated from its number of occurrences, operation duration, and operation frequency. Specifically, the weight can be calculated with a preset weight formula combining these statistics, where the formula reflects that the larger the number of occurrences, the operation duration, and the operation frequency, the higher the category weight of the semantic category.
Optionally, the following formula can be used to calculate the category weight w_i of any semantic category i:

w_i = count_freq_i / Σ_{j=1..h} count_freq_j + view_freq_i / Σ_{j=1..h} view_freq_j + view_time_i / Σ_{j=1..h} view_time_j + share_freq_i / Σ_{j=1..h} share_freq_j + edit_freq_i / Σ_{j=1..h} edit_freq_j

where count_freq_i, view_freq_i, view_time_i, share_freq_i, and edit_freq_i are, respectively, the number of occurrences, browsing frequency, browsing time, sharing frequency, and editing frequency of semantic category i in the local database and the historical operation records, and each sum runs over all h semantic categories identified in the local database and the historical operation records.

Finally, the category weights W = (w_1, w_2, ..., w_h) of the h semantic categories can be obtained.
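By way of a non-limiting illustration, a weight of this form can be evaluated as sketched below; since the formula is reconstructed here from the surrounding definitions (the published text renders it as an image), the per-statistic-ratio sum, the function name, and all statistic values are illustrative assumptions.

```python
def category_weight(stats_i, totals):
    """Assumed reading of the weight formula: each statistic of category i
    is normalized by its total over all h categories, and the normalized
    terms are summed. stats_i and totals share the same statistic keys."""
    return sum(stats_i[k] / totals[k] for k in stats_i if totals[k] > 0)

stats_play = {"count": 5, "view_freq": 2.0, "view_time": 20.0,
              "share_freq": 0.5, "edit_freq": 1.0}
totals = {"count": 8, "view_freq": 4.0, "view_time": 40.0,
          "share_freq": 1.0, "edit_freq": 2.0}
print(category_weight(stats_play, totals))  # -> 2.625
```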
Specifically, each video segment may have one or more semantic categories. When there is only one (for example, the handshake category), the category weight of that category can be determined, and the product of the category weight and the probability of the category is calculated as the interest-category probability of the segment. When there are multiple semantic categories (for example, the handshake category and the smiling category), the category weight of each can be determined separately, and the products of each category's weight and probability are then summed to obtain the interest-category probability of the segment. For example, suppose the semantic categories of video segment A include category 1 and category 2, the probability of category 1 is P_1, the probability of category 2 is P_2, and the category weights corresponding to category 1 and category 2 are w_1 and w_2, respectively; then the interest-category probability of segment A is P_w = P_1*w_1 + P_2*w_2.
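By way of a non-limiting illustration, the weighted sum P_w = P_1*w_1 + P_2*w_2 from the example of video segment A can be computed as sketched below; the category names and weight values are illustrative.

```python
def interest_probability(category_probs, category_weights):
    """Interest-category probability of a segment: the sum over its
    semantic categories of probability times category weight."""
    return sum(p * category_weights[c] for c, p in category_probs.items())

print(interest_probability({"handshake": 0.9, "smile": 0.8},
                           {"handshake": 0.7, "smile": 0.3}))  # -> 0.87
```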
Further, since there can be many semantic categories and, as mentioned above, they can be grouped into several broad classes, weights can also be set for the broad classes. For example, the smiling, crying, and angry categories can all be regarded as expression or face categories, while the swimming, running, and ball-playing categories can all be regarded as behavior categories; different broad-class weights can then be set specifically for the face class and the behavior class. The specific setting method can be adjusted by the user, or the broad-class weights can be further determined from the local database and historical operation records described above; since the principle is similar, it is not repeated here.
It should be noted that, in the second possible implementation scenario above, the short-video generation apparatus can first determine the category weights corresponding to the scene-category probability and the behavior-category probability of each frame of video image in each video segment, sum the products of the corresponding probabilities and category weights in the manner described above to determine the weighted probability of each frame, and then divide the sum of the weighted probabilities of the frames by the number of frames to obtain the interest-category probability of the segment.
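By way of a non-limiting illustration, the frame-level weighting for the second scenario can be sketched as follows; the per-frame dictionaries mixing behavior and scene categories are an illustrative assumption.

```python
def frame_weighted_interest(frame_probs, weights):
    """Weight each frame's category probabilities, sum them per frame,
    then average the per-frame sums over the segment's frames."""
    per_frame = [sum(p * weights[c] for c, p in frame.items()) for frame in frame_probs]
    return sum(per_frame) / len(per_frame)

frames = [{"kick": 0.9, "lawn": 0.6}, {"kick": 0.9, "lawn": 0.7}]
print(frame_weighted_interest(frames, {"kick": 0.5, "lawn": 0.4}))  # -> 0.71
```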
S204: Determine, according to the start-stop times and interest-category probabilities of the at least one video segment, the short video corresponding to the target video from the at least one video segment.

The specific implementation of S204 is similar to the two ways of the first possible implementation scenario in S103; the difference is that S103 sorts by the probability of the semantic category, whereas S204 sorts by the interest-category probability. For the specific implementation, refer to S103; it is not repeated here. Likewise, in this implementation scenario, subsequent operations can also be performed on the basis of the overlapping segments obtained after the KTS segmentation described above, which is also not repeated here.

Compared with the two ways of S103, the interest-category probability in S204 jointly reflects two dimensions of a video segment, its importance and its degree of interest to the user; therefore, further selecting summary video segments after sorting makes it possible to present, as far as possible, video segments that are both more important and better matched to the user's interests.

Based on the above technical solutions, it can be seen that, while ensuring the coherence of the short-video content and the efficiency of short-video generation, this embodiment of the present application further analyzes the user's preferences from the local database and the historical operation records, so that the video segments selected for synthesizing the short video are more targeted and better match the user's interests, yielding personalized short videos that vary from user to user.
FIG. 12 is a schematic structural diagram of a terminal device 100 serving as the short-video generation apparatus.

It should be understood that the terminal device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software including one or more signal-processing and/or application-specific integrated circuits.

The terminal device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others. The different processing units may be independent devices or may be integrated in one or more processors.

The controller may be the nerve center and command center of the terminal device 100. The controller can generate operation control signals according to instruction operation codes and timing signals, completing the control of fetching and executing instructions.

A memory may also be provided in the processor 110 to store instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory can hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instruction or data again, it can be called directly from this memory. Repeated accesses are avoided and the waiting time of the processor 110 is reduced, thereby improving the efficiency of the system.

In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.

It can be understood that the interface connection relationships between the modules illustrated in this embodiment of the present application are only schematic and do not constitute a structural limitation on the terminal device 100. In other embodiments of this application, the terminal device 100 may also adopt interface connection manners different from those in the above embodiment, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger.

The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140 and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and so on.

The wireless communication function of the terminal device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.

The terminal device 100 implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform the mathematical and geometric calculations needed for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.

The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The terminal device 100 can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.

The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter opens, light is transmitted through the lens to the photosensitive element of the camera, the light signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP for processing, converting it into an image visible to the naked eye. The ISP can also perform algorithmic optimization on the noise, brightness, and skin tone of the image, and can optimize parameters such as the exposure and the color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.

The camera 193 is used to capture still images or videos. An optical image of an object is generated through the lens and projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal and then passes the electrical signal to the ISP, which converts it into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In this embodiment of the present invention, the camera 193 includes a camera that collects the images required for face recognition, such as an infrared camera or another camera. The camera that collects the images required for face recognition is generally located on the front of the terminal device, for example above the touchscreen, but may also be located elsewhere, which is not limited in this embodiment of the present invention. In some embodiments, the terminal device 100 may include other cameras. The terminal device may also include a dot-matrix emitter (not shown in the figure) for emitting light. The camera collects the light reflected by a face to obtain a face image, and the processor processes and analyzes the face image and compares it with stored face image information for verification.

The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform a Fourier transform and the like on the frequency-point energy.

Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs, so that it can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, and MPEG4.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transfer pattern between neurons of the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the terminal device 100, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the terminal device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example saving files such as music and videos in the external memory card.

The internal memory 121 can be used to store computer-executable program code, where the executable program code includes instructions. By running the instructions stored in the internal memory 121, the processor 110 executes the various functional applications and data processing of the terminal device 100. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and the applications required by at least one function (such as a face recognition function, a fingerprint recognition function, a mobile payment function, and the like). The data storage area can store data created during use of the terminal device 100 (such as face information template data, fingerprint information templates, and the like). In addition, the internal memory 121 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).

The terminal device 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like.

The audio module 170 is used to convert digital audio information into an analog audio signal for output, and also to convert an analog audio input into a digital audio signal.

The speaker 170A, also called a "horn", is used to convert audio electrical signals into sound signals.

The receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals.

The microphone 170C, also called a "mike" or "mic", is used to convert sound signals into electrical signals.

The headset jack 170D is used to connect wired headsets. The headset jack 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 180A is used to sense pressure signals and can convert pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. There are many types of pressure sensors 180A, such as resistive, inductive, and capacitive pressure sensors.

The gyroscope sensor 180B can be used to determine the motion posture of the terminal device 100. In some embodiments, the angular velocities of the terminal device 100 around three axes (namely the x, y, and z axes) can be determined through the gyroscope sensor 180B.

The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode. The light-emitting diode may be an infrared light-emitting diode.

The ambient light sensor 180L is used to sense the brightness of the ambient light. The terminal device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking photos.

The fingerprint sensor 180H is used to collect fingerprints. The terminal device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, application-lock access, fingerprint photographing, fingerprint call answering, and so on. The fingerprint sensor 180H may be arranged under the touchscreen; the terminal device 100 can receive the user's touch operation on the area of the touchscreen corresponding to the fingerprint sensor and, in response to the touch operation, collect the fingerprint information of the user's finger, so as to implement the behaviors involved in the embodiments of this application: opening a hidden album after the fingerprint is verified, opening a hidden application after the fingerprint is verified, logging in to an account after the fingerprint is verified, completing a payment after the fingerprint is verified, and so on.

The temperature sensor 180J is used to detect temperature. In some embodiments, the terminal device 100 executes a temperature processing strategy using the temperature detected by the temperature sensor 180J.

The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be provided on the display screen 194; the touch sensor 180K and the display screen 194 form a touchscreen, also called a "touch screen". The touch sensor 180K is used to detect touch operations acting on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be arranged on the surface of the terminal device 100 at a position different from that of the display screen 194.

The button 190 includes a power button, volume buttons, and so on. The button 190 may be a mechanical button or a touch button. The terminal device 100 can receive button input and generate button signal input related to the user settings and function control of the terminal device 100.

The indicator 192 may be an indicator light and can be used to indicate the charging state and changes in battery level, as well as messages, missed calls, notifications, and so on.

The SIM card interface 195 is used to connect a SIM card. A SIM card can be brought into contact with or separated from the terminal device 100 by being inserted into or pulled out of the SIM card interface 195. In some embodiments, the terminal device 100 uses an eSIM, that is, an embedded SIM card. The eSIM card can be embedded in the terminal device 100 and cannot be separated from it.
The software system of the terminal device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of the present invention take an Android system with a layered architecture as an example to describe the software structure of the terminal device 100.
FIG. 13 is a block diagram of the software structure of the terminal device 100 according to an embodiment of this application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 13, the application packages may include applications (also called apps) such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Video, and Messaging.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 13, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.
The window manager is used to manage window programs. The window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, the phone book, and the like.
The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system can be used to build applications. A display interface may be composed of one or more views. For example, a display interface that includes a message notification icon may include a view for displaying text and a view for displaying pictures.
The phone manager is used to provide the communication functions of the terminal device 100, for example, management of the call status (including connecting, hanging up, and so on).
The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.
The notification manager enables an application to display notification information in the status bar. It can be used to convey notification-type messages and can disappear automatically after a short stay, without user interaction. For example, the notification manager is used to notify of a completed download, a message reminder, and so on. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, such as notifications from applications running in the background, or notifications that appear on the screen in the form of a dialog interface, for example, text prompted in the status bar, a prompt tone, vibration of the terminal device, or a blinking indicator light.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the function library that the Java language needs to call, and the other is the Android core libraries.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine performs functions such as object life-cycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library (for example, OpenGL ES), and a 2D graphics engine (for example, SGL).
The surface manager is used to manage the display subsystem and provides the blending of 2D and 3D layers for multiple applications.
The media libraries support playback and recording in a variety of commonly used audio and video formats, as well as still-image files. The media libraries can support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, compositing, layer processing, and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
FIG. 14 is a schematic structural diagram in which the apparatus for generating a short video is a server 200.
It should be understood that the server 200 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in the figure may be implemented in hardware, software, or a combination of hardware and software, including one or more signal-processing and/or application-specific integrated circuits.
The server 200 may include a processor 210 and a memory 220, and the processor 210 may be connected to the memory 220 through a bus.
The processor 210 may include one or more processing units. For example, the processor 210 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the server 200. The controller can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and execution.
A memory may also be provided in the processor 210 for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache. This memory can store instructions or data that the processor 210 has just used or uses cyclically. If the processor 210 needs to use the instructions or data again, it can call them directly from this memory. Repeated accesses are thereby avoided and the waiting time of the processor 210 is reduced, which improves the efficiency of the system.
In some embodiments, the processor 210 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, and/or a universal serial bus (USB) interface, etc.
It can be understood that the interface connection relationships between the modules illustrated in the embodiments of this application are merely schematic and do not constitute a structural limitation on the server 200. In other embodiments of this application, the server 200 may also adopt interface connection modes different from those in the foregoing embodiments, or a combination of multiple interface connection modes.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the server 200 performs frequency selection, the digital signal processor is used to perform a Fourier transform on the frequency-point energy.
The video codec is used to compress or decompress digital video. The server 200 may support one or more video codecs. In this way, the server 200 can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, and MPEG4.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example, the transfer mode between neurons in the human brain, it processes input information quickly and can also learn continuously by itself. Applications such as intelligent cognition of the server 200, for example image recognition, face recognition, speech recognition, and text understanding, can be implemented through the NPU.
The memory 220 may be used to store computer-executable program code, where the executable program code includes instructions. The processor 210 executes the various functional applications and data processing of the server 200 by running the instructions stored in the memory 220. The memory 220 may include a program storage area and a data storage area. The program storage area can store an operating system and the applications required by at least one function (such as a face recognition function, a fingerprint recognition function, or a mobile payment function). The data storage area can store data created during the use of the server 200 (such as face information template data and fingerprint information templates). In addition, the memory 220 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS).
Further, the server 200 may also be a virtualized server, that is, multiple logical servers are virtualized on the server 200, and each logical server can rely on the software, hardware, and other components of the server 200 to implement the same data storage and processing functions.
FIG. 15 is a schematic structural diagram of an apparatus 300 for generating a short video in an embodiment of this application. The apparatus 300 may be applied to the aforementioned terminal device 100 or server 200. The apparatus 300 may include:
a video acquisition module 310, configured to acquire a target video;
a video analysis module 320, configured to obtain, through semantic analysis, the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs, where each video segment belongs to one or more semantic categories; and
a short video generation module 330, configured to generate, from the at least one video segment, a short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs.
In a possible implementation scenario, the target video includes m frames of video images, where m is a positive integer, and the video analysis module 320 is specifically configured to:
extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, where n is a positive integer;
convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map; and
determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs (see the illustrative sketch below).
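For illustration only, the flow of the video analysis module 320 might be sketched as follows in Python with NumPy. Everything here is a hypothetical stand-in: the pooling scheme for the multi-layer feature map, the window geometry for the candidate boxes, and the random scorer that replaces the trained classifier the embodiment assumes. The sketch only shows the shape of the computation, from an m*n feature matrix to scored segments with start/end frames and semantic-category probabilities; each candidate box spans all n feature dimensions, so only its temporal extent varies, matching the fixed-width candidate boxes described below.
    import numpy as np

    def build_feature_pyramid(feats, levels=3):
        # Hypothetical multi-layer feature map: temporal average pooling with
        # stride 2 at each level over the m*n per-frame feature matrix.
        pyramid = [feats]
        for _ in range(levels - 1):
            f = pyramid[-1]
            m = f.shape[0] - f.shape[0] % 2          # drop a trailing odd frame
            pyramid.append(f[:m].reshape(-1, 2, f.shape[1]).mean(axis=1))
        return pyramid

    def candidate_windows(pyramid, base_len=8):
        # One candidate segment [start, end) per feature point of each level;
        # coarser levels produce temporally longer candidates.
        windows = []
        for level, f in enumerate(pyramid):
            stride = 2 ** level
            for i in range(f.shape[0]):
                windows.append((i * stride, i * stride + base_len * stride))
        return windows

    def detect_segments(feats, num_classes=5, top_k=3, seed=0):
        # Score every candidate window and keep the top_k as (start, end, probs).
        rng = np.random.default_rng(seed)
        scored = []
        for start, end in candidate_windows(build_feature_pyramid(feats)):
            end = min(end, feats.shape[0])
            logits = rng.normal(size=num_classes)            # stand-in classifier
            probs = np.exp(logits) / np.exp(logits).sum()    # softmax over categories
            scored.append((float(probs.max()), start, end, probs))
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(s, e, p) for _, s, e, p in scored[:top_k]]

    m, n = 128, 64                                   # 128 frames, 64-dim features
    for start, end, probs in detect_segments(np.random.rand(m, n)):
        print(f"frames [{start}, {end}) -> category probs {np.round(probs, 2)}")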
In a possible implementation scenario, the probability of the semantic category includes a probability of a behavior category and a probability of a scene category; the target video includes m frames of video images, where m is a positive integer; and the video analysis module 320 is specifically configured to:
extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, where n is a positive integer;
convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map;
determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs; and
identify and output the probability of the scene category of each frame of video image in the target video according to the n-dimensional feature data of each frame.
In a possible implementation scenario, the width of the at least one candidate box remains unchanged.
In a possible implementation scenario, the short video generation module 330 is specifically configured to:
determine the average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment; and
generate, from the at least one video segment, the short video corresponding to the target video according to the average category probability of the at least one video segment.
In a possible implementation, the short video generation module 330 is specifically configured to:
for each video segment, determine, according to the start and end times of the video segment, the multiple frames of video images corresponding to the video segment and their number;
determine the behavior-category probability of the video segment as the behavior-category probability of each frame of video image in the video segment;
acquire the scene-category probability of each frame of video image among the multiple frames; and
divide the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the number of frames, to obtain the average category probability of the video segment (a sketch of this computation follows).
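A minimal sketch of the averaging just described, assuming only that the segment-level behavior probability is shared by every frame, as stated above:
    import numpy as np

    def average_category_probability(behavior_prob, scene_probs):
        # Each frame inherits the segment's behavior-category probability, so
        # summing (behavior + scene) over the frames and dividing by the frame
        # count equals behavior_prob + mean(scene_probs).
        scene_probs = np.asarray(scene_probs, dtype=float)
        per_frame = behavior_prob + scene_probs
        return per_frame.sum() / scene_probs.shape[0]

    print(average_category_probability(0.8, [0.6, 0.7, 0.5]))   # -> 1.4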
In a possible implementation scenario, the video analysis module 320 is specifically configured to:
obtain, through semantic analysis, the start and end times, the semantic category, and the probability of the semantic category of at least one video segment in the target video; and
the short video generation module 330 is specifically configured to:
determine the interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category; and
generate, from the at least one video segment, the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment (see the sketch below).
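The text does not fix how probability and weight are combined; one hedged reading is a weighted sum over the clip's semantic categories, as in this hypothetical sketch (the clip probabilities and weights are invented for illustration):
    def interest_probability(category_probs, category_weights):
        # Weighted combination of the clip's semantic-category probabilities
        # with the user-specific category weights (default weight 1.0).
        return sum(p * category_weights.get(c, 1.0)
                   for c, p in category_probs.items())

    clip = {"birthday": 0.7, "indoor": 0.2}       # hypothetical clip probabilities
    weights = {"birthday": 1.5, "indoor": 0.8}    # hypothetical user weights
    print(interest_probability(clip, weights))    # 0.7*1.5 + 0.2*0.8 = 1.21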
In a possible implementation scenario, the apparatus 300 further includes:
an information acquisition module 340, configured to acquire media data information from a local database and historical operation records; and
a category weight determination module 350, configured to determine, according to the media data information, the category weights corresponding to the various semantic categories of the media data.
In a possible implementation, the category weight determination module 350 is specifically configured to:
determine the semantic categories of the videos and images in the local database, and count the number of occurrences of each semantic category;
determine the semantic categories of the videos and images the user has operated on in the historical operation records, and count the operation duration and operation frequency of each semantic category; and
calculate the category weight corresponding to each semantic category according to its number of occurrences, operation duration, and operation frequency (one possible weighting is sketched below).
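The embodiment names the three statistics but not the formula; one plausible choice, shown below purely as an assumption, normalizes each statistic across categories and blends them with fixed coefficients:
    def category_weights(counts, durations, frequencies,
                         alpha=0.4, beta=0.3, gamma=0.3):
        # alpha/beta/gamma are assumed blending coefficients that sum to 1.
        def normalize(stats):
            total = sum(stats.values()) or 1.0
            return {c: v / total for c, v in stats.items()}
        n_cnt, n_dur, n_frq = map(normalize, (counts, durations, frequencies))
        return {c: alpha * n_cnt[c] + beta * n_dur[c] + gamma * n_frq[c]
                for c in counts}

    print(category_weights(
        counts={"travel": 30, "pets": 10},            # occurrences in the gallery
        durations={"travel": 1200.0, "pets": 300.0},  # seconds viewed, from history
        frequencies={"travel": 15, "pets": 5},        # number of user operations
    ))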
In a possible implementation, the short video generation module 330 is specifically configured to:
determine at least one summary video segment from the at least one video segment in turn, according to the descending order of the interest category probabilities of the at least one video segment and their start and end times; and
acquire the at least one summary video segment and synthesize the short video corresponding to the target video.
Optionally, the sum of the segment durations of the at least one summary video segment is not greater than a preset short video duration (a greedy selection satisfying this constraint is sketched below).
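A greedy selection under the optional duration cap might look as follows; this is a sketch, and the segment times and probabilities are invented:
    def pick_summary_segments(segments, max_duration=15.0):
        # segments: list of (start, end, interest_prob). Keep the highest-interest
        # segments whose total length fits max_duration, then restore timeline
        # order so the summary splices chronologically.
        chosen, used = [], 0.0
        for start, end, prob in sorted(segments, key=lambda s: s[2], reverse=True):
            if used + (end - start) <= max_duration:
                chosen.append((start, end, prob))
                used += end - start
        return sorted(chosen)

    clips = [(0.0, 6.0, 0.9), (10.0, 18.0, 0.7), (20.0, 24.0, 0.8), (30.0, 42.0, 0.6)]
    print(pick_summary_segments(clips))   # keeps the 6 s and 4 s segments, in time order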
In a possible implementation, the short video generation module 330 is specifically configured to:
cut out each video segment from the target video according to its start and end times;
sort and display the video segments according to the descending order of the interest category probabilities of the at least one video segment;
when a selection instruction for any one or more of the video segments is received, determine the selected video segment(s) as summary video segment(s); and
synthesize the short video corresponding to the target video from the at least one summary video segment.
In a possible implementation scenario, the short video generation module 330 is specifically configured to:
perform temporal segmentation on the target video to obtain the start and end times of at least one split segment;
determine at least one overlapping segment between each video segment and each split segment according to the start and end times of the at least one video segment and the start and end times of the at least one split segment; and
generate the short video corresponding to the target video from the at least one overlapping segment (see the sketch below).
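The overlap computation is a plain interval intersection; a minimal sketch, assuming segments are given as (start, end) pairs in seconds:
    def overlapping_segments(semantic_segs, split_segs):
        # Intersect each semantically detected segment with each split segment
        # from temporal segmentation; only the overlaps are kept, so every
        # summary clip stays within a single split segment.
        overlaps = []
        for s_start, s_end in semantic_segs:
            for t_start, t_end in split_segs:
                start, end = max(s_start, t_start), min(s_end, t_end)
                if start < end:
                    overlaps.append((start, end))
        return overlaps

    print(overlapping_segments([(2.0, 9.0)], [(0.0, 5.0), (5.0, 12.0)]))
    # -> [(2.0, 5.0), (5.0, 9.0)]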
A person of ordinary skill in the art can understand that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).
A person of ordinary skill in the art can understand that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only the preferred embodiments of the present invention, which of course cannot be used to limit the scope of rights of the present invention. Therefore, equivalent changes made according to the claims of the present invention still fall within the scope of the present invention.

Claims (25)

  1. A method for generating a short video, comprising:
    acquiring a target video;
    obtaining, through semantic analysis, start and end times of at least one video segment in the target video and a probability of a semantic category to which it belongs, wherein each video segment belongs to one or more semantic categories; and
    generating, from the at least one video segment, a short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs.
  2. The method according to claim 1, wherein the target video comprises m frames of video images, m being a positive integer, and the obtaining, through semantic analysis, of the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs comprises:
    extracting n-dimensional feature data of each frame of video image in the target video, and generating an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    converting the video feature matrix into a multi-layer feature map, and generating at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map; and
    determining at least one continuous semantic feature sequence according to the candidate boxes, and determining the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs.
  3. The method according to claim 1, wherein the probability of the semantic category comprises a probability of a behavior category and a probability of a scene category, the target video comprises m frames of video images, m being a positive integer, and the obtaining, through semantic analysis, of the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs comprises:
    extracting n-dimensional feature data of each frame of video image in the target video, and generating an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    converting the video feature matrix into a multi-layer feature map, and generating at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map;
    determining at least one continuous semantic feature sequence according to the candidate boxes, and determining the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs; and
    identifying and outputting the probability of the scene category of each frame of video image in the target video according to the n-dimensional feature data of each frame.
  4. The method according to claim 3, wherein the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs comprises:
    determining an average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment; and
    generating, from the at least one video segment, the short video corresponding to the target video according to the average category probability of the at least one video segment.
  5. The method according to claim 4, wherein the determining of the average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment comprises:
    for each video segment, determining, according to the start and end times of the video segment, the multiple frames of video images corresponding to the video segment and their number;
    determining the behavior-category probability of the video segment as the behavior-category probability of each frame of video image in the video segment;
    acquiring the scene-category probability of each frame of video image among the multiple frames; and
    dividing the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the number of frames, to obtain the average category probability of the video segment.
  6. The method according to any one of claims 1-5, wherein the obtaining, through semantic analysis, of the start and end times of at least one video segment in the target video and the probability of the semantic category to which it belongs comprises:
    obtaining, through semantic analysis, the start and end times, the semantic category, and the probability of the semantic category of at least one video segment in the target video; and
    the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs comprises:
    determining an interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category; and
    generating, from the at least one video segment, the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment.
  7. The method according to claim 6, wherein before the determining of the interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category, the method further comprises:
    acquiring media data information from a local database and historical operation records; and
    determining, according to the media data information, category weights corresponding to the various semantic categories of the media data.
  8. The method according to claim 7, wherein the determining, according to the media data information, of the category weights corresponding to the various semantic categories of the media data comprises:
    determining the semantic categories of the videos and images in the local database, and counting the number of occurrences of each semantic category;
    determining the semantic categories of the videos and images the user has operated on in the historical operation records, and counting the operation duration and operation frequency of each semantic category; and
    calculating the category weight corresponding to each semantic category according to its number of occurrences, operation duration, and operation frequency.
  9. The method according to any one of claims 6-8, wherein the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment comprises:
    determining at least one summary video segment from the at least one video segment in turn, according to the descending order of the interest category probabilities of the at least one video segment and their start and end times; and
    acquiring the at least one summary video segment and synthesizing the short video corresponding to the target video.
  10. The method according to any one of claims 6-8, wherein the generating, from the at least one video segment, of the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment comprises:
    cutting out each video segment from the target video according to its start and end times;
    sorting and displaying the video segments according to the descending order of the interest category probabilities of the at least one video segment;
    when a selection instruction for any one or more of the video segments is received, determining the selected video segment(s) as summary video segment(s); and
    synthesizing the short video corresponding to the target video from the at least one summary video segment.
  11. The method according to any one of claims 1-10, wherein the generating, from the at least one video segment, of the short video corresponding to the target video comprises:
    performing temporal segmentation on the target video to obtain start and end times of at least one split segment;
    determining at least one overlapping segment between each video segment and each split segment according to the start and end times of the at least one video segment and the start and end times of the at least one split segment; and
    generating the short video corresponding to the target video from the at least one overlapping segment.
  12. An apparatus for generating a short video, comprising:
    a video acquisition module, configured to acquire a target video;
    a video analysis module, configured to obtain, through semantic analysis, start and end times of at least one video segment in the target video and a probability of a semantic category to which it belongs, wherein each video segment belongs to one or more semantic categories; and
    a short video generation module, configured to generate, from the at least one video segment, a short video corresponding to the target video according to the start and end times of the at least one video segment and the probability of the semantic category to which it belongs.
  13. The apparatus according to claim 12, wherein the target video comprises m frames of video images, m being a positive integer, and the video analysis module is specifically configured to:
    extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map; and
    determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the semantic category to which it belongs.
  14. The apparatus according to claim 12, wherein the probability of the semantic category comprises a probability of a behavior category and a probability of a scene category, the target video comprises m frames of video images, m being a positive integer, and the video analysis module is specifically configured to:
    extract n-dimensional feature data of each frame of video image in the target video, and generate an m*n video feature matrix based on the time order of the m frames, n being a positive integer;
    convert the video feature matrix into a multi-layer feature map, and generate at least one corresponding candidate box on the video feature matrix based on each feature point in the multi-layer feature map;
    determine at least one continuous semantic feature sequence according to the candidate boxes, and determine the start and end times of the video segment corresponding to each continuous semantic feature sequence and the probability of the behavior category to which it belongs; and
    identify and output the probability of the scene category of each frame of video image in the target video according to the n-dimensional feature data of each frame.
  15. The apparatus according to claim 14, wherein the short video generation module is specifically configured to:
    determine an average category probability of the at least one video segment according to the start and end times and the behavior-category probability of each video segment and the scene-category probability of each frame of video image in each video segment; and
    generate, from the at least one video segment, the short video corresponding to the target video according to the average category probability of the at least one video segment.
  16. The apparatus according to claim 15, wherein the short video generation module is specifically configured to:
    for each video segment, determine, according to the start and end times of the video segment, the multiple frames of video images corresponding to the video segment and their number;
    determine the behavior-category probability of the video segment as the behavior-category probability of each frame of video image in the video segment;
    acquire the scene-category probability of each frame of video image among the multiple frames; and
    divide the sum, over the multiple frames, of each frame's behavior-category probability and scene-category probability by the number of frames, to obtain the average category probability of the video segment.
  17. The apparatus according to any one of claims 12-16, wherein the video analysis module is specifically configured to:
    obtain, through semantic analysis, the start and end times, the semantic category, and the probability of the semantic category of at least one video segment in the target video; and
    the short video generation module is specifically configured to:
    determine an interest category probability of the at least one video segment according to the probability of the semantic category of each video segment and the category weight corresponding to that semantic category; and
    generate, from the at least one video segment, the short video corresponding to the target video according to the start and end times and the interest category probability of the at least one video segment.
  18. The apparatus according to claim 17, further comprising:
    an information acquisition module, configured to acquire media data information from a local database and historical operation records; and
    a category weight determination module, configured to determine, according to the media data information, category weights corresponding to the various semantic categories of the media data.
  19. The apparatus according to claim 18, wherein the category weight determination module is specifically configured to:
    determine the semantic categories of the videos and images in the local database, and count the number of occurrences of each semantic category;
    determine the semantic categories of the videos and images the user has operated on in the historical operation records, and count the operation duration and operation frequency of each semantic category; and
    calculate the category weight corresponding to each semantic category according to its number of occurrences, operation duration, and operation frequency.
  20. The apparatus according to any one of claims 17-19, wherein the short video generation module is specifically configured to:
    determine at least one summary video segment from the at least one video segment in turn, according to the descending order of the interest category probabilities of the at least one video segment and their start and end times; and
    acquire the at least one summary video segment and synthesize the short video corresponding to the target video.
  21. The apparatus according to any one of claims 17-19, wherein the short video generation module is specifically configured to:
    cut out each video segment from the target video according to its start and end times;
    sort and display the video segments according to the descending order of the interest category probabilities of the at least one video segment;
    when a selection instruction for any one or more of the video segments is received, determine the selected video segment(s) as summary video segment(s); and
    synthesize the short video corresponding to the target video from the at least one summary video segment.
  22. The apparatus according to any one of claims 12-21, wherein the short video generation module is specifically configured to:
    perform temporal segmentation on the target video to obtain start and end times of at least one split segment;
    determine at least one overlapping segment between each video segment and each split segment according to the start and end times of the at least one video segment and the start and end times of the at least one split segment; and
    generate the short video corresponding to the target video from the at least one overlapping segment.
  23. A terminal device, comprising a memory and a processor, wherein the memory is configured to store computer-readable instructions, and the processor is configured to read the computer-readable instructions and implement the method according to any one of claims 1-11.
  24. A server, comprising a memory and a processor, wherein the memory is configured to store computer-readable instructions, and the processor is configured to read the computer-readable instructions and implement the method according to any one of claims 1-11.
  25. A computer storage medium, storing computer-readable instructions which, when executed by a processor, implement the method according to any one of claims 1-11.
PCT/CN2021/070391 2020-03-26 2021-01-06 Method and apparatus for generating short video, and related device and medium WO2021190078A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010223607.1A CN113453040B (en) 2020-03-26 2020-03-26 Short video generation method and device, related equipment and medium
CN202010223607.1 2020-03-26

Publications (1)

Publication Number Publication Date
WO2021190078A1 true WO2021190078A1 (en) 2021-09-30

Family

ID=77807575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070391 WO2021190078A1 (en) 2020-03-26 2021-01-06 Method and apparatus for generating short video, and related device and medium

Country Status (2)

Country Link
CN (1) CN113453040B (en)
WO (1) WO2021190078A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886957A (en) * 2023-09-05 2023-10-13 深圳市蓝鲸智联科技股份有限公司 Method and system for generating vehicle-mounted short video vlog by one key

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050159956A1 (en) * 1999-09-13 2005-07-21 Microsoft Corporation Annotating programs for automatic summary generation
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Football item detecting system with four-layer structure in sports video and realization method thereof
US20140161351A1 (en) * 2006-04-12 2014-06-12 Google Inc. Method and apparatus for automatically summarizing video
CN105138953A (en) * 2015-07-09 2015-12-09 浙江大学 Method for identifying actions in video based on continuous multi-instance learning
CN106572387A (en) * 2016-11-09 2017-04-19 广州视源电子科技股份有限公司 Video sequence alignment method and video sequence alignment system
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 A kind of video actions detection method based on convolutional neural networks
CN108140032A (en) * 2015-10-28 2018-06-08 英特尔公司 Automatic video frequency is summarized
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 A kind of Activity recognition method, apparatus and storage medium
CN110851621A (en) * 2019-10-31 2020-02-28 中国科学院自动化研究所 Method, device and storage medium for predicting video wonderful level based on knowledge graph

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3161791A4 (en) * 2014-06-24 2018-01-03 Sportlogiq Inc. System and method for visual event description and event analysis
US10311913B1 (en) * 2018-02-22 2019-06-04 Adobe Inc. Summarizing video content based on memorability of the video content
CN110798752B (en) * 2018-08-03 2021-10-15 北京京东尚科信息技术有限公司 Method and system for generating video summary
CN110287374B (en) * 2019-06-14 2023-01-03 天津大学 Self-attention video abstraction method based on distribution consistency

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050159956A1 (en) * 1999-09-13 2005-07-21 Microsoft Corporation Annotating programs for automatic summary generation
US20140161351A1 (en) * 2006-04-12 2014-06-12 Google Inc. Method and apparatus for automatically summarizing video
CN102073864A (en) * 2010-12-01 2011-05-25 北京邮电大学 Four-layer football event detection system for sports video and implementation method thereof
CN105138953A (en) * 2015-07-09 2015-12-09 浙江大学 Method for identifying actions in video based on continuous multi-instance learning
CN108140032A (en) * 2015-10-28 2018-06-08 英特尔公司 Automatic video summarization
CN106572387A (en) * 2016-11-09 2017-04-19 广州视源电子科技股份有限公司 Video sequence alignment method and system
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 Video action detection method based on convolutional neural networks
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 Video summarization method and system for home-made videos
CN109697434A (en) * 2019-01-07 2019-04-30 腾讯科技(深圳)有限公司 Behavior recognition method, apparatus and storage medium
CN110851621A (en) * 2019-10-31 2020-02-28 中国科学院自动化研究所 Method, device and storage medium for predicting video highlight level based on knowledge graph

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114390365A (en) * 2022-01-04 2022-04-22 京东科技信息技术有限公司 Method and apparatus for generating video information
CN114390365B (en) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 Method and apparatus for generating video information
CN115119050A (en) * 2022-06-30 2022-09-27 北京奇艺世纪科技有限公司 Video editing method and device, electronic equipment and storage medium
CN115119050B (en) * 2022-06-30 2023-12-15 北京奇艺世纪科技有限公司 Video editing method and device, electronic equipment and storage medium
CN116074642A (en) * 2023-03-28 2023-05-05 石家庄铁道大学 Surveillance video condensation method based on a multi-target processing unit
CN116708945A (en) * 2023-04-12 2023-09-05 北京优贝卡科技有限公司 Media editing method, device, equipment and storage medium
CN116708945B (en) * 2023-04-12 2024-04-16 半月谈新媒体科技有限公司 Media editing method, device, equipment and storage medium
CN116843643A (en) * 2023-07-03 2023-10-03 北京语言大学 Video aesthetic quality evaluation data set construction method
CN116843643B (en) * 2023-07-03 2024-01-16 北京语言大学 Video aesthetic quality evaluation data set construction method

Also Published As

Publication number Publication date
CN113453040B (en) 2023-03-10
CN113453040A (en) 2021-09-28

Similar Documents

Publication Title
WO2021190078A1 (en) Method and apparatus for generating short video, and related device and medium
CN111465918B (en) Method for displaying service information in preview interface and electronic equipment
WO2021052414A1 (en) Slow-motion video filming method and electronic device
CN110377204B (en) Method for generating a user avatar and electronic equipment
WO2022100221A1 (en) Retrieval processing method and apparatus, and storage medium
US20220343648A1 (en) Image selection method and electronic device
US20220116497A1 (en) Image Classification Method and Electronic Device
CN113536866A (en) Character tracking display method and electronic equipment
WO2020073317A1 (en) File management method and electronic device
WO2024055797A9 (en) Method for capturing images in video, and electronic device
WO2024055797A1 (en) Method for capturing images in video, and electronic device
WO2023160170A1 (en) Photographing method and electronic device
CN115661941B (en) Gesture recognition method and electronic equipment
CN115661912A (en) Image processing method, model training method, electronic device and readable storage medium
CN115086710B (en) Video playing method, terminal equipment, device, system and storage medium
WO2024067442A1 (en) Data management method and related apparatus
CN116828099B (en) Shooting method, medium and electronic equipment
WO2023246666A1 (en) Search method and electronic device
CN114513575B (en) Method for collection processing and related device
WO2024067129A1 (en) System, song list generation method, and electronic device
WO2024082914A1 (en) Video question answering method and electronic device
CN114697525B (en) Method for determining tracking target and electronic equipment
US20240107092A1 (en) Video playing method and apparatus
WO2024087202A1 (en) Search method and apparatus, model training method and apparatus, and storage medium
CN117112087A (en) Method for sorting desktop cards, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 21774416
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: PCT application non-entry in European phase
Ref document number: 21774416
Country of ref document: EP
Kind code of ref document: A1