WO2018177139A1

WO2018177139A1 - Method and apparatus for generating video abstract, server and storage medium

Info

Publication number: WO2018177139A1
Application number: PCT/CN2018/079246
Authority: WO
Inventors: 曾佩玲
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2017-03-28
Filing date: 2018-03-16
Publication date: 2018-10-04
Also published as: CN106888407A; CN106888407B

Abstract

Disclosed are a method and apparatus for generating a video abstract, a server and a storage medium, which are used for automatically generating different video abstracts for different users, increasing the viewing amount of a video, providing effective information for more users, and improving the efficiency of video abstract generation. The method comprises: segmenting a target video into several video frames; according to a user characteristic, determining a corresponding N target frames from the several video frames, N being an integer greater than 1; extracting a subtitle in the N target frames; and generating a target video abstract according to the subtitle.

Description

Video summary generation method, device, server and storage medium

The present application claims priority to Chinese Patent Application No. 200910192629.4, entitled "A Video Abstract Generation Method and Apparatus", filed on March 28, 2017, the entire contents of which are incorporated herein by reference.

Technical field

The embodiments of the present invention relate to the field of computer applications, and in particular, to a video summary generation method, apparatus, server, and storage medium.

Background technique

When a user clicks a URL to enter a video website or opens the application (APP) of a video website, a text description related to each video is displayed. Its main function is to describe the key content of the video in order to attract the user to watch the video. This type of text is called a video summary. The video summary has a significant impact on the number of page views, so how to create a better-performing video summary is a concern for video websites and video producers.

At present, the video summary is manually created, that is, the staff writes a description of the video, and after the completion of the writing, the description is displayed as a video summary on the corresponding website for the user to browse.

Because it is produced manually, a video summary can only be directed at the video itself, and every user sees the same video summary. However, different users have different preferences, and for the same video, different users want to obtain different effective information. Manually produced video summaries are therefore poorly targeted and cannot provide each user with effective information related to the video. In addition, a serialized TV series may release new episodes every day, and updating the video summary to follow the plot of each episode requires a great deal of manpower.

Summary of the invention

The embodiments of the invention provide a method, an apparatus, a server, and a storage medium for generating a video summary, which are used to automatically generate different video summaries for different users, increase the number of video views, provide effective information for more users, and improve the efficiency of video summary generation.

In view of this, an aspect of the embodiments of the present invention provides a video summary generating method, which is used in a server, where the method includes:

Segmenting the target video into a number of video frames;

Determining N target frames corresponding to the user from the plurality of video frames according to user characteristics, where N is an integer greater than 1;

Extracting subtitles in the N target frames;

Generating a target video summary based on the subtitles.

An aspect of an embodiment of the present invention provides a video summary generating apparatus, where the apparatus includes:

a segmentation module, configured to divide the target video into a plurality of video frames;

a first determining module, configured to determine, according to user characteristics, N target frames corresponding to the user from the plurality of video frames, where N is an integer greater than 1;

an extracting module, configured to extract subtitles in the N target frames;

and a generating module, configured to generate a target video summary according to the subtitles.

An aspect of an embodiment of the present invention provides a server, where the server includes:

One or more processors; and

A memory;

The memory stores one or more programs, the one or more programs being configured to be executed by the one or more processors, the one or more programs including instructions for performing the following operations:

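Segmenting the target video into a number of video frames;

Determining N target frames corresponding to the user from the plurality of video frames according to user characteristics, where N is an integer greater than 1;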

Extracting subtitles in the N target frames;

Generating a target video summary based on the subtitles.

An aspect of an embodiment of the present invention provides a computer readable storage medium, where the storage medium stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the video summary generation method described above.

It can be seen from the above technical solutions that the embodiments of the present invention have the following advantages:

The embodiment of the present invention may divide the target video into a plurality of video frames, determine N target frames corresponding to the user according to the user characteristics, extract subtitles in the N target frames, and generate a target video summary for the user according to the extracted subtitles. It can be seen that the solution can automatically generate a video summary and can display different video summaries to different users according to user characteristics, which is more targeted, can increase the number of video views, provides effective information for more users, and improves the efficiency of video summary generation.

Brief description of the drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings which are used in the description of the embodiments will be briefly described. It is obvious that the drawings in the following description are only some embodiments of the present invention.

FIG. 1 is a schematic diagram of an embodiment of a video summary generating system in an embodiment of the present invention;

FIG. 2 is a flowchart of an embodiment of a video summary generating method in an embodiment of the present invention;

FIG. 3 is a flowchart of another embodiment of a video summary generating method in an embodiment of the present invention;

FIG. 4 is a schematic diagram of an embodiment of a video summary generating apparatus in an embodiment of the present invention;

FIG. 5 is a schematic diagram of another embodiment of a video summary generating apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of another embodiment of a video summary generating apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of another embodiment of a video summary generating apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of another embodiment of a video summary generating apparatus according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of another embodiment of a video summary generating apparatus according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of another embodiment of a video summary generating apparatus according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all embodiments.

The terms "first", "second", "third", "fourth", and the like (if present) in the specification, the claims, and the above drawings of the embodiments of the invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the invention described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to such a process, method, product, or device.

The embodiments of the invention provide a method, an apparatus, a server, and a storage medium for generating a video summary, which are used to automatically generate different video summaries for each user, increase the number of video views, provide effective information for more users, and improve the efficiency of video summary generation.

To facilitate understanding of the embodiments of the present invention, the scenarios to which the embodiments apply are briefly described below. FIG. 1 is a schematic diagram of the composition of a system to which the video summary generation method, apparatus, server, and storage medium are applicable.

As shown in FIG. 1, the system may include a service system composed of at least one server 101, and a plurality of terminals 102. The server 101 in the service system may store data for generating a video summary, and transmit the generated video summary to the terminal 102. The terminal 102 can be configured to upload the target video data that needs to generate a video summary to the server 101, and display the video summary returned by the server 101. It should be understood that the terminal 102 is not limited to the personal computer (PC, Personal Computer) shown in FIG. 1 , and may be another device capable of acquiring and displaying a video summary, such as a mobile phone or a tablet computer.

For example, a user can upload the target video to the server 101 through the terminal 102. The server 101 generates, for each user, a video summary corresponding to that user by using the video summary generating method in the embodiment of the present invention, and returns to each terminal 102 the video summary matching the user of that terminal. The terminal 102 then presents the video summary returned by the server to the user.

It should be understood that the video summary generating method, apparatus, server, and storage medium in the embodiments of the present invention are also applicable to other scenarios, which is not limited herein. To facilitate understanding of the embodiments of the present invention, some terms used in the embodiments are introduced below:

A video frame is a single image, the smallest unit of an image animation. One frame is one still picture, and consecutive frames form an image animation, such as a television picture. Displaying the frames in rapid succession creates the illusion of motion.

Key frame: to represent the movement or change of any image animation, at least two different key states, one before and one after, must be given, and the changes and transitions of the intermediate states between the two key states can be completed automatically by the computer. In Flash, a frame representing a key state is called a key frame.

Lens data (that is, shot data) refers to a piece of video data captured continuously by the camera in a single take. It is the basic physical unit of video structuring.

K-means clustering is a typical distance-based clustering algorithm. Distance is used as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm treats a cluster as a group of objects that lie close together, and its ultimate goal is to obtain compact and well-separated clusters. The algorithm takes as input the number of clusters k and a database containing n data objects, and outputs k clusters that satisfy the minimum-variance criterion. The k clusters have the following characteristics: each cluster is as compact as possible, and the clusters are separated from one another as much as possible. The process is as follows: first, k objects are arbitrarily selected from the n data objects as the initial cluster centers; each remaining object is then assigned to the cluster (represented by its cluster center) to which it is most similar, according to its similarity (distance) to the cluster centers, which yields the new clusters; the cluster center of each new cluster (the mean of all objects in the cluster) is then calculated; this process is repeated until the standard measure function converges. The mean square error is generally used as the standard measure function.
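For illustration only, the clustering process described above can be sketched in Python as follows. The function name `kmeans`, the fixed random seed, and the stopping test (the cluster assignments no longer change) are assumptions made for this sketch and are not taken from the embodiments.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # arbitrarily chosen initial cluster centers
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the squared error has converged
        assignment = new_assignment
        # Recompute each center as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns two centers and, for each point, the index of the cluster it belongs to.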

It should be understood that the video summary generation method, apparatus, server, and storage medium in the embodiments of the present invention are applicable to the video summary production mentioned above, and can also be applied to producing other video-related introductory text, such as the text portion of a movie poster, which is not limited here.

Based on the foregoing background, the video summary generation method in the embodiment of the present invention is first introduced. Referring to FIG. 2, an embodiment of the video summary generation method in the embodiment of the present invention includes:

201. Split the target video into several video frames.

When a user needs to create a video summary of the target video, the target video is first input to the video summary generating device, which acquires the target video and divides it into a plurality of video frames. The video summary generating device may be located in the server 101 shown in FIG. 1. The target video may be one or more video sequences, such as a movie, several episodes of a TV series, or other videos, which is not limited herein.

202. Determine, according to user characteristics, N target frames corresponding to the user from the plurality of video frames.

After dividing the target video into a plurality of video frames, the video summary generating device determines N target frames corresponding to the user according to the user characteristics. The target frames are selected from the video frames of the target video; that is, the video summary generating device selects the N target frames corresponding to the user from the plurality of video frames according to the user characteristics. The number N of target frames is an integer greater than 1, and the value of N can be set by the user or by the system, which is not limited herein.

203. Extract subtitles in the N target frames.

After determining the N target frames corresponding to the user, the video summary generating device extracts the subtitles in the N target frames corresponding to the user. It should be understood that subtitles are the non-image content of film and television works, such as the dialogue and actions in TV dramas and movies, displayed in the form of text, and also refer generally to text added in the post-production of film and television works. In addition to text, the subtitles may also include symbols, emoticons, and the like, which is not limited herein.

204. Generate a target video summary according to the extracted subtitles.

After the video summary generating device extracts the subtitles in the N target frames, the target video digest is generated based on the extracted subtitles. It should be understood that the target video summary refers to a video summary of the target video for describing the content of the target video to the user. It should be understood that the target video summary generated from the subtitles should conform to the requirements of natural language and consist of one or more complete sentences.

It should be noted that the embodiment of the present invention is described by taking a video summary for one user as an example. When it is required to generate a video summary for multiple users, steps 201-204 may be performed for each user.

Regardless of which user the video summary is generated for, the first step is to split the target video into several video frames, and the video frames obtained are the same for different users, so the divided video frames can be reused. That is, when a video summary is generated for the first user, steps 201-204 need to be performed; when a video summary is generated for subsequent users, the video frames that have already been divided can be read, and then steps 202-204 are performed.
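The reuse of divided video frames described above can be sketched as a simple cache. This is a minimal sketch: `split_into_frames` is a hypothetical placeholder for step 201 that returns made-up frame names, and the dictionary-based cache is an assumption, not the implementation of the embodiments.

```python
_frame_cache = {}   # video id -> frames already produced by step 201
split_calls = []    # records every real run of step 201, for illustration


def split_into_frames(video_id):
    """Hypothetical placeholder for step 201; returns made-up frame names."""
    split_calls.append(video_id)
    return [f"{video_id}-frame-{i}" for i in range(5)]


def frames_for(video_id):
    """Return the frames of a video, splitting it only on the first request."""
    if video_id not in _frame_cache:
        _frame_cache[video_id] = split_into_frames(video_id)
    return _frame_cache[video_id]


frames_first_user = frames_for("episode-1")  # runs step 201
frames_next_user = frames_for("episode-1")   # reads the frames already divided
```

With this arrangement only steps 202-204 remain to be run per user once a video has been divided.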

The embodiment of the present invention may divide the target video into a plurality of video frames, determine N target frames corresponding to each user according to the user characteristics, extract subtitles in the N target frames, and generate a target video summary for the user according to the extracted subtitles. It can be seen that the solution can automatically generate a video summary and can display different video summaries to different users according to user characteristics, which is more targeted, can increase the number of video views, provides effective information for more users, and improves the efficiency of video summary generation.

Based on the embodiment corresponding to FIG. 2, the target video can be divided into video frames in a plurality of manners, and the manner of determining the target frames differs with the manner of division. One of these manners is taken as an example below to describe the video summary generation method in the embodiment of the present invention in detail. Referring to FIG. 3, another embodiment of the video summary generation method in the embodiment of the present invention includes:

301. Divide the target video into a plurality of lens data;

When a user needs to create a video summary of the target video, the target video is first input to the video summary generating device, which acquires the target video and divides it into a plurality of lens data, for example, according to the distance in color space or according to other parameters, which is not limited herein. The video summary generating device may be located in the server 101 shown in FIG. 1. The target video may be one or more video sequences, such as a movie, several episodes of a TV series, or other videos, which is not limited herein.
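As one hedged illustration of dividing a video by color space distance, the sketch below starts a new shot whenever the color histograms of two neighboring frames differ by more than a threshold. Representing each frame by a normalized histogram, using the L1 distance, and the name `split_into_shots` are all assumptions of this sketch; the embodiments do not specify the distance measure.

```python
def split_into_shots(histograms, threshold):
    """Group consecutive frame indices into shots.

    A new shot starts whenever the L1 distance between the color
    histograms of neighboring frames exceeds `threshold`.
    """
    shots = []
    current = []
    for index, hist in enumerate(histograms):
        if current:
            previous = histograms[index - 1]
            distance = sum(abs(a - b) for a, b in zip(hist, previous))
            if distance > threshold:
                shots.append(current)  # close the shot at the color jump
                current = []
        current.append(index)
    if current:
        shots.append(current)
    return shots
```

For example, four frames whose histograms jump sharply between the second and third frame are grouped into two shots of two frames each.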

302. Divide each lens data into a plurality of sub-shot data;

After the target video is segmented into a plurality of lens data, each lens data is further divided into a plurality of sub-shot data, for example, according to the camera motion direction or other parameters, which is not limited herein.

303. Divide each sub-lens data into a plurality of video frames.

After dividing each lens data into a plurality of sub-shot data, the video summary generating device further divides each sub-shot data into a plurality of video frames.

304. Determine, according to user characteristics, L sub-shot data corresponding to the user from the plurality of video frames;

After dividing each lens data into a plurality of sub-shot data, the video summary generating device determines L sub-shot data corresponding to the user according to the user characteristics; that is, the video summary generating device selects, according to the user characteristics, the L sub-shot data corresponding to the user. L is an integer equal to or greater than 1.

In this embodiment, the video summary generating device may determine, among the sub-shot data of the target video, the target sub-shot data that contain the tag information corresponding to the user, and then determine, among the target sub-shot data, the L sub-shot data whose preset sub-shot weights rank in the top L.

It should be noted that the sub-shot weights in the embodiment of the present invention may be determined after the video summary generating device divides each sub-shot data into a plurality of video frames: the number of video frames included in a sub-shot, which reflects the duration of the sub-shot, is used as the value of the sub-shot weight. Besides the number of video frames, the sub-shot weight may also be determined according to the weights of the video frames included in the sub-shot, or according to other parameters, which is not limited herein.

It should be noted that the tag information corresponding to the user in the embodiment of the present invention may be the name of an actor in the user tags, the name of a director in the user tags, the type of movie in the user tags, or other information in the user tags, which is not limited herein.

It should be understood that if the user has no corresponding tag information, the video summary generating device may directly use the sub-shot data whose sub-shot weights rank in the top L as the L sub-shot data corresponding to the user. These may be determined as follows: the video summary generating device sorts all the sub-shot data in descending order of sub-shot weight, selects the sub-shot data ranked in the top L from the sorted sub-shot data, and uses the selected L sub-shot data as the L sub-shot data corresponding to the user.

If the number M of target sub-shot data is less than L, then after selecting all the target sub-shot data, the video summary generating device further selects the remaining L-M sub-shot data from the sub-shot data of the target video according to the sub-shot weights. That is, if the number M of target sub-shot data containing the tag information corresponding to the user is less than L, the video summary generating device selects all M target sub-shot data, sorts the sub-shot data of the target video that have not been selected in descending order of sub-shot weight, and selects the remaining L-M sub-shot data from the sorted sub-shot data.
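The selection of L sub-shot data, including the case without tag information and the case where fewer than L sub-shot data match, can be sketched as follows. The dictionary keys `id`, `frames`, and `tags`, the function name `select_sub_shots`, and the use of the frame count as the sub-shot weight are assumptions chosen for this sketch.

```python
def select_sub_shots(sub_shots, user_tags, limit):
    """Pick `limit` (L) sub-shot ids for one user.

    `sub_shots` is a list of dicts with hypothetical keys:
    "id", "frames" (list of frames) and "tags" (set of labels).
    `user_tags` is a set of labels; an empty set means no tag information.
    """
    def weight(sub_shot):
        return len(sub_shot["frames"])  # frame count used as the weight

    by_weight = sorted(sub_shots, key=weight, reverse=True)
    # Target sub-shots: those carrying at least one of the user's tags.
    matching = [s for s in by_weight if s["tags"] & user_tags]
    chosen = matching[:limit]
    if len(chosen) < limit:
        # M < L: fill the remaining L - M places from the sub-shots not
        # yet selected, in descending order of weight.
        rest = [s for s in by_weight if s not in chosen]
        chosen += rest[: limit - len(chosen)]
    return [s["id"] for s in chosen]
```

When `user_tags` is empty, no sub-shot matches and the function simply returns the L heaviest sub-shots, which corresponds to the case where the user has no tag information.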

It should be understood that, in addition to the tag information corresponding to the user, the video summary generating device may also determine the target sub-shot data according to information on the videos the user has watched, information on the videos the user has added to favorites, or the keywords the user has searched for, which is not limited herein.

305. Determine, according to preset frame weights, X target frames in each sub-shot data of the L sub-shot data.

After determining the L sub-shot data corresponding to the user, the video summary generating device determines X target frames in each of the L sub-shot data according to the preset frame weights. X is an integer equal to or greater than 1, and X multiplied by L equals N.

It should be understood that the frame weights are determined after the video summary generating device divides the sub-shot data into several video frames, and may be determined as follows: for each sub-shot data, the video frames in the sub-shot data are divided into K classes by K-means clustering, the video frame closest to the cluster center in each class is determined as the key frame of that class, and the frame weight of each key frame is determined according to frame parameters. The frame parameters include the proportion of the frame occupied by faces, the direction of camera movement, the focal length of the camera, whether the camera is shaking, or other parameters.

The sub-shot data referred to here may be all the sub-shot data in the target video, or only the L sub-shot data determined for the user, which is not limited herein.

Correspondingly, after the frame weights are determined in the foregoing manner, the video summary generating device may determine the key frames included in each of the L sub-shot data and, among the key frames included in each sub-shot data, determine the X video frames with the largest frame weights. These X video frames are the X target frames in that sub-shot data.
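A minimal sketch of this key frame and target frame selection for a single sub-shot is given below. It assumes that the frames have already been clustered (for example by K-means), that each frame is described by a feature tuple, and that the face proportion is the frame parameter used as the frame weight; the name `pick_target_frames` and its arguments are hypothetical.

```python
def pick_target_frames(frames, labels, centers, face_ratio, x):
    """Return the indices of the X key frames with the largest frame weight.

    frames:     list of feature tuples, one per video frame of the sub-shot
    labels:     cluster index of each frame (e.g. the output of K-means)
    centers:    list of cluster-center tuples
    face_ratio: per-frame face proportion, used here as the frame weight
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    key_frames = []
    for c, center in enumerate(centers):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            # The frame closest to the cluster center is the key frame.
            key_frames.append(
                min(members, key=lambda i: sq_dist(frames[i], center)))
    # Rank the key frames by frame weight and keep the top X.
    key_frames.sort(key=lambda i: face_ratio[i], reverse=True)
    return key_frames[:x]
```

Running this once for each of the L sub-shot data yields X target frames per sub-shot, and therefore N = X × L target frames in total.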

In addition to the above manner, the video summary generating apparatus may determine the frame weight and the X target frames by other means, which are not limited herein.

306. Extract subtitles in the N target frames.

After determining the target frames corresponding to the user, the video summary generating device extracts the subtitles in the N target frames corresponding to the user. It should be understood that subtitles are the non-image content of film and television works, such as the dialogue and actions in TV dramas and movies, displayed in the form of text, and also refer generally to text added in the post-production of film and television works. In addition to text, the subtitles may also include symbols, emoticons, and the like, which is not limited herein.

The video summary generating device can extract the subtitles as follows:

(1) For each of the N target frames, extract all the subtitles in the target frame, that is, extract all the subtitles in the N target frames.

(2) For each of the N target frames, extract subtitles of a preset length from the target frame. It should be understood that the preset length is set by the user or by the video summary generating device, and may be a limit on the number of characters, the number of sentences, or the number of paragraphs. For example, the preset length may be 30 words, 3 sentences, 1 paragraph, or another length, which is not limited here.

(3) For each of the N target frames, extract subtitles of a certain length from the beginning and the end of the subtitles of the target frame. It should be understood that the beginning and the end refer to the order in which the subtitles appear in the target frame, and the length is a preset length similar to the preset length described above, which is not described again here. For ease of understanding, an example is given: for each target frame, the first three sentences and the last three sentences of the subtitles of the target frame are extracted. It should be understood that the above is only an example and does not constitute a limitation on the embodiments of the present invention.

It should also be understood that, in addition to the above manners, the subtitles in the target frame may be extracted by other means, which is not limited herein.
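The three extraction manners listed above can be sketched in one function. The sketch assumes that the subtitles of a target frame are already available as a list of sentences and that the preset length is counted in sentences; the function name `extract_subtitles` and the mode names are hypothetical.

```python
def extract_subtitles(sentences, mode="all", length=3):
    """Extract subtitle sentences of one target frame.

    mode "all":   every sentence, as in manner (1)
    mode "head":  the first `length` sentences, as in manner (2)
    mode "edges": the first and the last `length` sentences, as in manner (3)
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if mode == "all":
        return list(sentences)
    if mode == "head":
        return list(sentences[:length])
    if mode == "edges":
        if len(sentences) <= 2 * length:
            return list(sentences)  # head and tail would overlap
        return list(sentences[:length]) + list(sentences[-length:])
    raise ValueError(f"unknown mode: {mode}")
```

For a frame with eight subtitle sentences, the `"edges"` mode with a length of 3 keeps the first three and the last three sentences and drops the two in the middle.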

307. Generate a target video summary according to the subtitle.

In this embodiment, the video summary generating apparatus may generate the target video summary by:

A plurality of keywords are extracted from the subtitles, the extracted keywords are combined to generate at least one sentence, and the resulting one or more sentences constitute the target video summary corresponding to the user. It should be understood that a keyword may be a word whose frequency of occurrence in the subtitles is greater than a preset value, a word in the subtitles whose part of speech is of a preset type, a word in the subtitles that matches a preset word, or a word determined by other means, which is not limited here. It should be understood that each sentence generated by the combination should satisfy the requirements of natural language and should be a complete sentence.
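Of the keyword criteria mentioned above, the frequency-based one is the simplest to illustrate. The sketch below counts lowercase English words with the standard library; the tokenization rule, the `stop_words` parameter, and the name `frequent_keywords` are assumptions of this sketch, and combining the keywords into natural-language sentences is not covered.

```python
import re
from collections import Counter


def frequent_keywords(subtitles, threshold=1, stop_words=frozenset()):
    """Return the words that occur more than `threshold` times."""
    words = []
    for line in subtitles:
        # Simple tokenization: runs of letters and apostrophes, lowercased.
        tokens = re.findall(r"[a-z']+", line.lower())
        words += [w for w in tokens if w not in stop_words]
    counts = Counter(words)
    # most_common() lists the words from most to least frequent.
    return [word for word, n in counts.most_common() if n > threshold]
```

The other two criteria (part of speech, matching a preset word list) would need a tagger or a dictionary and are left out here.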

The video summary generating device may also generate a target video summary corresponding to the user by other means, which is not limited herein.

It should be noted that the embodiment of the present invention is described by taking a video summary for one user as an example. When it is required to generate a video summary for multiple users, steps 301-307 may be performed for each user.

Regardless of which user the video summary is generated for, the first three steps divide the target video into several video frames, and the video frames obtained are the same for different users, so the divided video frames can be reused. That is, when a video summary is generated for the first user, steps 301-307 need to be performed; when a video summary is generated for subsequent users, the video frames that have already been divided can be read, and then steps 304-307 are performed.

It should also be understood that, in the embodiment of the present invention, after generating a video summary for each user, the video summary generating device may further update the video summary according to a preset rule. The preset rule is a preset update rule. It may be a time period, that is, the video summary is updated periodically, such as once a week or once a month; it may be a trigger condition, such as updating the video summary each time a new episode of a TV series is released; or it may be another rule, which is not limited here.
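The two kinds of preset update rule mentioned above (a time period and a trigger condition) can be combined in a small check. This is only a sketch; the function name `needs_update`, the default period of one week, and the `new_episode` flag are assumptions.

```python
from datetime import datetime, timedelta


def needs_update(last_update, now, period=timedelta(weeks=1), new_episode=False):
    """Return True when the video summary should be regenerated.

    The summary is updated either when a new episode has been released
    (trigger condition) or when `period` has elapsed since the last update.
    """
    return new_episode or (now - last_update) >= period
```

A scheduler could call this check for each stored summary and rerun the per-user steps only for those that return `True`.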

Secondly, the embodiment of the invention provides a method for dividing a target video into a plurality of video frames, which improves the achievability of the solution.

The embodiment of the present invention provides a plurality of manners for determining a target frame, and various manners of extracting subtitles and generating a digest, thereby improving the flexibility of the solution.

Further, the embodiment of the present invention may update the video summary to further improve the timeliness of the video summary.

For ease of understanding, the video summary generation method in the embodiment of the present invention is described in detail in an application scenario:

The system receives as input two videos (the target video), namely the first and second episodes of the TV series "Small Divor". The video summary generating device divides the two videos into 6 lens data according to the color space distance, then divides the 6 lens data into 24 sub-shot data according to the moving direction of the camera, and then divides the 24 sub-shot data into 100 video frames.

After dividing the target video into 100 video frames, the video summary generating device uses the number of video frames included in each sub-shot data as the weight of that sub-shot data. At the same time, for each sub-shot data, the video summary generating device divides the video frames in the sub-shot data into three classes by K-means clustering and determines the video frame closest to the cluster center in each class as a key frame, so that each sub-shot data corresponds to three key frames. The frame weight of each key frame is then determined according to the proportion of the image of the key frame occupied by faces.

There are two users, A and B. User A has not set any label information, and the label information corresponding to user B is Haiqing. For user A, the video summary generating device determines, among the 24 sub-shot data in the target video, the 3 (L=3) sub-shot data with the largest numbers of video frames, that is, the sub-shot data whose sub-shot weights rank in the top 3, as the sub-shot data corresponding to user A, denoted a, b, and c. For user B, the video summary generating device determines the sub-shot data in the target video that contain Haiqing, finding 15 such sub-shot data (the target sub-shot data), and then selects from these 15 sub-shot data the 3 (L=3) with the largest numbers of video frames, that is, the sub-shot data whose sub-shot weights rank in the top 3 among the 15, denoted b, c, and d.

After determining the three pieces of sub-shot data (a, b, c) corresponding to A, the video summary generating device determines the key frames in a, b, and c. Then, according to the frame weights of those key frames, it selects the one key frame (X=1) with the largest frame weight from the three key frames of a, denoted a1; the one key frame with the largest frame weight from the three key frames of b, denoted b1; and the one key frame with the largest frame weight from the three key frames of c, denoted c1. The frames a1, b1, and c1 are taken as the target frames corresponding to A.
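The target-frame selection can be illustrated with a short sketch. It assumes that the frame weight is the fraction of the image area covered by faces and uses made-up face areas; the function names and the data layout are hypothetical. With X=1 and L=3, the result contains N = X × L = 3 target frames.

```python
def frame_weight(face_area, image_area):
    """Frame weight taken as the fraction of the image occupied by faces (assumed)."""
    return face_area / image_area

def select_target_frames(face_areas_by_sub_shot, image_area, X=1):
    """From each sub-shot, keep the X key frames with the largest frame weight."""
    targets = []
    for key_frames in face_areas_by_sub_shot.values():
        weights = {name: frame_weight(area, image_area)
                   for name, area in key_frames.items()}
        ranked = sorted(weights, key=weights.get, reverse=True)
        targets.extend(ranked[:X])
    return targets

IMAGE_AREA = 1920 * 1080  # hypothetical frame size in pixels

# Hypothetical face areas (in pixels) of the three key frames of each sub-shot.
face_areas = {
    "a": {"a2": 200_000, "a1": 600_000, "a3": 50_000},
    "b": {"b1": 450_000, "b2": 300_000, "b3": 0},
    "c": {"c3": 10_000, "c2": 90_000, "c1": 120_000},
}

targets = select_target_frames(face_areas, IMAGE_AREA, X=1)
print(targets)  # ['a1', 'b1', 'c1']
```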

After determining the three target frames corresponding to A, the video summary generating device extracts all the subtitles in the three target frames. The subtitles corresponding to a1 are: "Dad, I failed my English test." "Flowering, how could you fail? Aren't you always good at English?" "If Mom finds out, she will scold me. Can you go to the parent meeting on Sunday?" "All right, I will go to the parent meeting on Sunday." The subtitle corresponding to b1 is: "You failed English and hid it from your mother. You don't take your mother seriously." The subtitles corresponding to c1 are: "Flowering, how can you bring a dog home without my consent? We can't keep a dog at home." "I have always wanted a dog, and you promised me."

According to the subtitles corresponding to a1, b1, and c1, the video summary generating device extracts the keywords "Flowering", "English score", "fail", "Dad", "go to the parent meeting", "want to raise a dog", "without consent", and "hiding it from Mom", and then combines these keywords to generate the sentences "Flowering failed English, and Dad went to the parent meeting without telling Mom. Flowering wants to raise a dog." These sentences are the video summary corresponding to A.

After determining the three pieces of sub-shot data (b, c, d) corresponding to B, the video summary generating device determines the key frames in b, c, and d. Then, according to the frame weights of those key frames, it selects the one key frame with the largest frame weight from the three key frames of b, denoted b1; the one key frame with the largest frame weight from the three key frames of c, denoted c1; and the one key frame with the largest frame weight from the three key frames of d, denoted d1. The frames b1, c1, and d1 are taken as the target frames corresponding to B. After determining the three target frames corresponding to B, the video summary generating device extracts all the subtitles in the three target frames. The subtitles corresponding to b1 and c1 are as described above, and the subtitle corresponding to d1 is: "Flowering, Mom has hired an English tutor for you. You have to cooperate with the teacher to improve your English score."

According to the subtitles corresponding to b1, c1, and d1, the video summary generating device extracts the keywords "Flowering", "English score", "fail", "Dad", "go to the parent meeting", "hiding it from Mom", "hired an English tutor", and "improve", and then combines these keywords to generate the sentences "Flowering failed English, and Dad went to the parent meeting without telling Mom. Mom hired an English tutor to improve the English score." These sentences are the video summary corresponding to B.

In addition, the video summary generating device presets an update rule: for a TV series, the video summary is updated each time two new episodes are released. A week later, two new episodes of "Small Separation" are released. The system receives the third and fourth episodes of "Small Separation", and the video summary generating device updates the video summary corresponding to each user according to the newly input videos of Episode 3 and Episode 4.
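The update rule in this example can be expressed as a simple check. The function name, its parameters, and the idea of counting summarized versus available episodes are assumptions made for illustration; the embodiment only states that the summary is updated every two new episodes.

```python
def needs_update(episodes_summarized, episodes_available, every=2):
    """Return True once at least `every` new episodes have arrived since the last summary."""
    return episodes_available - episodes_summarized >= every

print(needs_update(2, 3))  # False: only one new episode so far
print(needs_update(2, 4))  # True: Episodes 3 and 4 are both available
```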

The video summary generation method in the embodiments of the present invention is described above; the video summary generating apparatus in the embodiments of the present invention is described below. Referring to FIG. 4, an embodiment of the video summary generating apparatus in the embodiments of the present invention includes:

a segmentation module 401, configured to divide the target video into a plurality of video frames;

a first determining module 402, configured to determine, according to a user feature, N target frames corresponding to the user from the plurality of video frames, where N is an integer greater than 1;

an extracting module 403, configured to extract subtitles in the N target frames; and

a generating module 404, configured to generate a target video summary according to the subtitles extracted by the extracting module 403.
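One way to picture how the four modules cooperate is as a pipeline of interchangeable functions. The sketch below is only an illustration: the class name, the field names, and the stub implementations are hypothetical, and each stub stands in for the far richer behavior of modules 401 to 404.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VideoSummaryGenerator:
    segment: Callable[[str], List[dict]]                      # role of segmentation module 401
    select_targets: Callable[[List[dict], dict], List[dict]]  # role of first determining module 402
    extract_subtitle: Callable[[dict], str]                   # role of extracting module 403
    generate: Callable[[List[str]], str]                      # role of generating module 404

    def run(self, video: str, user: dict) -> str:
        frames = self.segment(video)
        targets = self.select_targets(frames, user)
        subtitles = [self.extract_subtitle(frame) for frame in targets]
        return self.generate(subtitles)

# Stub implementations with made-up data, for demonstration only.
generator = VideoSummaryGenerator(
    segment=lambda video: [{"subtitle": f"line {i}", "weight": i} for i in range(5)],
    select_targets=lambda frames, user: sorted(
        frames, key=lambda f: f["weight"], reverse=True)[:user["N"]],
    extract_subtitle=lambda frame: frame["subtitle"],
    generate=lambda subtitles: " ".join(subtitles),
)

summary = generator.run("episode1.mp4", {"N": 2})
print(summary)  # line 4 line 3
```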

Based on the embodiment corresponding to FIG. 4, referring to FIG. 5, in another embodiment of the video summary generating apparatus provided by the embodiment of the present invention, the generating module 404 includes:

a first extracting unit 4041, configured to extract a plurality of keywords from the subtitles; and

a generating unit 4042, configured to combine the plurality of keywords to generate at least one sentence, and to use the at least one sentence as the target video summary.
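The embodiment does not specify how keywords are extracted or combined. Purely as an illustration, the sketch below swaps in a simple frequency-based technique: it counts the words that are not in a small, assumed stop-word list and joins the most frequent ones. The stop-word list, the function names, and the naive joining step are all assumptions; a practical system would need a proper language-generation step to produce fluent sentences.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "my", "i", "is", "and", "you", "on", "will"}  # assumed list

def extract_keywords(subtitles, top_n=2):
    """Return the top_n most frequent non-stop-words across all subtitles."""
    words = re.findall(r"[a-z]+", " ".join(subtitles).lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def combine_keywords(keywords):
    """Naively join keywords into one sentence-like string (placeholder for real generation)."""
    return " ".join(keywords).capitalize() + "."

# Hypothetical subtitles, chosen so that the two top counts are unambiguous.
subtitles = ["English test failed", "English score failed again", "English tutor"]
keywords = extract_keywords(subtitles)
print(keywords)                    # ['english', 'failed']
print(combine_keywords(keywords))  # English failed.
```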

Optionally, in the embodiment of the present invention, the extracting module 403 may include:

a second extracting unit 4031, configured to extract, for each of the N target frames, all the subtitles in the target frame;

or,

a third extracting unit 4032, configured to extract, for each of the N target frames, a preset length of the subtitles in the target frame.
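The two extraction options can be sketched with a single function. Measuring the preset length in characters is an assumption (it could equally be counted in words or in display time), and the function name is hypothetical.

```python
def extract_subtitle(subtitle, preset_length=None):
    """Return the whole subtitle, or only its first preset_length characters."""
    if preset_length is None:
        return subtitle             # option 1: extract all subtitles in the frame
    return subtitle[:preset_length]  # option 2: extract a preset length

line = "Dad, I failed my English test."
print(extract_subtitle(line))      # Dad, I failed my English test.
print(extract_subtitle(line, 3))   # Dad
```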

This embodiment of the present invention provides a specific implementation for generating a video summary, which improves the implementability of the solution.

In addition, this embodiment of the present invention provides multiple ways of extracting subtitles from a target frame, which improves the flexibility of the solution.

Based on the embodiment corresponding to FIG. 4 or FIG. 5, and referring to FIG. 6, in another embodiment of the video summary generating apparatus provided by the embodiments of the present invention, the segmentation module 401 includes:

a first dividing unit 4011, configured to divide the target video into a plurality of pieces of shot data;

a second dividing unit 4012, configured to divide each piece of shot data into a plurality of pieces of sub-shot data; and

a third dividing unit 4013, configured to divide each piece of sub-shot data into a plurality of video frames.

This embodiment of the present invention provides a specific implementation for segmenting the target video, which improves the implementability of the solution.

Based on the embodiment corresponding to FIG. 6 above, referring to FIG. 7, in another embodiment of the video summary generating apparatus provided by the embodiment of the present invention, the first determining module 402 includes:

a first determining unit 4021, configured to determine, according to the user feature, L pieces of sub-shot data corresponding to the user from the plurality of video frames, where L is an integer equal to or greater than 1; and

a second determining unit 4022, configured to determine, according to a preset frame weight, X target frames in each of the L pieces of sub-shot data, where X is an integer equal to or greater than 1, and X multiplied by L equals N.

This embodiment of the present invention provides a specific implementation for determining the target frames, which improves the implementability of the solution.

Based on the embodiment corresponding to FIG. 7 above, referring to FIG. 8, in another embodiment of the video summary generating apparatus provided by the embodiment of the present invention, the first determining unit 4021 includes:

a first determining sub-unit 40211, configured to determine, among the plurality of pieces of sub-shot data corresponding to the target video, target sub-shot data that includes the tag information corresponding to the user; and

a second determining sub-unit 40212, configured to determine, in the target sub-shot data, the L pieces of sub-shot data ranked highest by preset sub-shot weight.

In this embodiment of the present invention, the video summary generating apparatus provides a way of determining the L pieces of sub-shot data corresponding to each user, which improves the implementability of the solution.

Based on the embodiment corresponding to FIG. 7 or FIG. 8 , referring to FIG. 9 , in another embodiment of the video summary generating apparatus provided by the embodiment of the present invention, the video summary generating apparatus further includes:

a classification module 405, configured to divide, for each piece of sub-shot data, the video frames in the sub-shot data into K classes by K-means clustering;

a second determining module 406, configured to determine, in each class of video frames, the video frame closest to the cluster center as the key frame of that class; and

a third determining module 407, configured to determine a frame weight of each key frame according to a frame parameter.

The second determining unit 4022 includes:

a third determining sub-unit 40221, configured to determine, for each of the L pieces of sub-shot data, the X target frames with the largest frame weights among the key frames included in that sub-shot data.

This embodiment of the present invention provides a way of determining the target frames in the L pieces of sub-shot data, which improves the implementability of the solution.

In another embodiment of the video summary generating apparatus provided by the embodiments of the present invention, the video summary generating apparatus may further include:

an update module, configured to update the video summary according to a preset rule.

In this embodiment of the present invention, the video summary generating apparatus may further update the video summary according to the preset rule, which improves the flexibility of the solution.

The video summary generating apparatus in the embodiments of the present invention is described above from the perspective of functional modules; it is described below from the perspective of a hardware entity. Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a video summary generating device 50 according to an embodiment of the present invention. The video summary generating device 50 may include an input device 510, an output device 520, a processor 530, and a memory 540. The output device in this embodiment of the present invention may be a display device.

The memory 540 may include a read-only memory and a random access memory, and provides instructions and data to the processor 530. A portion of the memory 540 may also include a non-volatile random access memory (NVRAM).

Memory 540 stores the following elements, executable modules or data structures, or subsets thereof, or their extended sets:

Operation instructions: include various operation instructions for implementing various operations.

Operating system: Includes a variety of system programs for implementing various basic services and handling hardware-based tasks.

In the embodiment of the present invention, the processor 530 is configured to:

Segmenting the target video into a number of video frames;

Determining N target frames corresponding to the user from a plurality of video frames according to user characteristics, where N is an integer greater than one;

Extracting subtitles in N target frames;

Generating a target video summary according to the subtitles.

In the embodiment of the present invention, the processor 530 is configured to:

Extracting a plurality of keywords in the subtitle;

Combining the plurality of keywords, generating at least one sentence, using the at least one sentence as the target video summary.

In the embodiment of the present invention, the processor 530 is configured to:

Extracting all subtitles in the target frame for each of the N target frames;

or,

For each of the N target frames, a preset length of the subtitle in the target frame is extracted.

In the embodiment of the present invention, the processor 530 is configured to:

Segmenting the target video into a plurality of pieces of shot data;

Dividing each piece of shot data into a plurality of pieces of sub-shot data;

Dividing each piece of sub-shot data into a plurality of video frames.

In the embodiment of the present invention, the processor 530 is configured to:

Determining, according to the user feature, L sub-shot data corresponding to the user from the plurality of video frames, where L is an integer equal to or greater than 1;

Determining X target frames in each of the L sub-shot data according to a preset frame weight, the X being an integer equal to or greater than 1, and the X multiplied by the L being equal to the N.

In the embodiment of the present invention, the processor 530 is configured to:

Determining, among the plurality of pieces of sub-shot data corresponding to the target video, target sub-shot data that includes the tag information corresponding to the user;

Determining, in the target sub-shot data, the L pieces of sub-shot data ranked highest by preset sub-shot weight.

In the embodiment of the present invention, the processor 530 is configured to:

Dividing, for each piece of sub-shot data, the video frames in the sub-shot data into K classes by K-means clustering;

Determining, in each class of video frames, the video frame closest to the cluster center as the key frame of that class;

Determining a frame weight of each key frame according to a frame parameter.

In the embodiment of the present invention, the processor 530 is configured to:

Determining, for each of the L pieces of sub-shot data, the X target frames with the largest frame weights among the key frames included in that sub-shot data.

The processor 530 controls the operation of the video summary generating device 50 and may also be referred to as a central processing unit (CPU). The memory 540 may include a read-only memory and a random access memory, and provides instructions and data to the processor 530. A portion of the memory 540 may also include an NVRAM. In practical applications, the components of the video summary generating device 50 are coupled together by a bus system 550. In addition to a data bus, the bus system 550 may include a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all labeled as the bus system 550 in the figure.

The method disclosed in the foregoing embodiments of the present invention may be applied to the processor 530 or implemented by the processor 530. The processor 530 may be an integrated circuit chip with signal processing capability. In implementation, each step of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 530 or by instructions in the form of software. The processor 530 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or carry out the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 540, and the processor 530 reads the information in the memory 540 and performs the steps of the above method in combination with its hardware.

An embodiment of the present invention provides a computer readable storage medium, where the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the video summary generation method described above.

A person skilled in the art can clearly understand that for the convenience and brevity of the description, the working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

In the several embodiments provided by the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function; in actual implementation there may be other divisions. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be in electrical, mechanical, or other form.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium. Based on such understanding, the part of the technical solution of the embodiments of the present invention that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above embodiments are only used to explain the technical solutions of the present application and are not intended to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements to some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A video summary generation method, applied to a server, the method comprising:

segmenting a target video into a plurality of video frames;

determining, according to a user feature, N target frames corresponding to a user from the plurality of video frames, where N is an integer greater than 1;

extracting subtitles in the N target frames; and

generating a target video summary according to the subtitles.
2. The method according to claim 1, wherein the generating a target video summary according to the subtitles comprises:

extracting a plurality of keywords from the subtitles; and

combining the plurality of keywords to generate at least one sentence, and using the at least one sentence as the target video summary.
3. The method according to claim 1, wherein the extracting subtitles in the N target frames comprises:

extracting, for each of the N target frames, all the subtitles in the target frame;

or,

extracting, for each of the N target frames, a preset length of the subtitles in the target frame.
4. The method according to any one of claims 1 to 3, wherein the segmenting a target video into a plurality of video frames comprises:

segmenting the target video into a plurality of pieces of shot data;

dividing each piece of shot data into a plurality of pieces of sub-shot data; and

dividing each piece of sub-shot data into a plurality of video frames.
5. The method according to claim 4, wherein the determining, according to a user feature, N target frames corresponding to a user from the plurality of video frames comprises:

determining, according to the user feature, L pieces of sub-shot data corresponding to the user from the plurality of video frames, where L is an integer equal to or greater than 1; and

determining, according to a preset frame weight, X target frames in each of the L pieces of sub-shot data, where X is an integer equal to or greater than 1, and X multiplied by L equals N.
6. The method according to claim 5, wherein the determining, according to the user feature, L pieces of sub-shot data corresponding to the user from the plurality of video frames comprises:

determining, among a plurality of pieces of sub-shot data corresponding to the target video, target sub-shot data that includes tag information corresponding to the user; and

determining, in the target sub-shot data, the L pieces of sub-shot data ranked highest by preset sub-shot weight.
7. The method according to claim 5, wherein the method further comprises:

dividing, for each piece of sub-shot data, the video frames in the sub-shot data into K classes by K-means clustering;

determining, in each class of video frames, the video frame closest to the cluster center as the key frame of that class; and

determining a frame weight of each key frame according to a frame parameter;

and wherein the determining, according to a preset frame weight, X target frames in each of the L pieces of sub-shot data comprises:

determining, for each of the L pieces of sub-shot data, the X target frames with the largest frame weights among the key frames included in that sub-shot data.
8. A video summary generating device, the device comprising:

a segmentation module, configured to segment a target video into a plurality of video frames;

a first determining module, configured to determine, according to a user feature, N target frames corresponding to a user from the plurality of video frames, where N is an integer greater than 1;

an extracting module, configured to extract subtitles in the N target frames; and

a generating module, configured to generate a target video summary according to the subtitles.
9. The device according to claim 8, wherein the generating module comprises:

a first extracting unit, configured to extract a plurality of keywords from the subtitles; and

a generating unit, configured to combine the plurality of keywords to generate at least one sentence, and to use the at least one sentence as the target video summary.
10. The device according to claim 8, wherein the extracting module comprises:

a second extracting unit, configured to extract, for each of the N target frames, all the subtitles in the target frame;

or,

a third extracting unit, configured to extract, for each of the N target frames, a preset length of the subtitles in the target frame.
11. The device according to any one of claims 8 to 10, wherein the segmentation module comprises:

a first dividing unit, configured to divide the target video into a plurality of pieces of shot data;

a second dividing unit, configured to divide each piece of shot data into a plurality of pieces of sub-shot data; and

a third dividing unit, configured to divide each piece of sub-shot data into a plurality of video frames.
12. The device according to claim 11, wherein the first determining module comprises:

a first determining unit, configured to determine, according to the user feature, L pieces of sub-shot data corresponding to the user from the plurality of video frames, where L is an integer equal to or greater than 1; and

a second determining unit, configured to determine, according to a preset frame weight, X target frames in each of the L pieces of sub-shot data, where X is an integer equal to or greater than 1, and X multiplied by L equals N.
13. The device according to claim 12, wherein the first determining unit comprises:

a first determining subunit, configured to determine, among a plurality of pieces of sub-shot data corresponding to the target video, target sub-shot data that includes tag information corresponding to the user; and

a second determining subunit, configured to determine, in the target sub-shot data, the L pieces of sub-shot data ranked highest by preset sub-shot weight.
14. The device according to claim 12, wherein the device further comprises:

a classification module, configured to divide, for each piece of sub-shot data, the video frames in the sub-shot data into K classes by K-means clustering;

a second determining module, configured to determine, in each class of video frames, the video frame closest to the cluster center as the key frame of that class; and

a third determining module, configured to determine a frame weight of each key frame according to a frame parameter;

and wherein the second determining unit comprises:

a third determining subunit, configured to determine, for each of the L pieces of sub-shot data, the X target frames with the largest frame weights among the key frames included in that sub-shot data.
15. A server, wherein the server comprises:

one or more processors; and

a memory;

wherein the memory stores one or more programs, the one or more programs being configured to be executed by the one or more processors, and the one or more programs including instructions for performing the following operations:

segmenting a target video into a plurality of video frames;

determining, according to a user feature, N target frames corresponding to a user from the plurality of video frames, where N is an integer greater than 1;

extracting subtitles in the N target frames; and

generating a target video summary according to the subtitles.
16. The server according to claim 15, wherein the one or more programs further comprise instructions for performing the following operations:

extracting a plurality of keywords from the subtitles; and

combining the plurality of keywords to generate at least one sentence, and using the at least one sentence as the target video summary.
17. The server according to claim 15, wherein the one or more programs further comprise instructions for performing the following operations:

extracting, for each of the N target frames, all the subtitles in the target frame;

or,

extracting, for each of the N target frames, a preset length of the subtitles in the target frame.
18. The server according to any one of claims 15 to 17, wherein the one or more programs further comprise instructions for performing the following operations:

segmenting the target video into a plurality of pieces of shot data;

dividing each piece of shot data into a plurality of pieces of sub-shot data; and

dividing each piece of sub-shot data into a plurality of video frames.
19. The server according to claim 18, wherein the one or more programs further comprise instructions for performing the following operations:

determining, according to the user feature, L pieces of sub-shot data corresponding to the user from the plurality of video frames, where L is an integer equal to or greater than 1; and

determining, according to a preset frame weight, X target frames in each of the L pieces of sub-shot data, where X is an integer equal to or greater than 1, and X multiplied by L equals N.
20. The server according to claim 19, wherein the one or more programs further comprise instructions for performing the following operations:

determining, among a plurality of pieces of sub-shot data corresponding to the target video, target sub-shot data that includes tag information corresponding to the user; and

determining, in the target sub-shot data, the L pieces of sub-shot data ranked highest by preset sub-shot weight.
21. The server according to claim 19, wherein the one or more programs further comprise instructions for performing the following operations:

dividing, for each piece of sub-shot data, the video frames in the sub-shot data into K classes by K-means clustering;

determining, in each class of video frames, the video frame closest to the cluster center as the key frame of that class;

determining a frame weight of each key frame according to a frame parameter; and

determining, for each of the L pieces of sub-shot data, the X target frames with the largest frame weights among the key frames included in that sub-shot data.
22. A computer readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the video summary generation method according to any one of claims 1 to 7.