CN113542910A

CN113542910A - Method, device and equipment for generating video abstract and computer readable storage medium

Info

Publication number: CN113542910A
Application number: CN202110712223.0A
Authority: CN
Inventors: 卞东海; 郑烨翰; 彭卫华; 徐伟建
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-06-25
Filing date: 2021-06-25
Publication date: 2021-10-22

Abstract

According to exemplary embodiments of the present disclosure, a method, an apparatus, a device, and a computer-readable storage medium for generating a video summary are provided, which relate to the field of artificial intelligence, and in particular to the fields of knowledge graphs, deep learning, and computer vision. The specific implementation scheme is as follows: acquiring at least one frame of image in a video; determining content information of the at least one frame of image, wherein the content information represents main content contained in the at least one frame of image; performing structuring processing on the content information to generate structured information; and generating a summary of the video based on the structured information. Embodiments of the disclosure can efficiently and automatically generate video summaries for various videos and can save a large amount of manual effort.

Description

Method, device and equipment for generating video abstract and computer readable storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence, and in particular, to a method, apparatus, electronic device, computer-readable storage medium, and computer program product for generating a video summary.

Background

In many instances, it is desirable to describe the content of a video to generate a summary of the video for subsequent retrieval or other applications. Taking a popular variety show video as an example, when the entertainment industry splits, searches, or reuses the video, the main content of the video needs to be known first. As another example, on a video website, the important content of a video needs to be introduced to attract users to click through and watch it, so a video summary is provided to describe the video content. The video summary is important to a video's view count and to its subsequent reuse.

Currently, most videos are watched manually and their summaries are then written by hand, which consumes time and labor.

Disclosure of Invention

The present disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for generating a video summary.

According to a first aspect of the present disclosure, a method of generating a video summary is provided. The method comprises the following steps: acquiring at least one frame of image in a video; determining content information of at least one frame of image, wherein the content information represents main content contained in the at least one frame of image; carrying out structuring processing on the content information to generate structured information; and generating a video summary based on the structured information.

In a second aspect of the present disclosure, an apparatus for generating a video summary is provided. The apparatus includes: an image acquisition module configured to acquire at least one frame of image in a video; a content information determination module configured to determine content information of the at least one frame of image, the content information representing main content contained in the at least one frame of image; a structured information generation module configured to perform structured processing on the content information to generate structured information; and a video summary generation module configured to generate a summary of the video based on the structured information.

In a third aspect of the disclosure, an electronic device is provided that includes one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first aspect of the disclosure.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements a method according to the first aspect of the present disclosure.

In a fifth aspect of the present disclosure, a computer program product is provided, which comprises a computer program that, when executed by a processor, implements the method according to the first aspect of the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:

FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

FIG. 2 shows a flow diagram for generating a video summary according to an embodiment of the present disclosure;

FIG. 3 shows a block diagram of an apparatus for generating a video summary according to an embodiment of the present disclosure;

FIG. 4 shows a block diagram of an apparatus for generating a video summary according to another embodiment of the present disclosure; and

FIG. 5 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In describing embodiments of the present disclosure, the term "include" and its derivatives should be interpreted as being inclusive, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.

As mentioned previously, the content of a video needs to be described to generate a summary of the video for subsequent retrieval or other applications. Currently, most videos are watched manually, and each video is then described manually to generate a video summary. This consumes time and effort. Furthermore, for variety show videos, since many people are involved, how to organize the video content to obtain a proper description of it is also a problem.

Moreover, when different people watch the same video, the information they obtain varies, sometimes greatly, for various reasons. Even if substantially the same information is acquired, the resulting video summaries can still be quite different due to the personal preferences, habits, etc. of different people.

In addition, different users have different preferences, and the information that different users want to obtain from the same video differs. A manually written video summary is poorly targeted and cannot provide each user with the video information relevant to that user.

The above problems make it difficult to efficiently generate an ideal video summary. To solve these problems, embodiments of the present disclosure provide an improved scheme for generating a video summary. In this scheme, a computing device acquires at least one frame of image in a video. The computing device then analyzes the acquired at least one frame of image, and specifically determines content information of the at least one frame of image, the content information representing primary content contained in the at least one frame of image. Thereafter, the computing device performs structured processing on the content information to generate structured information. Generating the structured information facilitates various subsequent processes, such as extraction of the information. The computing device generates a video summary based on the structured information. With this scheme, video summaries can be efficiently and automatically generated for various videos, so that a large amount of manual effort can be saved. Meanwhile, the video summary can be generated in a targeted manner according to the requirements of the user.

In the technical scheme of the disclosure, the acquisition, storage, application, and the like of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good customs.

Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings. Fig. 1 illustrates a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. As shown in FIG. 1, the example environment 100 includes a video 110, a computing device 120, and a video summary 130 generated via the computing device 120.

In some embodiments, video 110 may be any video. For example, the video 110 may be a video stored locally by the computing device or an externally input video, for example, a video downloaded from a network, such as a variety video, and the like. The computing device 120 processes the video 110 to generate a video summary 130.

In some embodiments, the computing device 120 may be equipped to extract pictures from the video. For example, pictures are extracted from the video by a screenshot function provided by video software in the computing device 120. In addition, in some embodiments, the computing device 120 may also use other related software to extract the desired picture directly from the video. In some embodiments, the computing device 120 may also be connected to an external image capture device to capture images (also referred to herein as pictures) in a video. Computing device 120 then analyzes the extracted picture, determines the primary content in the picture, and structures the content information to generate structured information. The required structured information entries may then be retrieved based on predetermined configuration information to generate the video summary 130.

In some embodiments, computing device 120 may include, but is not limited to, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant, PDA, a media player, etc.), a consumer electronics product, a minicomputer, a mainframe computer, a cloud computing resource, and the like.

It should be understood that the description of the structure and functionality of the example environment 100 is for illustrative purposes only and is not intended to limit the scope of the subject matter described herein. The subject matter described herein may be implemented in various structures and/or functions.

The technical solutions described above are only used for illustration and do not limit the invention. It should be appreciated that the example environment 100 may also be implemented in various other ways. To more clearly explain the principles of the disclosed solution, the process of generating a video summary will be described in more detail below with reference to fig. 2.

Fig. 2 shows a flow diagram of a process 200 for generating a video summary according to an embodiment of the present disclosure. In some embodiments, process 200 may be implemented in computing device 120 of FIG. 1. A process 200 for generating a video summary according to an embodiment of the disclosure is now described with reference to fig. 2 in conjunction with fig. 1. For ease of understanding, the specific examples set forth in the following description are intended to be illustrative, and are not intended to limit the scope of the disclosure.

At 202, computing device 120 acquires at least one frame of image in a video. For example, the computing device 120 may extract multiple frames of images (also referred to as pictures) from the video 110, which is stored locally or obtained from outside. As previously mentioned, the computing device 120 may extract images from the video 110 in a variety of ways. Automatically extracting images from the video 110 by the computing device 120 eliminates the need to manually view the video 110 and significantly improves efficiency over manual viewing.

In some embodiments, frame identification (i.e., frame extraction) may be performed on the video 110 at a predetermined period, i.e., frames of the video 110 are sampled to obtain at least one frame of image. The acquired image is used for subsequent operations such as image recognition. In some embodiments, frame extraction is performed at the level of seconds. Different time granularities can be adopted for frame extraction according to actual needs. The higher the frequency of frame extraction, the more images are obtained and the more information can be obtained about the video 110, facilitating a more accurate description of the video 110.
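As a minimal sketch of extracting frames at a predetermined period, the following computes which frame indices to keep. The function name `sample_frame_indices` and its parameters are hypothetical, and the sketch assumes the total frame count and frame rate of the video are already known; actual decoding of the frames is left to whatever video library is in use.

```python
def sample_frame_indices(total_frames: int, fps: float, period_s: float = 1.0) -> list[int]:
    """Return the indices of frames to extract, one every `period_s` seconds."""
    if total_frames <= 0 or fps <= 0 or period_s <= 0:
        return []
    # Number of source frames between two extracted frames (at least 1).
    step = max(1, round(fps * period_s))
    return list(range(0, total_frames, step))


# A 10-second clip at 25 fps, sampled once per second.
print(sample_frame_indices(total_frames=250, fps=25.0, period_s=1.0))
```

A shorter period yields more frames and therefore more information about the video, at the cost of more recognition work downstream.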

Further, in some embodiments, the acquired at least one frame of image is normalized to generate normalized images having the same dimensions. For example, all images may be scaled to a size of 1920x1080. This is merely an example, and the images may of course be sized as desired. Normalizing the images from different videos 110 to the same size facilitates subsequent processing. For example, since all the images have the same size, the same judgment standard can conveniently be adopted in the subsequent image recognition, which improves the recognition accuracy.
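The normalization step can be illustrated with a small, dependency-free sketch. It assumes an image is a list of pixel rows and uses nearest-neighbor sampling, which is an assumption made here for brevity; the source does not say which interpolation method is used, and `resize_nearest` is a hypothetical name.

```python
def resize_nearest(image: list[list[int]], target_w: int, target_h: int) -> list[list[int]]:
    """Scale an image (a list of pixel rows) to target_w x target_h by nearest-neighbor sampling."""
    src_h = len(image)
    src_w = len(image[0])
    return [
        # Map each target pixel back to the source pixel that covers it.
        [image[y * src_h // target_h][x * src_w // target_w] for x in range(target_w)]
        for y in range(target_h)
    ]


small = [[1, 2],
         [3, 4]]
print(resize_nearest(small, target_w=4, target_h=4))
```

In practice the target size would be the common size chosen for all frames, such as the 1920x1080 example given above.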

At 204, the computing device 120 may determine content information for the at least one frame of image, the content information representing primary content contained in the at least one frame of image. The primary content is the portion of the image that is easily noticed by, or otherwise of interest to, the viewer, such as people or objects in a prominent position in the image, people or animals in motion, and captions or subtitles in the image. The computing device 120 may perceive content in the image, resulting in content information, i.e., the various information displayed in the image. For example, objects such as people, houses, and cars present in the image, as well as various actions, scenes, etc., may be perceived by the computing device 120.

In some embodiments, determining the content information of the at least one frame of image comprises: identifying information of an object in at least one frame of image based on a deep learning method; and using the information of the identified object as content information. For example, the computing device 120 may be equipped with a target detection function for sensing, analyzing, and delineating objects present in the picture, for example the position of a person and information about other objects, such as the location of a car, a house, or some other salient object. Identifying, by the computing device 120, information of objects in at least one frame of image provides important information about the video 110 for subsequent generation of a summary of the video 110. In this manner, the need to manually view and obtain information from the video 110 is obviated.

In some embodiments, the computing device 120 is equipped with OCR (Optical Character Recognition) functionality for recognizing textual information in the picture, including titles, subtitles, descriptions of highlights, and the like. For example, variety show videos often take the form of dialogue, and acquiring the subtitle information is key to ensuring that the subsequent description is complete.

In some embodiments, the computing device 120 is provided with celebrity identification functionality, primarily for identifying celebrities present in the picture. Variety show videos usually involve well-known figures, so it is meaningful to identify the celebrities present in the picture and to describe them in the video summary. In some embodiments, for low-confidence recognition results, a further determination of whether to retain them may be made in conjunction with the title of the video. For example, in some embodiments, the confidence level of an identified celebrity may fall below a predetermined threshold for various reasons (e.g., insufficient image sharpness or face angle). Since some information is usually given in the title of the video, the reliability of the low-confidence result can be further assessed in combination with the title to determine whether to retain or discard the recognition result.

In some embodiments, the computing device 120 has a face recognition function, which is mainly used for recognizing face information present in the picture. For a face with low confidence, a further determination may be made by combining other information, for example, the time, frequency, and position information of the face's appearances.
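The title check for low-confidence celebrity results could look roughly like the following. This is only a sketch: `keep_celebrity`, the 0.8 threshold, and the simple substring test against the title are all assumptions made for illustration, not details given in the source.

```python
def keep_celebrity(name: str, confidence: float, video_title: str,
                   threshold: float = 0.8) -> bool:
    """Decide whether to retain a celebrity recognition result."""
    if confidence >= threshold:
        return True
    # Low confidence: retain the result only if the video title corroborates it.
    return name.lower() in video_title.lower()


# Hypothetical names and title, for illustration only.
title = "Alice Example and friends take on the obstacle course"
print(keep_celebrity("Alice Example", 0.55, title))
print(keep_celebrity("Bob Sample", 0.55, title))
```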

In some embodiments, identifying the object in the at least one frame of image includes identifying a face in the image. In the case where the confidence of the recognized face is higher than a threshold, the face is determined to be a face. In a case where the confidence of the recognized face is lower than a threshold, it is determined whether the recognized face is determined to be a face based on other information about the recognized face in the content information. In this way, by combining the confidence of face recognition with other information in the image, the face can be recognized more accurately, reducing the possibility of misjudgment.
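A hedged sketch of this two-stage decision is shown below. The source only says that "other information" such as time, frequency, and position may be consulted; the specific rule used here (a minimum number of appearances and a minimum on-screen duration) and all names and default values are assumptions.

```python
def accept_face(confidence: float, appearances: int, on_screen_s: float,
                threshold: float = 0.9, min_appearances: int = 3,
                min_on_screen_s: float = 2.0) -> bool:
    """Accept a detection as a face, using auxiliary evidence when confidence is low."""
    if confidence > threshold:
        return True
    # Fall back to how often and for how long the same face was seen.
    return appearances >= min_appearances and on_screen_s >= min_on_screen_s


print(accept_face(0.95, appearances=1, on_screen_s=0.5))
print(accept_face(0.60, appearances=5, on_screen_s=4.0))
print(accept_face(0.60, appearances=1, on_screen_s=0.5))
```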

In some embodiments, determining the content information of the at least one frame of image comprises: recognizing the faces in the at least one frame of image and generating a face seed set, wherein the face seed set comprises the unique faces that appear in the at least one frame of image, each corresponding to a distinct person; and determining a timeline of each distinct person in the video based on the time points at which each face in the face seed set appears.

In some embodiments, the computing device 120 may track timeline information for the occurrence of various characters in the video 110. In some embodiments, by tracking timeline information of the appearance of each character in the video 110, the motion context of each character in the scene of the video 110 can be understood, thereby being helpful for describing information such as the plot of the video 110.

In some embodiments, a face seed set is generated while tracking the timeline information of the individual characters appearing in the video 110. Generating the face seed set comprises identifying a first face in the at least one frame of image as a first seed in the face seed set. For example, starting from the beginning of the video 110 (other starting points are also possible), a first face is obtained as a first seed face. A subsequently identified second face is then matched against the first face. If the match succeeds, the second face is marked as belonging to the same person as the first face. In some embodiments, in order to generate a good-quality face seed set, whichever of the first face and the second face is more recognizable is kept as the seed (described further below). That is, while the seed set is being generated, individual seeds in the seed set may be dynamically updated rather than being immutable. If the match fails, the second face is added to the seed set as a second seed. A third face is then identified and matched, and the process repeats, with each subsequently recognized face matched in turn against all of the seed faces in the face seed set. By matching the faces recognized at different time points in this pairwise manner, a face seed set is finally obtained in which each face is unique in the video, i.e., the seed set contains only one recognized face for each different person.

As mentioned above, in order to generate a good-quality face seed set, the most recognizable of the different faces of the same person is used as the face seed. For high-quality face recognition, the seed face should be a representative face, such as a frontal face, a face of a proper size, and the like. Therefore, a quality score is computed for each face as a weighted combination of information such as the angle of the face, the proportion of the frame occupied by the face, the completeness of the face, and the position of the face in the frame. The faces are then sorted by score, and the top-ranked face is used as the face seed. In this way, a good-quality face seed can be obtained. With good-quality face seeds, a newly recognized face can easily be compared with the seeds (i.e., the matching operation) in subsequent face recognition, which improves recognition efficiency and reliability.

In some embodiments, the recognized faces are aligned with the celebrity recognition results, i.e., each recognized face is associated with the corresponding recognized celebrity. When a recognized face belongs to a non-celebrity, the face is likewise associated with the corresponding person to determine information such as that person's timeline. In some embodiments, the time points at which a face appears are integrated to form a timeline of the person's appearances.
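The seed set construction described above can be sketched as follows. The sketch assumes each face is represented by an embedding vector compared with cosine similarity, and that a scalar `quality` value stands in for recognizability; the `Face` type, the 0.9 match threshold, and taking the first matching seed rather than the best one are all assumptions, since the source does not specify the matching method.

```python
import math
from dataclasses import dataclass


@dataclass
class Face:
    embedding: list[float]  # hypothetical feature vector from a face model
    quality: float          # hypothetical recognizability score
    time_s: float           # time point at which the face appears


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_seed_set(faces: list[Face], match_threshold: float = 0.9):
    """Return (seeds, time_points): one seed per distinct person and the times each person appears."""
    seeds: list[Face] = []
    time_points: list[list[float]] = []
    for face in faces:
        for i, seed in enumerate(seeds):
            if cosine(face.embedding, seed.embedding) >= match_threshold:
                time_points[i].append(face.time_s)
                # Dynamically update the seed if the new face is more recognizable.
                if face.quality > seed.quality:
                    seeds[i] = face
                break
        else:
            # No seed matched: this face becomes a new seed.
            seeds.append(face)
            time_points.append([face.time_s])
    return seeds, time_points


faces = [
    Face([1.0, 0.0], quality=0.5, time_s=0.0),
    Face([0.0, 1.0], quality=0.7, time_s=1.0),
    Face([0.99, 0.05], quality=0.9, time_s=2.0),  # same person as the first face
]
seeds, time_points = build_seed_set(faces)
print(len(seeds), time_points)
```

The per-seed lists of time points are what a later step would integrate into each person's timeline.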
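A minimal sketch of the weighted scoring and ranking follows. The source names the factors (angle, proportion, completeness, position) but not their weights or value ranges, so the weights below, the assumption that each factor is normalized to the range 0 to 1, and the function names are all hypothetical.

```python
# Hypothetical weights; the source names the factors but not how they are weighted.
WEIGHTS = {"angle": 0.4, "size_ratio": 0.2, "completeness": 0.3, "position": 0.1}


def quality_score(face: dict, weights: dict = WEIGHTS) -> float:
    """Weighted combination of per-face factors, each assumed to lie in [0, 1]."""
    return sum(w * face.get(name, 0.0) for name, w in weights.items())


def pick_seed(faces: list[dict]) -> dict:
    """Sort candidate faces of one person by score and return the top-ranked one."""
    ranked = sorted(faces, key=quality_score, reverse=True)
    return ranked[0]


frontal = {"id": "frontal", "angle": 1.0, "size_ratio": 0.5, "completeness": 1.0, "position": 0.5}
profile = {"id": "profile", "angle": 0.2, "size_ratio": 0.5, "completeness": 0.6, "position": 0.5}
print(pick_seed([profile, frontal])["id"])
```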
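Integrating a person's appearance time points into a timeline might be done as in the sketch below, which merges time points that lie close together into intervals. The 2-second gap tolerance and the name `merge_time_points` are assumptions; the source does not describe how the time points are combined.

```python
def merge_time_points(points: list[float], max_gap_s: float = 2.0) -> list[tuple[float, float]]:
    """Merge appearance time points into (start, end) intervals."""
    intervals: list[list[float]] = []
    for t in sorted(points):
        if intervals and t - intervals[-1][1] <= max_gap_s:
            # Close enough to the previous appearance: extend that interval.
            intervals[-1][1] = t
        else:
            intervals.append([t, t])
    return [(start, end) for start, end in intervals]


print(merge_time_points([0.0, 1.0, 2.0, 10.0, 11.0, 30.0]))
```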

In some embodiments, the method for generating the video summary further comprises extracting audio from the video and performing speech recognition on the audio to generate text information of the speech. For example, the computing device 120 separates the audio portion from the video 110 and segments the audio when the speech in the video 110 is long. In some embodiments, the segmentation is performed using a time-sharing approach. Time-sharing simply divides the audio in the video 110 at a predetermined period. For example, the audio is divided every 5 minutes, i.e., into 5-minute segments. The time-sharing method has the advantage of being simple and quick.

In certain embodiments, the segmentation is performed using a spectral method. The spectral method refers to determining the behavior of a signal by observing how its amplitude varies with frequency (the amplitude-frequency characteristic). In some cases, how the phase of the signal varies with frequency (the phase-frequency characteristic) is also considered. Therefore, in some embodiments, the audio signal is segmented by observing its characteristics with the spectral method, so that each audio segment contains a complete sentence.

In some embodiments, the results of speech recognition are subjected to preliminary screening, statistics, and organizing operations. For example, irrelevant information such as advertisement information in the speech recognition text is filtered out. Word segmentation, keyword recognition, and entity recognition are performed on the speech recognition text, i.e., the words, keywords, and entities in the text are distinguished. For example, people, animals, cars, houses, and other various objects in the video 110 are identified.

In some embodiments, the text is arranged in chronological order, and the speech text is aligned with the temporal position of the corresponding segment of speech in the video 110, i.e., it is kept consistent in time with the speech in the original video 110.
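The time-sharing split is straightforward to sketch: given the audio duration, it produces consecutive segment boundaries of a fixed length. The function name is hypothetical, and the default of 300 seconds follows the 5-minute example in the text.

```python
def split_fixed(duration_s: float, segment_s: float = 300.0) -> list[tuple[float, float]]:
    """Divide an audio track of `duration_s` seconds into consecutive fixed-length segments."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)  # the last segment may be shorter
        segments.append((start, end))
        start = end
    return segments


# A 12-minute track split into 5-minute segments.
print(split_fixed(720.0))
```

Because the boundaries ignore the speech itself, a fixed split can cut a sentence in half, which is the trade-off against its simplicity.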

In this way, by extracting audio from the video 110 and performing speech recognition on the audio to generate text information of speech, it is possible to combine text information obtained from the audio with text information obtained from an image of the video 110 in subsequent processing, thereby more accurately recognizing content in the video 110.

In some embodiments, determining the content information of the at least one frame of image further comprises: matching the text information with the content information in a corresponding time period; and determining the successfully matched part as effective content information. In this way, by matching the text information obtained from the audio with the text information obtained from the video image, the accuracy of the recognized content can be ensured.

In some embodiments, the results of recognition by OCR are matched with the results of speech recognition. The subtitle is considered valid if the text recognized by OCR at a given time point is also found in the speech text for that time, i.e., the speech-recognized text and the OCR-recognized text correspond to each other. Otherwise, the result may be discarded. In this way, the reliability of the recognized subtitles can be ensured. In some embodiments, the results need not simply be discarded even if the speech-recognized text and the OCR-recognized text do not correspond exactly. Instead, the OCR-recognized text may, for example, be taken as the basis and given a certain confidence, e.g., 80%, i.e., the recognition result has a confidence level of 80%. The recognition result may then be further verified in conjunction with other information. The above-described manner is merely an example, and the embodiments of the present disclosure are not limited to the above-described manner, but may be variously modified.

At 206, computing device 120 may perform structured processing on the content information to generate structured information. The structured information may be structured information in a predetermined format to represent content information of various modalities. Structured information, also referred to as structured data, is generally data logically represented and implemented by a two-dimensional table structure, strictly following data format and length specifications, and is primarily stored and managed by a relational database. The identified content information is originally unordered. In some embodiments, by generating the structured information from the identified content, the corresponding structured information may be extracted as needed in the subsequent process of generating the video summary. In this way, the subsequent generation of the video summary is facilitated.
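As an illustrative sketch of this cross-check, the function below looks up the speech text covering each OCR time point and assigns full confidence on an exact substring match and a reduced confidence otherwise. The data shapes, the name `validate_subtitles`, and the use of plain substring matching are assumptions; the 0.8 fallback follows the example confidence mentioned in the text.

```python
def validate_subtitles(ocr_items, speech_segments, fallback_confidence=0.8):
    """Cross-check OCR subtitles against speech recognition text.

    ocr_items: list of (time_s, text) recognized from frames.
    speech_segments: list of (start_s, end_s, text) from speech recognition.
    Returns a list of (time_s, text, confidence).
    """
    results = []
    for time_s, text in ocr_items:
        # Collect the speech text whose time span covers the OCR time point.
        spoken = " ".join(t for start, end, t in speech_segments if start <= time_s <= end)
        confidence = 1.0 if text and text in spoken else fallback_confidence
        results.append((time_s, text, confidence))
    return results


ocr = [(3.0, "hello everyone"), (12.0, "welcome bck")]
speech = [(0.0, 5.0, "hello everyone and welcome"), (10.0, 15.0, "welcome back to the show")]
print(validate_subtitles(ocr, speech))
```

Entries left at the reduced confidence are the ones that would be further verified against other information, or discarded under the stricter variant.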

In some embodiments, generating structured information based on the content information comprises: carrying out structuralization processing on the text information to generate structuralized text information; and aligning the structured textual information with a corresponding time segment of the video. In this way, text obtained through speech recognition can be associated with a timeline in a video, facilitating subsequent generation of a video summary.

In some embodiments, generating the structured information comprises: acquiring the speaking content of the character in the video 110 based on the text information and the content information; matching the acquired speaking content with key information in the content information, and determining the association relationship between each key information and the character; and combining the key information based on the association relationship and the timeline to obtain the structured information of the video 110. In this way, text information obtained by speech recognition can be combined with text information obtained by image recognition, so that the reliability of recognition can be improved.
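One simplified way to combine the speech text with the character timelines is sketched below: each speech segment is turned into a record listing the persons whose timeline overlaps it. Treating on-screen presence as the association between speech and person is a simplification assumed here, and the record field names are hypothetical.

```python
def associate_speech(person_intervals, speech_segments):
    """Build time-ordered records linking speech segments to the persons on screen.

    person_intervals: {name: [(start_s, end_s), ...]} timelines per person.
    speech_segments: [(start_s, end_s, text), ...] from speech recognition.
    """
    records = []
    for start, end, text in sorted(speech_segments):
        persons = sorted(
            name
            for name, intervals in person_intervals.items()
            # Two intervals overlap when each starts before the other ends.
            if any(s <= end and start <= e for s, e in intervals)
        )
        records.append({"start_s": start, "end_s": end, "persons": persons, "text": text})
    return records


timelines = {"Alice": [(0.0, 10.0)], "Bob": [(8.0, 20.0)]}
speech = [(9.0, 12.0, "here are the rules"), (2.0, 4.0, "hi")]
for record in associate_speech(timelines, speech):
    print(record)
```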

At 208, the computing device 120 may generate a summary of the video 110 based on the structured information. For example, after relatively complete structured information of the video 110 has been obtained, the data can be extracted and organized as required according to an established standard. In some embodiments, the basic standard information for generating the summary is obtained from the configuration information of the video type (which is preset configuration information, such as a configuration template, containing various fields, such as video name, main person, video category, era, etc.). For example, for a competition-type variety show video, it may be desirable for the video description fields to include: participant, name of the competition, rules of the competition, result of the competition, etc. Corresponding information is extracted from the structured information according to the fields of the video description. If there is no pre-specified configuration information, only the title, or all of the information in the video, may be given.

In some embodiments, generating the summary of the video 110 based on the structured information includes: acquiring corresponding information from the structured information based on predetermined configuration information, wherein the predetermined configuration information specifies the type of the structured information to be acquired. For example, different configuration information can be set according to different user requirements, and different types of structured information can be specified. For example, for a variety show video, the celebrity information therein can be particularly highlighted. And for detective video, the scene of the serendipity can be particularly highlighted. The corresponding information is then organized based on a predetermined template or through a deep-learning-based language model (e.g., a language model such as GPT-3) to form a video summary.
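A small sketch of configuration-driven extraction is given below. The configuration maps a video type to the fields to pull from the structured information; the field names are hypothetical English renderings of the fields listed in the text, and returning all available information when no configuration exists is one of the fallbacks the text mentions.

```python
# Hypothetical configuration: video type -> fields to extract.
CONFIG = {
    "competition": ["participant", "competition_name", "competition_rule", "competition_result"],
}


def extract_fields(structured: dict, video_type: str, config: dict = CONFIG) -> dict:
    """Pick the fields that the configuration specifies for this video type."""
    fields = config.get(video_type)
    if fields is None:
        # No pre-specified configuration: fall back to all available information.
        return dict(structured)
    return {name: structured[name] for name in fields if name in structured}


structured = {"participant": ["Alice", "Bob"], "competition_name": "Relay", "era": "modern"}
print(extract_fields(structured, "competition"))
print(extract_fields(structured, "documentary"))
```

Different users or video types would simply supply a different entry in the configuration.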

In some embodiments, the relevant descriptions are extracted from the structured data according to the aforementioned standard information, and the extracted results are organized to form the summary of the video. In some embodiments, the previously generated structured data may be used directly. In some embodiments, if a piece of text is to be generated separately, the following schemes may be used. One is a template-based method, in which, for example, the text first introduces the contestants, second introduces the rules of the competition, third introduces the performance, and so on. The other is a deep-learning-based method, such as using a language model such as GPT-3 for controllable text generation. For example, suppose "Zhang San", "road", and "car" are the structured information. The controllable text generation result may then be, for example: "Zhang San drives a car on the road." In this way, the video summary can be generated flexibly and efficiently.

The above embodiments provide a method for generating a video summary. On one hand, the method can efficiently and automatically generate video summaries for various videos and avoids or significantly reduces manual participation, so that a large amount of labor cost is saved. On the other hand, the video summary can be generated in a targeted and flexible manner according to the requirements of the user, which improves user satisfaction.
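The template-based option can be sketched with ordinary string formatting. The template wording, the field names, and the "unknown" placeholder for missing fields are assumptions made for illustration; the language-model option is not shown because it depends on an external model.

```python
TEMPLATE = "Contestants: {participants}. Rules: {rules}. Result: {result}."


class _DefaultFields(dict):
    """Supplies a placeholder for any field missing from the structured information."""

    def __missing__(self, key):
        return "unknown"


def render_summary(info: dict, template: str = TEMPLATE) -> str:
    """Fill a fixed template: contestants first, then rules, then result."""
    values = {
        key: ", ".join(value) if isinstance(value, list) else str(value)
        for key, value in info.items()
    }
    return template.format_map(_DefaultFields(values))


print(render_summary({"participants": ["Alice", "Bob"], "rules": "best of three"}))
```

A template keeps the output predictable, while a language model would allow freer phrasing from the same structured fields.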

Fig. 3 shows a block diagram of an apparatus for generating a video summary according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus 300 may include: an image acquisition module 310 configured to acquire at least one frame of image in a video; a content information determination module 320 configured to determine content information of the at least one frame of image, the content information representing main content contained in the at least one frame of image; a structured information generation module 330 configured to perform structured processing on the content information to generate structured information; and a video summary generation module 340 configured to generate a summary of the video based on the structured information.

In some embodiments, the image acquisition module 310 may include: an image recognition module configured to perform frame recognition on the video at a preset period to acquire the at least one frame of image; and an image normalization module configured to perform normalization processing on the acquired at least one frame of image to generate normalized images of the same size.
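As a rough illustration of acquiring frames at a preset period and normalizing them to a common size, the following sketch uses only the standard library. The function names, the frame-rate and period parameters, and the choice of nearest-neighbor resizing are assumptions; the disclosure does not name a sampling rule or a resizing method, and a real system would likely use an image library.

```python
from typing import List

def sample_frame_indices(total_frames: int, fps: float, period_s: float) -> List[int]:
    """Return the indices of frames taken once every `period_s` seconds."""
    step = max(1, round(fps * period_s))  # at least one frame per step
    return list(range(0, total_frames, step))

def resize_nearest(image: List[List[int]], out_h: int, out_w: int) -> List[List[int]]:
    """Resize a 2-D grayscale image (list of rows) with nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# A 100-frame clip at 25 frames per second, sampled once per second.
indices = sample_frame_indices(total_frames=100, fps=25.0, period_s=1.0)

# Two frames of different sizes normalized to the same 4 x 4 size.
frame_a = [[1, 2], [3, 4]]
frame_b = [[5, 6, 7, 8]]
normalized = [resize_nearest(f, 4, 4) for f in (frame_a, frame_b)]
```

After normalization every frame has the same height and width, so downstream recognition models can take the frames as a uniform batch.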

In some embodiments, the content information determination module 320 may include: an object information identification module configured to identify information of an object in at least one frame of image based on a deep learning method; and an object information determination module configured to take information of the identified object as content information.

In some embodiments, the object information identification module may include: a face recognition module configured to recognize a face in the image, wherein, in a case where the confidence of the recognized face is higher than a threshold, the recognized face is determined to be a human face; and in a case where the confidence of the recognized face is lower than the threshold, whether the recognized face is a human face is determined based on other information about the recognized face in the content information.
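The two-branch decision above can be sketched as follows. The threshold value, the notion of "supporting cues" (for example, an overlapping person detection or a name recognized in a caption), and the rule that a single cue suffices are all assumptions for illustration; the disclosure only states that other information in the content information is consulted when confidence is low. The case where confidence equals the threshold is not addressed in the text and is treated here as the low-confidence case.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class FaceCandidate:
    confidence: float
    # Hypothetical "other information" about this face from the content information.
    supporting_cues: Set[str] = field(default_factory=set)

def accept_face(candidate: FaceCandidate,
                threshold: float = 0.8,
                min_cues: int = 1) -> bool:
    """Accept high-confidence faces directly; otherwise fall back to other cues."""
    if candidate.confidence > threshold:
        return True
    # Low confidence: decide from the other available information (assumed rule).
    return len(candidate.supporting_cues) >= min_cues

clear_face = FaceCandidate(confidence=0.95)
blurry_face = FaceCandidate(confidence=0.50)
blurry_but_supported = FaceCandidate(confidence=0.50, supporting_cues={"person_box"})
```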

In some embodiments, the content information determination module may include: a face seed set generation module configured to identify faces in the at least one frame of image and generate a face seed set, wherein the face seed set comprises unique faces that appear in the at least one frame of image and correspond to non-repeated persons; and a timeline determination module configured to determine, based on the time points at which each face in the face seed set appears, a timeline of each non-repeated person appearing in the video.
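One possible way to turn per-frame appearance time points into a timeline per person is to merge nearby time points into continuous segments, as sketched below. The `(person_id, time)` observation format, the function name `build_timelines`, and the `max_gap_s` merging rule are assumptions; the disclosure does not say how time points are grouped.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def build_timelines(observations: List[Tuple[str, float]],
                    max_gap_s: float = 2.0) -> Dict[str, List[Tuple[float, float]]]:
    """Group each person's appearance times into (start, end) segments."""
    by_person: Dict[str, List[float]] = defaultdict(list)
    for person_id, time_s in observations:
        by_person[person_id].append(time_s)

    timelines: Dict[str, List[Tuple[float, float]]] = {}
    for person_id, times in by_person.items():
        times.sort()
        segments = [[times[0], times[0]]]
        for t in times[1:]:
            if t - segments[-1][1] <= max_gap_s:
                segments[-1][1] = t          # extend the current segment
            else:
                segments.append([t, t])      # start a new segment
        timelines[person_id] = [(s, e) for s, e in segments]
    return timelines

# Hypothetical observations: which seed face was matched at which second.
observations = [("A", 0.0), ("A", 1.0), ("A", 2.0), ("B", 1.0), ("A", 10.0), ("A", 11.0)]
timelines = build_timelines(observations)
```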

In some embodiments, the face seed set generation module may include: a face recognition module configured to recognize a first face in the at least one frame of image as a first seed in the face seed set; and a face matching module configured to match a subsequently recognized second face with the first face, wherein, in a case of a match, whichever of the first face and the second face has higher recognizability is used as the first seed; in a case of no match, the second face is added to the face seed set as a second seed; and the above process is repeated, with each subsequently recognized face matched in turn against all the seed faces in the face seed set, to generate the face seed set.
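The seed set procedure above can be sketched as a single pass over the recognized faces. Representing a face by an embedding vector, matching by cosine similarity against a fixed threshold, and using a numeric `quality` score as the measure of recognizability are assumptions made for this illustration; the disclosure does not specify the matching method or how recognizability is measured.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Face:
    embedding: Sequence[float]  # hypothetical feature vector for the face
    quality: float              # hypothetical recognizability score
    time_s: float               # time point at which the face appears

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_seed_set(faces: List[Face], match_threshold: float = 0.9) -> List[Face]:
    """Keep one seed per distinct person, preferring the more recognizable face."""
    seeds: List[Face] = []
    for face in faces:
        for i, seed in enumerate(seeds):
            if cosine(face.embedding, seed.embedding) >= match_threshold:
                if face.quality > seed.quality:
                    seeds[i] = face   # the new face replaces the weaker seed
                break
        else:
            seeds.append(face)        # no seed matched: a new person
    return seeds

faces = [
    Face([1.0, 0.0], quality=0.5, time_s=0.0),
    Face([0.99, 0.05], quality=0.9, time_s=1.0),  # same person, clearer shot
    Face([0.0, 1.0], quality=0.7, time_s=2.0),    # a different person
]
seeds = build_seed_set(faces)
```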

In some embodiments, the apparatus for generating a video summary further comprises: an audio extraction module configured to extract audio from the video; and a speech recognition module configured to perform speech recognition on the audio to generate text information of the speech.

In some embodiments, the content information determination module may include: a matching module configured to match the text information with the content information in a corresponding time period; and a valid content information determination module configured to determine the successfully matched portion as valid content information.
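A minimal sketch of matching speech text against content information within the same time period is given below. The tuple formats for speech segments and content items, the function name `valid_content`, and the use of case-insensitive substring matching are assumptions; the disclosure does not define what counts as a successful match.

```python
from typing import List, Tuple

SpeechSegment = Tuple[float, float, str]  # (start_s, end_s, recognized text)
ContentItem = Tuple[float, str]           # (time_s, label recognized in the image)

def valid_content(speech_segments: List[SpeechSegment],
                  content_items: List[ContentItem]) -> List[ContentItem]:
    """Keep content items that are mentioned in speech covering the same time."""
    valid = []
    for time_s, label in content_items:
        for start_s, end_s, text in speech_segments:
            in_period = start_s <= time_s <= end_s
            if in_period and label.lower() in text.lower():
                valid.append((time_s, label))
                break  # one matching segment is enough
    return valid

speech = [
    (0.0, 5.0, "Zhang San drives a car on the road"),
    (5.0, 10.0, "Now the game begins"),
]
content = [(1.0, "car"), (2.0, "tree"), (7.0, "car")]
matched = valid_content(speech, content)
```

Here the "car" seen at second 1 is kept because the overlapping speech mentions it, while the "tree" and the later "car" are dropped because no speech in their time period refers to them.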

In some embodiments, the structured information generation module may include: a structured text information generation module configured to perform structured processing on the text information to generate structured text information; and an alignment module configured to align the structured text information with a corresponding time period of the video.

In some embodiments, the structured information generation module may include: a speaking content acquisition module configured to acquire the speaking content of a character in the video based on the text information and the content information; a key information matching module configured to match the acquired speaking content with key information in the content information and determine the association between each piece of key information and a character; and a key information combination module configured to combine the key information based on the association and the timeline to obtain the structured information of the video.
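The association and combination steps above might look like the following sketch. The utterance format `(speaker, start time, text)`, the function name `associate`, the substring-based matching, and the record layout are assumptions for illustration; the disclosure does not prescribe how the association is computed or stored.

```python
from typing import Dict, List, Tuple, Union

Utterance = Tuple[str, float, str]  # (speaker, start_s, speaking content)
Record = Dict[str, Union[float, str, List[str]]]

def associate(utterances: List[Utterance], key_info: List[str]) -> List[Record]:
    """Attach each piece of key information to the speaker who mentions it,
    and order the resulting records along the timeline."""
    records: List[Record] = []
    for speaker, start_s, text in sorted(utterances, key=lambda u: u[1]):
        mentioned = [k for k in key_info if k.lower() in text.lower()]
        if mentioned:
            records.append({"time_s": start_s, "person": speaker, "key_info": mentioned})
    return records

# Hypothetical speaking content and key information.
utterances = [
    ("Li Si", 12.0, "The rules are simple"),
    ("Zhang San", 3.0, "I drove my car on the road"),
]
key_info = ["car", "road", "rules"]
structured_records = associate(utterances, key_info)
```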

In some embodiments, the video summary generation module comprises: a structured information acquisition module configured to acquire corresponding information from the structured information based on predetermined configuration information, wherein the predetermined configuration information specifies the type of structured information to be acquired; and a structured information organization module configured to organize the corresponding information based on a predetermined template or through a deep-learning-based language model to form the video summary.

The above embodiments provide an apparatus for generating a video summary. On the one hand, the apparatus can efficiently and automatically generate video summaries for various videos, avoiding or significantly reducing manual participation and thereby substantially reducing labor costs. On the other hand, the video summary can be generated in a targeted and flexible manner according to the requirements of the user, which improves user satisfaction.

In order to more clearly show the technical solution of the present disclosure, one of the specific embodiments according to the present disclosure will be described below with reference to the drawings. Fig. 4 shows a schematic block diagram of an apparatus for generating a video summary according to an embodiment of the present disclosure.

As shown in fig. 4, in some embodiments, an apparatus 400 for generating a video summary may comprise an image acquisition module 310 configured to acquire at least one frame of image in a video, and a perception module 440 configured to perceive content information in the at least one frame of image. In some embodiments, the perception module 440 includes an object detection unit 442 configured to detect objects in the image, such as persons, animals, and various other objects. The perception module 440 also includes an OCR unit 444 configured to recognize text information, such as subtitle information, in the image. In some embodiments, the perception module 440 further includes a celebrity identification unit 446 configured to identify celebrities in the image, which is particularly suitable for variety videos. In some embodiments, the perception module 440 further includes a face recognition unit 448 configured to recognize faces in the image.

In some embodiments, the apparatus 400 may include a speech extraction module 410 configured to extract audio from the video. In some embodiments, the apparatus 400 may include a speech text generation module 420 configured to perform speech recognition on the audio extracted by the speech extraction module 410 to generate text information of the speech. The apparatus 400 may further include a key content obtaining module 430 configured to obtain key content information from the speech text generated by the speech text generation module 420 and from the content information perceived by the perception module 440. In some embodiments, the apparatus 400 may include a people tracking module 450 configured to track timeline information of persons in the video, and a character content matching module 460 configured to fuse the persons with the previously acquired key information. In some embodiments, the apparatus 400 may include a video information configuration module 470 configured to preset configuration information of the video summary, for example, the fields contained in a configuration template, such as video name, main character, video category, and era. In some embodiments, the apparatus 400 may include a content organization module 480 configured to organize the previously generated structured information, for example, by extracting relevant descriptions from the structured data according to the information configured by the video information configuration module 470 and arranging the descriptions in a predetermined format. In some embodiments, the apparatus 400 may include a video summary generation module 340 configured to generate a video summary based on the structured information output from the content organization module 480.

In some embodiments, the perception module 440 is configured to implement character-OCR-speech matching: the time points at which a character appears are combined with the speech, and the voice is recognized through a deep learning model to determine the speech information of that character. The speaking content of the character is then determined from the recognized speech and the subtitle information recognized by OCR. The determined speaking content is matched with the extracted key information to determine the attribution of each piece of key information, that is, the person or persons to whom each piece of key information belongs. In addition, the key information is recombined according to time and the timelines of the characters to obtain all the structured information of the video content. With the structured information in place, the corresponding structured information can be conveniently extracted as needed in the subsequent process of generating the video summary.

Fig. 5 illustrates a block diagram of a computing device 500 capable of implementing multiple embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the device 500 comprises a computing unit 501, which may perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard, a mouse, or the like; an output unit 507, such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509, such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the various methods and processes described above, such as the processes 200, 300. For example, in some embodiments, the processes 200, 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by the computing unit 501, one or more steps of the processes 200, 300 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the processes 200, 300 in any other suitable manner (e.g., by way of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A method of generating a video summary, comprising:

acquiring at least one frame of image in a video;

determining content information of the at least one frame of image, wherein the content information represents main content contained in the at least one frame of image;

performing structuring processing on the content information to generate structured information; and

generating a summary of the video based on the structured information.

2. The method of claim 1, wherein acquiring at least one frame of image in a video comprises:

performing frame recognition on the video at a preset period to acquire at least one frame of image; and

performing normalization processing on the acquired at least one frame of image to generate normalized images of the same size.

3. The method of claim 1, wherein determining content information for the at least one frame of image comprises:

identifying information of an object in the at least one frame of image based on a deep learning method; and

using the identified information of the object as the content information.

4. The method of claim 3, wherein identifying information of an object in the at least one frame of image comprises:

identifying a face in the image;

determining the identified face to be a human face in a case where the confidence of the identified face is higher than a threshold; and

in a case where the confidence of the identified face is lower than the threshold, determining, based on other information about the identified face in the content information, whether the identified face is a human face.

5. The method of claim 1, wherein determining content information for the at least one frame of image comprises:

identifying faces in the at least one frame of image and generating a face seed set, wherein the face seed set comprises unique faces that appear in the at least one frame of image and correspond to non-repeated persons; and

determining, based on the time points at which each face in the face seed set appears, a timeline of each non-repeated person appearing in the video.

6. The method of claim 5, wherein generating a set of face seeds comprises:

identifying a first face in the at least one frame of image as a first seed in the face seed set;

matching a subsequently recognized second face with the first face;

in a case of a match, using whichever of the first face and the second face has higher recognizability as the first seed;

in a case of no match, adding the second face to the face seed set as a second seed; and

repeating the above process, sequentially matching each subsequently identified face with all the seed faces in the face seed set, to generate the face seed set.

7. The method of any of claims 1 to 6, further comprising:

extracting audio from the video; and

performing speech recognition on the audio to generate text information of the speech.

8. The method of claim 7, wherein determining content information for the at least one frame of image comprises:

matching the text information with the content information in a corresponding time period; and

determining the successfully matched part as valid content information.

9. The method of claim 7, wherein performing structuring processing on the content information to generate structured information comprises:

performing structuring processing on the text information to generate structured text information; and

aligning the structured textual information with a corresponding time period of the video.

10. The method of claim 7, wherein generating structured information comprises:

acquiring the speaking content of the character in the video based on the text information and the content information;

matching the acquired speaking content with key information in the content information, and determining the association between each piece of key information and the character; and

combining the key information based on the association and the timeline to obtain the structured information of the video.

11. The method of claim 1, wherein generating the summary of the video based on the structured information comprises:

acquiring corresponding information from the structured information based on predetermined configuration information, wherein the predetermined configuration information specifies the type of structured information to be acquired; and

organizing the corresponding information based on a predetermined template or through a language model based on deep learning to form the video summary.

12. An apparatus for generating a video summary, comprising:

an image acquisition module configured to acquire at least one frame of image in a video;

a content information determination module configured to determine content information of the at least one frame of image, the content information representing main content contained in the at least one frame of image;

a structured information generation module configured to perform structured processing on the content information to generate structured information; and

a video summary generation module configured to generate a summary of the video based on the structured information.

13. The apparatus of claim 12, wherein the image acquisition module comprises:

an image recognition module configured to perform frame recognition on the video at a preset period to acquire the at least one frame of image; and

an image normalization module configured to perform normalization processing on the acquired at least one frame of image to generate normalized images of the same size.

14. The apparatus of claim 12, wherein the content information determination module comprises:

an object information identification module configured to identify information of an object in the at least one frame of image based on a deep learning method; and

an object information determination module configured to take information of the identified object as the content information.

15. The apparatus of claim 14, wherein the object information identification module comprises:

a face recognition module configured to recognize a face in the image, wherein, in a case where the confidence of the recognized face is higher than a threshold, the recognized face is determined to be a human face; and in a case where the confidence of the recognized face is lower than the threshold, whether the recognized face is a human face is determined based on other information about the recognized face in the content information.

16. The apparatus of claim 12, wherein the content information determination module comprises:

a face seed set generation module configured to identify a face in the at least one frame of image, and generate a face seed set, where the face seed set includes a unique face corresponding to a non-repetitive person appearing in the at least one frame of image; and

a timeline determination module configured to determine a timeline of occurrences of the non-repetitive persons in the video based on time points of occurrences of respective faces in the face seed set.

17. The apparatus of claim 16, wherein the face seed set generation module comprises:

a face recognition module configured to recognize a first face in the at least one frame of image as a first seed in the face seed set; and

a face matching module configured to match a subsequently recognized second face with the first face, wherein, in a case of a match, whichever of the first face and the second face has higher recognizability is used as the first seed; in a case of no match, the second face is added to the face seed set as a second seed; and the above process is repeated, with each subsequently recognized face matched in turn against all the seed faces in the face seed set, to generate the face seed set.

18. The apparatus of any of claims 12 to 17, further comprising:

an audio extraction module configured to extract audio from the video; and

a speech recognition module configured to perform speech recognition on the audio to generate text information of the speech.

19. The apparatus of claim 18, wherein the content information determination module comprises:

a matching module configured to match the text information with the content information at a corresponding time period; and

a valid content information determination module configured to determine the successfully matched part as valid content information.

20. The apparatus of claim 18, wherein the structured information generation module comprises:

a structured text information generation module configured to perform structured processing on the text information to generate structured text information; and

an alignment module configured to align the structured textual information with a respective time period of the video.

21. The apparatus of claim 18, wherein the structured information generation module comprises:

a speaking content acquisition module configured to acquire the speaking content of the character in the video based on the text information and the content information;

a key information matching module configured to match the acquired speaking content with key information in the content information and determine the association between each piece of key information and the character; and

a key information combining module configured to combine the key information based on the association and the timeline to obtain the structured information of the video.

22. The apparatus of claim 12, wherein the video summary generation module comprises:

a structured information acquisition module configured to acquire corresponding information from the structured information based on predetermined configuration information, wherein the predetermined configuration information specifies the type of structured information to be acquired; and

a structured information organization module configured to organize the corresponding information based on a predetermined template or through a deep learning based language model to form the video summary.

23. An electronic device, the electronic device comprising:

one or more processors; and

storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of any one of claims 1-11.

24. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-11.

25. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-11.