CN116935287A - Video understanding method and device - Google Patents

Video understanding method and device

Info

Publication number
CN116935287A
CN116935287A (application CN202310987215.6A)
Authority
CN
China
Prior art keywords
video
text
key frame
time stamp
matched
Prior art date
Legal status
Pending
Application number
CN202310987215.6A
Other languages
Chinese (zh)
Inventor
张弛
王鹏程
Current Assignee
Baidu com Times Technology Beijing Co Ltd
Original Assignee
Baidu com Times Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Baidu com Times Technology Beijing Co Ltd
Priority to CN202310987215.6A
Publication of CN116935287A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The disclosure provides a video understanding method and device, relating to the field of artificial intelligence and in particular to computer vision, natural language processing, deep learning, and related technologies. One embodiment of the method comprises the following steps: extracting key frames from a video; associating each key frame with its corresponding timestamp to obtain a key frame-timestamp association; performing image understanding on the key frames to obtain a description text for each key frame; associating the key frame-timestamp association with the description text of each key frame to obtain a three-dimensional key frame-timestamp-text mapping; and generating a description text of the video based on the key frame-timestamp-text mapping. This embodiment improves the versatility of video understanding.

Description

Video understanding method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of computer vision, natural language processing, deep learning, and the like.
Background
With the rapid development of natural language processing and computer vision, video understanding is becoming a new hotspot following image understanding.
Currently, video understanding mainly takes one of two approaches: the first recognizes the speech in a video, converts it to text, and summarizes that text; the second, in scenarios such as object detection, human skeleton recognition, and action recognition, identifies the objects and people appearing in the video and then describes the video's actions, objects, and so on.
Disclosure of Invention
Embodiments of the disclosure provide a video understanding method, apparatus, device, storage medium, and program product.
In a first aspect, an embodiment of the present disclosure provides a video understanding method, including: extracting key frames from a video; associating each key frame with its corresponding timestamp to obtain a key frame-timestamp association; performing image understanding on the key frames to obtain a description text for each key frame; associating the key frame-timestamp association with the description text of each key frame to obtain a three-dimensional key frame-timestamp-text mapping; and generating a description text of the video based on the key frame-timestamp-text mapping.
In a second aspect, an embodiment of the present disclosure provides a video understanding apparatus, including: a key frame extraction module configured to extract key frames from a video; a first establishing module configured to associate each key frame with its corresponding timestamp to obtain a key frame-timestamp association; an image understanding module configured to perform image understanding on the key frames to obtain a description text for each key frame; a second establishing module configured to associate the key frame-timestamp association with the description text of each key frame to obtain a three-dimensional key frame-timestamp-text mapping; and a text generation module configured to generate a description text of the video based on the key frame-timestamp-text mapping.
In a third aspect, an embodiment of the present disclosure proposes an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in the first aspect.
In a fifth aspect, embodiments of the present disclosure propose a computer program product comprising a computer program which, when executed by a processor, implements a method as described in the first aspect.
The video understanding method provided by embodiments of the disclosure performs video content understanding on the basis of image understanding: by recognizing the images in a video, the video's content is extracted and summarized into text that can be retrieved. The method can be applied in fields such as video-to-text conversion, converting long videos into short videos, and video retrieval, and improves the versatility of video understanding.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of a video understanding method according to the present disclosure;
FIG. 2 is a flow chart of yet another embodiment of a video understanding method according to the present disclosure;
FIG. 3 is a flow chart of another embodiment of a video understanding method according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a video understanding method according to the present disclosure;
FIG. 5 is a scene diagram of a video understanding method in which embodiments of the present disclosure may be implemented;
FIG. 6 is a schematic diagram of the structure of one embodiment of a video understanding apparatus according to the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing a video understanding method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates a flow 100 of one embodiment of a video understanding method according to the present disclosure. The video understanding method comprises the following steps:
Step 101, extracting key frames from the video to obtain key frames.
In this embodiment, the execution body of the video understanding method may perform key frame extraction on the video to obtain a key frame.
In general, key frames are extracted by one of three main methods: first, clustering; second, a shot-based method; third, fixed-interval extraction.
The clustering method clusters the video frames of the video into a plurality of classes and selects some video frames from each class as key frames. The main idea is to group similar video frames together into classes and pick part of the frames in each class as key frames. Specifically, a metric is first defined to measure the similarity between two video frames; then, according to that metric, similar video frames are grouped into classes; finally, at least one video frame is selected from each class as a key frame. There are many possible criteria for selecting key frames, such as choosing the frame at the center of a class or the most representative frame in a class. Extracting key frames by clustering effectively reduces redundant frames.
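For illustration only (not part of the original disclosure), the following is a minimal Python sketch of the clustering approach, assuming OpenCV and scikit-learn are available; the color-histogram feature and k-means clustering are illustrative choices rather than anything prescribed by the disclosure:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_key_frames(video_path, num_classes=5):
        cap = cv2.VideoCapture(video_path)
        feats = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # A color histogram serves as a simple frame-similarity feature.
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            feats.append(cv2.normalize(hist, hist).flatten())
        cap.release()
        feats = np.array(feats)
        # Group similar frames into classes.
        km = KMeans(n_clusters=num_classes, random_state=0).fit(feats)
        key_frame_indices = []
        for c in range(num_classes):
            members = np.where(km.labels_ == c)[0]
            # Take the frame closest to the class center as that class's key frame.
            dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
            key_frame_indices.append(int(members[dists.argmin()]))
        return sorted(key_frame_indices)

Picking the member closest to the cluster center corresponds to the "frame at the center of a class" criterion mentioned above.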
The shot method cuts the video into a plurality of shot video clips and selects some video frames from each clip as key frames. Specifically, the video is first cut into individual shot video clips, one clip per shot; at least one video frame is then selected from each shot video clip as a key frame. There are many possible criteria for choosing the key frame of a shot, such as the first frame of the clip, the middle frame of the clip, or the most representative frame of the clip. Extracting key frames shot by shot keeps the amount of computation relatively small, mainly because only the video frames of each shot, rather than all video frames, need to be processed.
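Again for illustration only, a sketch of the shot method under the assumption that a shot boundary can be detected from a large histogram change between consecutive frames (dedicated shot-detection tools could equally be used); the middle frame of each shot is taken as its key frame:

    import cv2

    def shot_key_frames(video_path, cut_threshold=0.5):
        cap = cv2.VideoCapture(video_path)
        shots, current = [], []
        prev_hist, idx = None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            if prev_hist is not None:
                # A large histogram change between consecutive frames is treated as a shot cut.
                if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > cut_threshold:
                    shots.append(current)
                    current = []
            current.append(idx)
            prev_hist, idx = hist, idx + 1
        if current:
            shots.append(current)
        cap.release()
        # One key frame per shot: here, the middle frame of the shot.
        return [shot[len(shot) // 2] for shot in shots]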
The fixed-interval method extracts video frames from the video at a preset interval to obtain key frames. For example, for a 30 fps video, one frame may be taken every 5 frames; for a 60 fps video, one frame may be taken every 10 frames; and so on.
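A fixed-interval sketch, again illustrative only and assuming OpenCV:

    import cv2

    def fixed_interval_key_frames(video_path, interval=5):
        cap = cv2.VideoCapture(video_path)
        key_frame_indices, idx = [], 0
        while True:
            ok, _frame = cap.read()
            if not ok:
                break
            # Keep one frame out of every `interval` frames.
            if idx % interval == 0:
                key_frame_indices.append(idx)
            idx += 1
        cap.release()
        return key_frame_indices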
It should be noted that the key frame extraction methods applicable to the present disclosure are not limited to the above three; any strategy may be used, as long as frames can be extracted from the video and the timestamps of the extracted frames are preserved.
Step 102, establishing association between the key frame and the corresponding time stamp to obtain the association relation between the key frame and the time stamp.
In this embodiment, the executing body may establish an association between the key frame and the corresponding timestamp, so as to obtain an association relationship between the key frame and the timestamp.
During video playback, each video frame is displayed at a different time, and this display time is the frame's timestamp. Here, a one-to-one association is established between each key frame and its corresponding timestamp, yielding the key frame-timestamp association.
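As an illustrative sketch of step 102 (the disclosure does not fix a data structure for the association), the timestamp of a frame can be derived from its index and the frame rate, giving a one-to-one key frame-timestamp mapping:

    import cv2

    def attach_timestamps(video_path, key_frame_indices):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if the container reports no frame rate
        cap.release()
        # One-to-one association: key frame index -> display time in seconds.
        return {idx: idx / fps for idx in key_frame_indices}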
Step 103, carrying out image understanding on the key frames to obtain description texts of the key frames.
In this embodiment, the execution body may perform image understanding on the key frame to obtain a description text of the key frame. The description text of the key frame may be text describing the content contained in the key frame.
Using an image understanding algorithm, the key frames can be understood and associated with their description texts, yielding a key frame-text association. For example, a key frame is input into a pre-trained image understanding model to obtain its description text. The image understanding model may be a pre-trained model such as CLIP (Contrastive Language-Image Pre-training) or BLIP (Bootstrapping Language-Image Pre-training). CLIP is a neural network trained on a variety of (image, text) pairs; given an image, it can be instructed in natural language to predict the most relevant text snippet without being directly optimized for that task. BLIP is a general and efficient pre-training strategy that bootstraps vision-language pre-training from an off-the-shelf frozen image encoder and a frozen large language model.
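For illustration only, a captioning sketch using a BLIP checkpoint published on Hugging Face; the model name and the transformers API are assumptions about the environment, not part of the disclosure:

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def describe_key_frame(image_path):
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        # Decode the generated token ids into the key frame's description text.
        return processor.decode(out[0], skip_special_tokens=True)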
Step 104, establishing association between the association relation of the key frame and the time stamp and the description text of the key frame to obtain a three-dimensional mapping relation of the key frame, the time stamp and the text.
In this embodiment, the executing body may establish an association between the association relationship between the key frame and the timestamp and the description text of the key frame, so as to obtain a three-dimensional mapping relationship between the key frame and the timestamp and the text.
Because key frames correspond one-to-one with timestamps, and key frames correspond one-to-one with description texts, associating each key frame with its corresponding timestamp and description text yields the three-dimensional key frame-timestamp-text mapping. Within this mapping, the corresponding key frame can be looked up from a text and a timestamp, or the corresponding text can be looked up from a key frame and a timestamp.
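A minimal sketch of the three-way mapping as a list of records (one possible representation; the disclosure does not prescribe one), supporting lookups in either direction:

    def build_mapping(frame_to_timestamp, frame_to_text):
        # Merge the two pairwise associations into key frame-timestamp-text records.
        return [
            {"key_frame": idx, "timestamp": ts, "text": frame_to_text[idx]}
            for idx, ts in sorted(frame_to_timestamp.items())
        ]

    def frames_for_text(mapping, query):
        # Lookup in one direction: text -> (key frame, timestamp).
        return [(r["key_frame"], r["timestamp"]) for r in mapping if query in r["text"]]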
Step 105, generating descriptive text of the video based on the three-dimensional mapping relation of the key frame-time stamp-text.
In this embodiment, the execution body may generate the description text of the video based on the three-dimensional mapping relationship of the key frame-timestamp-text. The description text of the video may be text describing content contained in the video.
Using a language model, the texts corresponding to all key frames are summarized to obtain the description text of the video. For example, the text-timestamp associations are extracted from the key frame-timestamp-text mapping and input into a pre-trained language model, which outputs the description text of the video. The text-timestamp association is a time-ordered textual summary; mapping it back to the video yields an understanding and summary of the video, that is, the video's description text. The language model may be, for example, an LLM (Large Language Model). LLMs are artificial intelligence models designed to understand and generate human language. They are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and so on. LLMs are characterized by their large scale, containing billions of parameters that help them learn complex patterns in language data. These models are typically based on deep learning architectures such as Transformers, which helps them achieve impressive performance on a variety of NLP tasks.
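For illustration, a summarization sketch in which call_llm is a hypothetical placeholder for whatever LLM interface is available; it is assumed to take a prompt string and return the model's text reply:

    def summarize_video(mapping, call_llm):
        # Time-ordered text-timestamp pairs extracted from the three-way mapping.
        lines = [f"[{r['timestamp']:.1f}s] {r['text']}" for r in mapping]
        prompt = ("Below are time-ordered descriptions of key frames from one video.\n"
                  + "\n".join(lines)
                  + "\nSummarize what happens in this video in a few sentences.")
        # call_llm is a placeholder, not a real API of any particular library.
        return call_llm(prompt)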
The video understanding method provided by embodiments of the disclosure performs video content understanding on the basis of image understanding: by recognizing the images in a video, the video's content is extracted and summarized into text that can be retrieved. The method can be applied in fields such as video-to-text conversion, converting long videos into short videos, and video retrieval, and improves the versatility of video understanding.
With continued reference to fig. 2, a flow 200 of yet another embodiment of a video understanding method according to the present disclosure is shown. The video understanding method comprises the following steps:
step 201, extracting key frames from the video to obtain key frames.
Step 202, establishing association between the key frame and the corresponding time stamp to obtain the association relation between the key frame and the time stamp.
Step 203, carrying out image understanding on the key frames to obtain description texts of the key frames.
Step 204, establishing association between the association relation of the key frame and the time stamp and the description text of the key frame to obtain a three-dimensional mapping relation of the key frame, the time stamp and the text.
In step 205, descriptive text of the video is generated based on the three-dimensional mapping relationship of the key frame-timestamp-text.
In this embodiment, the specific operations of steps 201 to 205 are described in detail in steps 101 to 105 of the embodiment shown in fig. 1 and will not be repeated here.
Step 206, receiving video retrieval text input by a user.
In this embodiment, a user with a video retrieval need may input a video retrieval text, and the execution body of the video understanding method receives the video retrieval text input by the user. For example, a user who wants to retrieve a video of a seal swimming may enter the text "seal swimming".
Step 207, matching the video retrieval text with the description text of the video, and determining the matched video.
In this embodiment, the execution body may match the video search text with the description text of the video, and determine the matched video.
In general, the similarity between the video search text and the descriptive text of each video is calculated, and the video with the highest similarity is taken as the matched video.
Step 208, matching the video retrieval text with the three-dimensional mapping relation of the key frame-time stamp-text of the matched video to obtain the matched key frame and the matched time stamp.
In this embodiment, the execution body may match the video search text with the three-dimensional mapping relationship of the key frame-timestamp-text of the matched video, to obtain the matched key frame and the matched timestamp.
In general, the similarity between the video retrieval text and each text in the matched video's three-dimensional mapping relation is calculated, and the texts whose similarity exceeds a preset similarity threshold are selected. The corresponding key frames and timestamps are then looked up from those texts within the three-dimensional mapping relation; these are the matched key frames and matched timestamps.
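For illustration only, a matching sketch that uses TF-IDF cosine similarity purely as a stand-in for whatever text-similarity measure is actually used (a text-embedding model would serve the same role):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def match_text_to_mapping(query, mapping, sim_threshold=0.2):
        texts = [r["text"] for r in mapping]
        vec = TfidfVectorizer().fit(texts + [query])
        sims = cosine_similarity(vec.transform([query]), vec.transform(texts))[0]
        # Texts whose similarity exceeds the threshold yield the matched
        # key frames and matched timestamps.
        return [(mapping[i]["key_frame"], mapping[i]["timestamp"])
                for i, s in enumerate(sims) if s > sim_threshold]

The same stand-in could serve the matching described in steps 308 and 406 below.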
Step 209, searching for a corresponding position in the matched video based on the matched key frame and the matched timestamp.
In this embodiment, the executing body may search for a corresponding location in the matched video based on the matched key frame and the matched timestamp.
Step 210, intercepting the video clip at the position and sending the video clip to the user.
In this embodiment, the executing body may intercept the video clip at the location and send the video clip to the user. For example, a start time and an end time are set before and after the searched position respectively, and a video clip between the start time and the end time is intercepted and sent to the user.
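For illustration, a sketch of cutting the clip around the matched position by invoking the FFMPEG command-line tool; the fixed margin before and after the matched timestamp is an assumption, and stream copy cuts land on the nearest key frame rather than being frame-accurate:

    import subprocess

    def cut_clip(video_path, matched_ts, out_path, margin=5.0):
        # Set a start time and an end time before and after the matched position.
        start = max(0.0, matched_ts - margin)
        end = matched_ts + margin
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-ss", f"{start:.2f}", "-to", f"{end:.2f}",
            "-c", "copy", out_path,
        ], check=True)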
As can be seen from fig. 2, the video understanding method in this embodiment highlights the video retrieval step compared to the corresponding embodiment of fig. 1. Therefore, the scheme described in the embodiment realizes video retrieval based on video understanding, and improves the universality of video understanding.
With further reference to fig. 3, a flow 300 of another embodiment of a video understanding method according to the present disclosure is shown. The video understanding method comprises the following steps:
step 301, extracting a key frame from the video to obtain a key frame.
Step 302, establishing an association between the key frame and the corresponding timestamp to obtain an association relationship between the key frame and the timestamp.
Step 303, carrying out image understanding on the key frame to obtain the description text of the key frame.
Step 304, establishing association between the association relation of the key frame and the time stamp and the description text of the key frame to obtain the three-dimensional mapping relation of the key frame, the time stamp and the text.
In step 305, descriptive text of the video is generated based on the three-dimensional mapping relationship of the key frame-timestamp-text.
In this embodiment, the specific operations of steps 301 to 305 are described in detail in steps 101 to 105 of the embodiment shown in fig. 1 and will not be repeated here.
Step 306, receiving video clip text of a video input by a user.
In this embodiment, a user with a video clipping need may input a video clip text, and the execution body of the video understanding method receives the video clip text input by the user. For example, a user who wants to remove the ship from a video may enter the text "remove the portion of the video in which the ship appears".
Step 307, the video clip text is understood, and the video clip intention of the user is obtained.
In this embodiment, the executing body may understand the video clip text to obtain the video clip intent of the user. The video clip intent may contain information such as the clip subject and the clip action.
The video clip text can be understood using a language model to obtain the video clip intent. For example, the video clip text is input into a pre-trained language model, such as an LLM, which outputs the video clip intent. Given the text "remove the portion of the video in which the ship appears", the LLM can determine that the clip subject is "ship" and the clip action is "remove".
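For illustration, an intent-parsing sketch where call_llm is again a hypothetical placeholder for the language model interface; the JSON output format is an assumption made for the example:

    import json

    def parse_clip_intent(clip_text, call_llm):
        prompt = ("Extract the clip subject and the clip action from this editing request, "
                  'and answer only with JSON of the form {"subject": ..., "action": ...}.\n'
                  f"Request: {clip_text}")
        # e.g. "remove the portion of the video in which the ship appears"
        # -> {"subject": "ship", "action": "remove"}
        return json.loads(call_llm(prompt))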
Step 308, matching the video clip intention with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched start time stamp and end time stamp.
In this embodiment, the executing body may match the video clip intention with the three-dimensional mapping relationship of the key frame-timestamp-text of the video, to obtain the matched start timestamp and end timestamp.
In general, the similarity between the clip intent and each text in the video's three-dimensional mapping relation is calculated, and the texts whose similarity exceeds a preset similarity threshold are selected. The corresponding key frames and timestamps are then looked up from those texts within the three-dimensional mapping relation; these are the matched key frames and matched timestamps. The matched start timestamp and end timestamp are further determined from the matched key frames and matched timestamps, such that the matched key frames fall within the range defined by the start timestamp and the end timestamp.
Step 309, clipping the video based on the matched start time stamp and end time stamp to obtain a clipped video.
In this embodiment, the executing body may clip the video based on the matched start timestamp and end timestamp to obtain a clipped video. For example, the video may be clipped with a video tool such as FFMPEG, cutting the video segment within the range defined by the start timestamp and the end timestamp.
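For the 'remove' case, one illustrative way to drop the segment between the matched start and end timestamps with FFMPEG is to stream-copy the parts before and after it and concatenate them; this assumes both parts share the same codecs, which holds when they come from the same source file:

    import os
    import subprocess
    import tempfile

    def remove_segment(video_path, start_ts, end_ts, out_path):
        with tempfile.TemporaryDirectory() as tmp:
            head = os.path.join(tmp, "head.mp4")
            tail = os.path.join(tmp, "tail.mp4")
            # Part of the video before the matched segment.
            subprocess.run(["ffmpeg", "-y", "-i", video_path, "-to", f"{start_ts:.2f}",
                            "-c", "copy", head], check=True)
            # Part of the video after the matched segment.
            subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ss", f"{end_ts:.2f}",
                            "-c", "copy", tail], check=True)
            # Concatenate the two remaining parts into the clipped video.
            parts = os.path.join(tmp, "parts.txt")
            with open(parts, "w") as f:
                f.write(f"file '{head}'\nfile '{tail}'\n")
            subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                            "-i", parts, "-c", "copy", out_path], check=True)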
As can be seen from fig. 3, the video understanding method in this embodiment highlights the video editing step compared to the corresponding embodiment of fig. 1. Therefore, the scheme described in the embodiment realizes video editing based on video understanding, and improves the universality of video understanding.
With further reference to fig. 4, a flow 400 of yet another embodiment of a video understanding method according to the present disclosure is shown. The video understanding method comprises the following steps:
step 401, extracting key frames from the video to obtain key frames.
Step 402, establishing association between the key frame and the corresponding timestamp to obtain the association relationship between the key frame and the timestamp.
Step 403, performing image understanding on the key frame to obtain a description text of the key frame.
Step 404, establishing association between the association relation of the key frame and the time stamp and the description text of the key frame to obtain the three-dimensional mapping relation of the key frame and the time stamp and the text.
In step 405, descriptive text of the video is generated based on the three-dimensional mapping relationship of the key frame-timestamp-text.
In this embodiment, the specific operations of steps 401 to 405 are described in detail in steps 101 to 105 of the embodiment shown in fig. 1 and will not be repeated here.
Step 406, matching the description text of the video with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched key frame and a matched time stamp.
In this embodiment, the execution body of the video understanding method may match the description text of the video with the three-dimensional mapping relationship of the key frame-timestamp-text of the video, so as to obtain a matched key frame and a matched timestamp.
In general, the similarity between the description text of the video and each text in the three-dimensional mapping relation is calculated, and the texts whose similarity exceeds a preset similarity threshold are selected. The corresponding key frames and timestamps are then looked up from those texts within the three-dimensional mapping relation; these are the matched key frames and matched timestamps.
Step 407, searching the matched video for a corresponding position based on the matched key frame and the matched timestamp, and clipping the video based on the position to obtain the target video.
In this embodiment, the executing body may search for the corresponding position in the matched video based on the matched key frame and the matched timestamp, and clip the video based on that position to obtain the target video. For example, a start time and an end time are set before and after the found position, respectively, and the video clip between the start time and the end time is cut out as the target video. The length of the target video is much shorter than the length of the original video.
As can be seen from fig. 4, the video understanding method in this embodiment highlights the long video to short video step compared to the corresponding embodiment of fig. 1. Therefore, the scheme described in the embodiment realizes the conversion of long video into short video based on video understanding, and improves the universality of video understanding.
For ease of understanding, fig. 5 shows a scene diagram of a video understanding method in which embodiments of the present disclosure may be implemented.
First, a key frame is extracted from a video v1 to obtain a key frame image kf1, a key frame image kf2, a key frame image kf3, a key frame image kf4 and a key frame image kf5.
Second, key frame image kf1 is associated with a corresponding timestamp t1, key frame image kf2 is associated with a corresponding timestamp t2, key frame image kf3 is associated with a corresponding timestamp t3, key frame image kf4 is associated with a corresponding timestamp t4, and key frame image kf5 is associated with a corresponding timestamp t5.
Third, key frame image kf1, key frame image kf2, key frame image kf3, key frame image kf4 and key frame image kf5 are each input into an image understanding model (BLIP) for image understanding, obtaining key frame kf1 -> text description txt1, key frame kf2 -> text description txt2, key frame kf3 -> text description txt3, key frame kf4 -> text description txt4 and key frame kf5 -> text description txt5.
Fourth, the text description txt1, text description txt2, text description txt3, text description txt4 and text description txt5 are input into a large language model (LLM) for summarization, obtaining the text description vtxt1 of video v1.
With further reference to fig. 6, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a video understanding apparatus, which corresponds to the method embodiment shown in fig. 1, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the video understanding apparatus 600 of the present embodiment may include: a key frame extraction module 601, a first creation module 602, an image understanding module 603, a second creation module 604, and a text generation module 605. The key frame extraction module 601 is configured to extract a key frame from the video to obtain the key frame; a first establishing module 602, configured to establish an association between the key frame and the corresponding timestamp, to obtain an association relationship between the key frame and the timestamp; an image understanding module 603 configured to perform image understanding on the key frame to obtain a description text of the key frame; a second establishing module 604, configured to establish an association between the association relationship between the key frame and the timestamp and the descriptive text of the key frame, so as to obtain a three-dimensional mapping relationship between the key frame and the timestamp and the text; the text generation module 605 is configured to generate descriptive text of the video based on the three-dimensional mapping relationship of the key frame-timestamp-text.
In the present embodiment, in the video understanding apparatus 600: specific processing of the key frame extraction module 601, the first establishing module 602, the image understanding module 603, the second establishing module 604, and the text generating module 605 and technical effects thereof may refer to the relevant descriptions of steps 101-105 in the corresponding embodiment of fig. 1, and are not repeated herein.
In some optional implementations of the present embodiment, the key frame extraction module 601 is further configured to: clustering video frames of the video to obtain a plurality of classes; and selecting partial video frames from the multiple classes respectively as key frames.
In some optional implementations of the present embodiment, the key frame extraction module 601 is further configured to: performing shot cutting on the video to obtain a plurality of shot video clips; and selecting partial video frames from the plurality of shot video clips respectively as key frames.
In some optional implementations of the present embodiment, the key frame extraction module 601 is further configured to: and extracting video frames of the video according to a preset interval to obtain key frames.
In some optional implementations of the present embodiment, the image understanding module 603 is further configured to: and inputting the key frames into a pre-trained image understanding model to obtain the description text of the key frames.
In some alternative implementations of the present embodiment, the text generation module 605 is further configured to: extracting the association relation of the text-time stamp from the three-dimensional mapping relation of the key frame-time stamp-text; and inputting the association relation of the text and the timestamp into a pre-trained language model to obtain the descriptive text of the video.
In some optional implementations of the present embodiment, the video understanding apparatus 600 further includes: a first receiving module configured to receive video retrieval text input by a user; the first matching module is configured to match the video retrieval text with the description text of the video and determine a matched video; the second matching module is configured to match the video retrieval text with the three-dimensional mapping relation of the key frame-time stamp-text of the matched video to obtain the matched key frame and the matched time stamp; a first lookup module configured to find a corresponding location in the matched video based on the matched key frame and the matched timestamp; and the first interception module is configured to intercept the video clips at the position and send the video clips to the user.
In some optional implementations of the present embodiment, the video understanding apparatus 600 further includes: a second receiving module configured to receive video clip text of a video input by a user; the second understanding module is configured to understand the video clip text to obtain the video clip intention of the user; the third matching module is configured to match the video clip intention with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched start time stamp and end time stamp; and the first clipping module is configured to clip the video based on the matched start time stamp and end time stamp to obtain a clipped video.
In some optional implementations of the present embodiment, the video understanding apparatus 600 further includes: the fourth matching module is configured to match the descriptive text of the video with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched key frame and a matched time stamp; and the second interception module is configured to search the corresponding position in the matched video based on the matched key frame and the matched timestamp, and clip the video based on the position to obtain the target video.
In the technical solution of the present disclosure, the collection, storage, and use of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as a video understanding method. For example, in some embodiments, the video understanding method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the video understanding method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video understanding method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A video understanding method, comprising:
extracting key frames from the video to obtain key frames;
establishing association between the key frame and the corresponding time stamp to obtain an association relation of the key frame and the time stamp;
carrying out image understanding on the key frame to obtain a description text of the key frame;
establishing association between the association relation of the key frame and the time stamp and the description text of the key frame to obtain a three-dimensional mapping relation of the key frame, the time stamp and the text;
and generating descriptive text of the video based on the three-dimensional mapping relation of the key frame, the time stamp and the text.
2. The method of claim 1, wherein the performing key frame extraction on the video to obtain a key frame comprises:
clustering video frames of the video to obtain a plurality of classes;
and selecting partial video frames from the classes respectively as the key frames.
3. The method of claim 1, wherein the performing key frame extraction on the video to obtain a key frame comprises:
performing shot cutting on the video to obtain a plurality of shot video clips;
and respectively selecting partial video frames from the plurality of shot video clips as the key frames.
4. The method of claim 1, wherein the performing key frame extraction on the video to obtain a key frame comprises:
and extracting video frames of the video according to a preset interval to obtain the key frames.
5. The method of claim 1, wherein the performing image understanding on the key frame to obtain the descriptive text of the key frame includes:
and inputting the key frame into a pre-trained image understanding model to obtain the description text of the key frame.
6. The method of claim 1, wherein the generating descriptive text of the video based on the three-dimensional mapping of key frame-timestamp-text comprises:
extracting the association relation of the text-time stamp from the three-dimensional mapping relation of the key frame-time stamp-text;
and inputting the association relation of the text and the timestamp into a pre-trained language model to obtain the descriptive text of the video.
7. The method of any of claims 1-6, wherein the method further comprises:
receiving video retrieval text input by a user;
matching the video retrieval text with the description text of the video, and determining a matched video;
matching the video retrieval text with the three-dimensional mapping relation of the key frame-time stamp-text of the matched video to obtain a matched key frame and a matched time stamp;
searching a corresponding position in the matched video based on the matched key frame and the matched timestamp;
and intercepting the video clips at the position and sending the video clips to the user.
8. The method of any of claims 1-6, wherein the method further comprises:
receiving video clip text of the video input by a user;
understanding the video clip text to obtain the video clip intention of the user;
matching the video editing intention with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched starting time stamp and a matched ending time stamp;
and editing the video based on the matched start time stamp and end time stamp to obtain an editing video.
9. The method of any of claims 1-6, wherein the method further comprises:
matching the description text of the video with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched key frame and a matched time stamp;
searching a corresponding position in the matched video based on the matched key frame and the matched timestamp, and editing the video based on the position to obtain a target video.
10. A video understanding apparatus comprising:
the key frame extraction module is configured to extract key frames of the video to obtain key frames;
the first establishing module is configured to establish association between the key frames and the corresponding time stamps to obtain association relation between the key frames and the time stamps;
the image understanding module is configured to perform image understanding on the key frames to obtain description texts of the key frames;
the second establishing module is configured to establish association between the association relation of the key frame and the time stamp and the description text of the key frame to obtain a three-dimensional mapping relation of the key frame and the time stamp and the text;
and the text generation module is configured to generate descriptive text of the video based on the three-dimensional mapping relation of the key frame, the time stamp and the text.
11. The apparatus of claim 10, wherein the key frame extraction module is further configured to:
clustering video frames of the video to obtain a plurality of classes;
and selecting partial video frames from the classes respectively as the key frames.
12. The apparatus of claim 10, wherein the key frame extraction module is further configured to:
performing shot cutting on the video to obtain a plurality of shot video clips;
and respectively selecting partial video frames from the plurality of shot video clips as the key frames.
13. The apparatus of claim 10, wherein the key frame extraction module is further configured to:
and extracting video frames of the video according to a preset interval to obtain the key frames.
14. The apparatus of claim 10, wherein the image understanding module is further configured to:
and inputting the key frame into a pre-trained image understanding model to obtain the description text of the key frame.
15. The apparatus of claim 10, wherein the text generation module is further configured to:
extracting the association relation of the text-time stamp from the three-dimensional mapping relation of the key frame-time stamp-text;
and inputting the association relation of the text and the timestamp into a pre-trained language model to obtain the descriptive text of the video.
16. The apparatus of any of claims 10-15, wherein the apparatus further comprises:
a first receiving module configured to receive video retrieval text input by a user;
the first matching module is configured to match the video retrieval text with the description text of the video and determine a matched video;
the second matching module is configured to match the video retrieval text with the three-dimensional mapping relation of the key frame-time stamp-text of the matched video to obtain a matched key frame and a matched time stamp;
a first lookup module configured to find a corresponding location in the matched video based on the matched keyframe and the matched timestamp;
and the first intercepting module is configured to intercept the video clips at the position and send the video clips to the user.
17. The apparatus of any of claims 10-15, wherein the apparatus further comprises:
a second receiving module configured to receive video clip text of the video input by a user;
the second understanding module is configured to understand the video clip text to obtain the video clip intention of the user;
the third matching module is configured to match the video clip intention with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched start time stamp and an end time stamp;
and the first clipping module is configured to clip the video based on the matched start time stamp and end time stamp to obtain a clipped video.
18. The apparatus of any of claims 10-15, wherein the apparatus further comprises:
the fourth matching module is configured to match the descriptive text of the video with the three-dimensional mapping relation of the key frame-time stamp-text of the video to obtain a matched key frame and a matched time stamp;
and the second intercepting module is configured to search a corresponding position in the matched video based on the matched key frame and the matched timestamp, and clip the video based on the position to obtain a target video.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
CN202310987215.6A 2023-08-07 2023-08-07 Video understanding method and device Pending CN116935287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310987215.6A CN116935287A (en) 2023-08-07 2023-08-07 Video understanding method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310987215.6A CN116935287A (en) 2023-08-07 2023-08-07 Video understanding method and device

Publications (1)

Publication Number Publication Date
CN116935287A (en) 2023-10-24

Family

ID=88377159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310987215.6A Pending CN116935287A (en) 2023-08-07 2023-08-07 Video understanding method and device

Country Status (1)

Country Link
CN (1) CN116935287A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117692676A (en) * 2023-12-08 2024-03-12 广东创意热店互联网科技有限公司 Video quick editing method based on artificial intelligence technology
CN117809218A (en) * 2023-12-29 2024-04-02 浙江博观瑞思科技有限公司 Electronic shop descriptive video processing system and method
CN117809218B (en) * 2023-12-29 2024-09-17 浙江博观瑞思科技有限公司 Electronic shop descriptive video processing system and method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination