WO2023173539A1 - Video content processing method and system, and terminal and storage medium - Google Patents

Video content processing method and system, and terminal and storage medium Download PDF

Info

Publication number
WO2023173539A1
WO2023173539A1 · PCT/CN2022/089559 · CN2022089559W
Authority
WO
WIPO (PCT)
Prior art keywords
video
text information
title
processed
neural network
Prior art date
Application number
PCT/CN2022/089559
Other languages
French (fr)
Chinese (zh)
Inventor
潘芸倩
叶静娴
奚悦
包小溪
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2023173539A1 publication Critical patent/WO2023173539A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

A video content processing method and system, and a terminal and a storage medium. The method comprises: extracting audio signals and text information from videos to be processed; extracting video images from said videos and performing image analysis on the video images so as to determine the video types of said videos, wherein the video types include a PPT video, a single-person video and a multi-person video; and, on the basis of the audio signals and the text information, extracting highlight clips from said videos of the different types by using a multi-modal video processing model, extracting the titles, summaries and tag information corresponding to the highlight clips by using a deep neural network model, and generating a short-video clipping result of said videos. The present application can generate a plurality of finely clipped short videos in one click, thereby greatly improving clipping efficiency and shortening the short-video production cycle.

Description

一种视频内容处理方法、系统、终端及存储介质A video content processing method, system, terminal and storage medium
本申请要求于2022年03月16日提交中国专利局、申请号为202210259504.X、发明名称为“一种视频内容处理方法、系统、终端及存储介质”的中国专利申请的优先权，其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the China Patent Office on March 16, 2022, with application number 202210259504.X and entitled "一种视频内容处理方法、系统、终端及存储介质" (A video content processing method, system, terminal and storage medium), the entire contents of which are incorporated herein by reference.
技术领域Technical field
本申请涉及人工智能之深度学习技术领域,特别是涉及一种视频内容处理方法、系统、终端及存储介质。This application relates to the field of deep learning technology of artificial intelligence, and in particular to a video content processing method, system, terminal and storage medium.
背景技术Background technique
目前以视频为代表的富媒体信息成为主流,其中短视频是中国消费者接触内容最多的形式,小屏幕、短视频、快节奏成为视频行业的发展趋势。At present, rich media information represented by video has become mainstream, among which short video is the most popular form of content for Chinese consumers. Small screen, short video, and fast pace have become the development trend of the video industry.
视频行业高速发展的同时也对视频处理效率和质量提出了更高的要求。发明人意识到，目前的视频内容处理主要依靠人工作业为主，内容处理工具操作门槛高、人才培育成本高，且视频的人工精剪耗时较长，在一定程度上阻碍了视频领域的发展。The rapid development of the video industry has also placed higher demands on video processing efficiency and quality. The inventors realized that current video content processing relies mainly on manual work: content processing tools have a high operating threshold, talent cultivation is costly, and manual fine editing of videos is time-consuming, which to a certain extent hinders the development of the video field.
技术问题technical problem
本申请提供了一种视频内容处理方法、系统、终端及存储介质，旨在解决现有的视频内容处理依靠人工作业存在的操作门槛高、人才培育成本高以及视频精剪耗时较长等技术问题。This application provides a video content processing method, system, terminal and storage medium, aiming to solve the technical problems of existing manual video content processing, namely the high operating threshold, the high cost of talent cultivation and the long time consumed by fine video editing.
技术解决方案Technical solutions
为解决上述技术问题,本申请采用的技术方案为:In order to solve the above technical problems, the technical solutions adopted in this application are:
一种视频内容处理方法,包括:A video content processing method, including:
提取待处理视频中的音频信号以及文本信息;Extract audio signals and text information from the video to be processed;
提取所述待处理视频中的视频图像,对所述视频图像进行图像分析,判断所述待处理视频的视频类型;所述视频类型包括PPT视频、单人视频以及多人视频;Extract video images in the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos, and multi-person videos;
基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息，生成所述待处理视频的短视频剪辑结果。Based on the audio signal and text information, use a multi-modal video processing model to extract highlight clips from the different types of videos to be processed, use a deep neural network model to extract the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the video to be processed.
本申请实施例采取的另一技术方案为:一种视频内容处理系统,包括:Another technical solution adopted by the embodiment of the present application is: a video content processing system, including:
多模态信息提取模块:用于提取待处理视频中的音频信号以及文本信息;Multi-modal information extraction module: used to extract audio signals and text information from the video to be processed;
视频类型判断模块：用于提取所述待处理视频中的视频图像，对视频图像进行图像分析，判断所述待处理视频的视频类型；所述视频类型包括PPT视频、单人视频以及多人视频；Video type judgment module: used to extract video images from the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos and multi-person videos;
视频剪辑模块：用于基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。Video editing module: used to extract, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed by using a deep neural network model, and generate a short-video clipping result of the video to be processed.
本申请实施例采取的又一技术方案为:一种终端,所述终端包括处理器、与所述处理器耦接的存储器,其中,Another technical solution adopted by the embodiment of the present application is: a terminal, the terminal includes a processor and a memory coupled to the processor, wherein,
所述存储器存储有用于实现上述的视频内容处理方法的程序指令;The memory stores program instructions for implementing the above video content processing method;
所述处理器用于执行所述存储器存储的所述程序指令以执行所述视频内容处理操作。The processor is configured to execute the program instructions stored in the memory to perform the video content processing operations.
本申请实施例采取的又一技术方案为:一种存储介质,存储有处理器可运行的程序指令,所述程序指令用于执行上述的视频内容处理方法。Another technical solution adopted by the embodiments of the present application is: a storage medium that stores program instructions executable by a processor, and the program instructions are used to execute the above video content processing method.
有益效果beneficial effects
本申请实施例的视频内容处理方法、系统、终端及存储介质采用多模态视频内容处理技术，通过提取待处理视频中的音频信号以及文本信息，基于音频信号以及文本信息，利用多模态视频处理模型以及深度学习神经网络模型获取待处理视频中的精华片段以及精华片段对应的标题、摘要以及标签信息。本申请采用全AI处理流程，可以一键生成多个精剪短视频，大大提升了剪辑效率，缩短视频制作周期。The video content processing method, system, terminal and storage medium of the embodiments of the present application adopt multi-modal video content processing technology: the audio signal and text information are extracted from the video to be processed, and, based on the audio signal and text information, a multi-modal video processing model and a deep learning neural network model are used to obtain the highlight clips in the video to be processed together with the titles, summaries and tag information corresponding to the highlight clips. The present application adopts a fully AI-driven processing pipeline that can generate multiple finely edited short videos with one click, greatly improving editing efficiency and shortening the video production cycle.
附图说明Description of the drawings
图1是本申请第一实施例的视频内容处理方法的流程示意图;Figure 1 is a schematic flow chart of a video content processing method according to the first embodiment of the present application;
图2是本申请第二实施例的视频内容处理方法的流程示意图;Figure 2 is a schematic flowchart of a video content processing method according to the second embodiment of the present application;
图3是本申请实施例视频内容处理系统的结构示意图;Figure 3 is a schematic structural diagram of a video content processing system according to an embodiment of the present application;
图4是本申请实施例的终端结构示意图;Figure 4 is a schematic structural diagram of a terminal according to an embodiment of the present application;
图5是本申请实施例的存储介质结构示意图。Figure 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
本发明的实施方式Embodiments of the invention
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请的一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.
本申请中的术语“第一”、“第二”、“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”、“第三”的特征可以明示或者隐含地包括至少一个该特征。本申请的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有明确具体的限定。本申请实施例中所有方向性指示(诸如上、下、左、右、前、后……)仅用于解释在某一特定姿态(如附图所示)下各部件之间的相对位置关系、运动情况等,如果该特定姿态发生改变时,则该方向性指示也相应地随之改变。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。The terms “first”, “second” and “third” in this application are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Thus, features defined as "first", "second", and "third" may explicitly or implicitly include at least one of these features. In the description of this application, "plurality" means at least two, such as two, three, etc., unless otherwise clearly and specifically limited. All directional indications (such as up, down, left, right, front, back...) in the embodiments of this application are only used to explain the relative positional relationship between the components in a specific posture (as shown in the drawings) , sports conditions, etc., if the specific posture changes, the directional indication will also change accordingly. Furthermore, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes Other steps or units inherent to such processes, methods, products or devices.
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
请参阅图1,是本申请第一实施例的视频内容处理方法的流程示意图。本申请第一实施例的视频内容处理方法包括以下步骤:Please refer to FIG. 1 , which is a schematic flowchart of a video content processing method according to the first embodiment of the present application. The video content processing method in the first embodiment of the present application includes the following steps:
S10:提取待处理视频中的音频信号以及文本信息;S10: Extract audio signals and text information from the video to be processed;
S11:提取待处理视频中的视频图像,对视频图像进行图像分析,判断待处理视频的视频类型;视频类型包括PPT视频、单人视频以及多人视频;S11: Extract the video image in the video to be processed, perform image analysis on the video image, and determine the video type of the video to be processed; video types include PPT video, single-person video, and multi-person video;
S12：基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。S12: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the different types of videos to be processed, use the deep neural network model to extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed, and generate a short-video clipping result of the video to be processed.
请参阅图2,是本申请第二实施例的视频内容处理方法的流程示意图。本申请第二实施例的视频内容处理方法包括以下步骤:Please refer to FIG. 2 , which is a schematic flowchart of a video content processing method according to the second embodiment of the present application. The video content processing method in the second embodiment of the present application includes the following steps:
S20:提取待处理视频中的音频信号;S20: Extract the audio signal in the video to be processed;
本步骤中,音频信号提取过程具体为:将待处理视频输入开源框架(FFmpeg),通过开源框架输出待处理视频的音频信号。In this step, the audio signal extraction process is specifically: input the video to be processed into the open source framework (FFmpeg), and output the audio signal of the video to be processed through the open source framework.
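As an illustrative sketch only, the audio-extraction step could be driven from Python by calling the FFmpeg command-line tool; the file names, mono/16 kHz output settings and the use of subprocess are assumptions for the example and are not details taken from the original disclosure.

```python
# Minimal sketch: extract the audio track of a video with the FFmpeg CLI.
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    """Extract the audio stream of `video_path` into a 16 kHz mono WAV file."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,  # raise if FFmpeg fails
    )

extract_audio("input_video.mp4", "audio.wav")  # placeholder file names
```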
S21:对提取的音频信号进行语音转文字处理,生成待处理视频的文本信息;S21: Perform speech-to-text processing on the extracted audio signal to generate text information of the video to be processed;
本步骤中,文本信息即待处理视频的视频字幕。文本信息的生成方式具体为:对音频信号进行语音特征提取,将提取的语音特征输入训练好的声学模型,通过声学模型输出对应的概率得分。基于声学模型输出结果,根据一定的搜索和匹配策略从语言模型中搜索出与音频信号相匹配的文本,输出待处理视频的文本信息识别结果。In this step, the text information is the video subtitles of the video to be processed. The specific method of generating text information is: extract speech features from the audio signal, input the extracted speech features into the trained acoustic model, and output the corresponding probability score through the acoustic model. Based on the acoustic model output results, text matching the audio signal is searched from the language model according to a certain search and matching strategy, and the text information recognition results of the video to be processed are output.
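The following toy sketch is not the actual acoustic or language model described above; the vocabulary, the random scores and the greedy per-frame decoding are stand-ins meant only to illustrate how acoustic-model scores can be combined with a language-model prior during the search and matching step.

```python
# Toy sketch: combine acoustic scores with a language-model prior and decode greedily.
import numpy as np

VOCAB = ["精华", "片段", "视频", "剪辑"]            # toy vocabulary
acoustic_scores = np.random.rand(6, len(VOCAB))     # stand-in acoustic-model output, shape (frames, vocab)
lm_prior = np.array([0.4, 0.3, 0.2, 0.1])            # stand-in language-model probabilities

def decode(acoustic_scores, lm_prior, lm_weight=0.5):
    """Per frame, pick the word maximising log acoustic score + weighted log LM prior."""
    combined = np.log(acoustic_scores + 1e-9) + lm_weight * np.log(lm_prior + 1e-9)
    return [VOCAB[i] for i in combined.argmax(axis=1)]

print("".join(decode(acoustic_scores, lm_prior)))
```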
S22:提取待处理视频中的视频图像,对视频图像进行分析,获取待处理视频的视频类型;S22: Extract the video image in the video to be processed, analyze the video image, and obtain the video type of the video to be processed;
本步骤中,视频类型包括PPT视频、单人视频或多人视频。视频图像的分析过程具体为:将提取的视频图像输入开源框架,通过开源框架抽取视频图像的帧画面,对每一幅帧画面进行分类,得到待处理视频的视频类型。In this step, the video type includes PPT video, single-person video or multi-person video. The specific process of analyzing video images is as follows: input the extracted video images into the open source framework, extract the frames of the video images through the open source framework, classify each frame, and obtain the video type of the video to be processed.
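A minimal sketch of step S22, assuming OpenCV for frame extraction; `classify_frame` is a placeholder standing in for the frame classifier described above, and the sampling interval and majority vote are illustrative assumptions.

```python
# Minimal sketch: sample frames, classify each one, take the majority class as the video type.
from collections import Counter
import cv2  # OpenCV for frame extraction

def classify_frame(frame) -> str:
    """Placeholder classifier: returns 'ppt', 'single_person' or 'multi_person'."""
    return "ppt"  # stand-in; a real model would inspect the frame content

def detect_video_type(video_path: str, sample_every: int = 30) -> str:
    cap = cv2.VideoCapture(video_path)
    labels, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:            # sample one frame every `sample_every` frames
            labels.append(classify_frame(frame))
        index += 1
    cap.release()
    return Counter(labels).most_common(1)[0][0] if labels else "unknown"
```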
S23:判断待处理视频属于PPT视频、单人视频还是多人视频,如果属于PPT视频,执行S24;如果属于单人视频,执行S25;如果属于多人视频,执行S26;S23: Determine whether the video to be processed is a PPT video, a single-person video, or a multi-person video. If it is a PPT video, execute S24; if it is a single-person video, execute S25; if it is a multi-person video, execute S26;
S24：基于音频信号以及文本信息，利用多模态视频处理模型提取PPT视频中的精华片段，并采用深度学习神经网络模型输出精华片段对应的标题、摘要以及标签信息，生成PPT视频的短视频剪辑结果；S24: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the PPT video, use the deep learning neural network model to output the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the PPT video;
本步骤中,PPT视频的处理过程具体为:In this step, the specific processing process of PPT video is as follows:
首先，提取PPT视频中的PPT文字信息；PPT文字信息提取过程包括：对PPT视频进行预处理后，识别PPT视频中的字符信息，并对识别的字符信息进行校正后，得到PPT视频的PPT文字信息。First, the PPT text information in the PPT video is extracted. The PPT text information extraction process includes: preprocessing the PPT video, recognizing the character information in the PPT video, and correcting the recognized characters to obtain the PPT text information of the PPT video.
其次，根据文本信息和PPT文字信息，利用多模态视频处理模型计算相邻两页PPT页面的相似度，根据设定的第一相似度阈值对PPT页面进行筛选，得到所有相似度大于第一相似度阈值的PPT页面；同时，计算每一页PPT页面与全局关键词的相似度，根据设定的第二相似度阈值对PPT页面进行筛选，将所有相似度小于第二相似度阈值的PPT页面丢弃。最后，对筛选出的PPT页面进行拼接，得到PPT视频的精华片段以及精华片段在PPT视频中的时间戳。其中，通过计算文本信息和PPT文字信息中每个词语的权重以及每个词语与当前视频主题的相关性得到全局关键词；第一相似度阈值和第二相似度阈值可根据实际应用场景进行设置。Secondly, based on the text information and the PPT text information, the multi-modal video processing model is used to calculate the similarity between every two adjacent PPT pages, and the pages are filtered by a set first similarity threshold so that all pages whose similarity is greater than the first similarity threshold are kept; at the same time, the similarity between each PPT page and the global keywords is calculated, the pages are filtered by a set second similarity threshold, and all pages whose similarity is less than the second similarity threshold are discarded. Finally, the filtered PPT pages are spliced to obtain the highlight clips of the PPT video and the timestamps of the highlight clips in the PPT video. The global keywords are obtained by calculating the weight of each word in the text information and PPT text information and the relevance of each word to the current video topic; the first and second similarity thresholds can be set according to the actual application scenario.
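A minimal sketch of the two-threshold page filtering, assuming each PPT page and the global keyword set have already been encoded as vectors; the vector size, threshold values and random demo vectors are placeholders for illustration.

```python
# Minimal sketch: keep pages similar to the previous page, drop pages far from the global keywords.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def filter_ppt_pages(page_vecs, keyword_vec, thr_adjacent=0.6, thr_keyword=0.3):
    """Return indices of PPT pages kept after both similarity checks."""
    kept = []
    for i, vec in enumerate(page_vecs):
        if i > 0 and cosine(page_vecs[i - 1], vec) <= thr_adjacent:
            continue  # not similar enough to the previous page (first threshold)
        if cosine(vec, keyword_vec) < thr_keyword:
            continue  # too far from the global keywords (second threshold)
        kept.append(i)
    return kept

page_vecs = [np.random.rand(64) for _ in range(5)]  # placeholder page embeddings
keyword_vec = np.random.rand(64)                    # placeholder global-keyword embedding
print(filter_ppt_pages(page_vecs, keyword_vec))
```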
然后，将文本信息输入深度神经网络模型，生成精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成精华片段的标题以及摘要的第二个字，如此反复，直到深度神经网络模型输出结束符，得出精华片段所对应的标题、摘要等文本信息。Then, the text information is input into the deep neural network model to generate the first word of the title and summary of the highlight clip; the generated first word is then input into the deep neural network model together with the text information to generate the second word of the title and summary, and this is repeated until the deep neural network model outputs the end token, yielding the title, summary and other text information corresponding to the highlight clip.
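A minimal sketch of the word-by-word generation loop; `toy_model`, its canned output and the end token are stand-ins for the deep neural network model described above and are purely illustrative.

```python
# Minimal sketch: feed back generated words until the model emits the end-of-sequence token.
END_TOKEN = "<eos>"

def toy_model(transcript: str, generated: list) -> str:
    """Stand-in model: emits a fixed title and then the end token."""
    canned = ["精华", "片段", "标题", END_TOKEN]
    return canned[len(generated)] if len(generated) < len(canned) else END_TOKEN

def generate_text(transcript: str, model=toy_model, max_len: int = 32) -> str:
    generated = []
    while len(generated) < max_len:
        next_word = model(transcript, generated)  # condition on transcript + words so far
        if next_word == END_TOKEN:
            break
        generated.append(next_word)
    return "".join(generated)

print(generate_text("……视频字幕文本……"))
```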
最后，将文本信息和标题一起输入深度神经网络模型，通过深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得到精华片段所对应的标签信息。Finally, the text information and the title are input into the deep neural network model together; the model calculates the weight of each word in the text information and title as well as the relevance of each word to the current video topic, obtaining the tag information corresponding to the highlight clip.
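A minimal sketch of ranking words by weight and topic relevance to pick tags; the frequency-based weight and the keyword-overlap relevance are simplifications standing in for the model-computed values described above.

```python
# Minimal sketch: score each word by weight * relevance and keep the top-k as tags.
from collections import Counter

def extract_tags(words, topic_words, top_k=5):
    counts = Counter(words)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        weight = count / total                            # stand-in for the learned word weight
        relevance = 1.0 if word in topic_words else 0.1   # stand-in for topic relevance
        scores[word] = weight * relevance
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

print(extract_tags("视频 剪辑 精华 片段 剪辑".split(), topic_words={"剪辑", "精华"}))
```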
S25：基于音频信号以及文本信息，利用多模态视频处理模型提取单人视频中的精华片段，并采用深度学习神经网络模型输出精华片段对应的标题、摘要以及标签信息，生成单人视频的短视频剪辑结果；S25: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the single-person video, use the deep learning neural network model to output the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the single-person video;
本步骤中,单人视频的处理过程具体为:In this step, the specific processing process of single-person videos is as follows:
首先，根据文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并通过设定的第一相似度阈值对图像进行筛选，输出所有相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，根据设定的第二相似度阈值对所有图像进行筛选，将所有相似度小于第二相似度阈值的图像丢弃。最后，对筛选的图像进行拼接，得到单人视频的精华片段以及精华片段在单人视频中的时间戳。其中，全局关键词是通过计算文本信息中每个词语的权重以及每个词语与当前视频主题的相关性得到；第一相似度阈值和第二相似度阈值可根据实际应用场景进行设置。First, based on the text information, the multi-modal video processing model is used to calculate the similarity between every two adjacent image frames, the frames are filtered by a set first similarity threshold, and all frames whose similarity is greater than the first similarity threshold are output; at the same time, the similarity between each frame and the global keywords is calculated, the frames are filtered by a set second similarity threshold, and all frames whose similarity is less than the second similarity threshold are discarded. Finally, the filtered frames are spliced to obtain the highlight clips of the single-person video and the timestamps of the highlight clips in the single-person video. The global keywords are obtained by calculating the weight of each word in the text information and the relevance of each word to the current video topic; the first and second similarity thresholds can be set according to the actual application scenario.
然后，将文本信息输入深度神经网络模型，生成精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成精华片段的标题以及摘要的第二个字，如此反复，直到深度神经网络模型输出结束符，得出单人视频的精华片段所对应的标题、摘要等文本信息。Then, the text information is input into the deep neural network model to generate the first word of the title and summary of the highlight clip; the generated first word is then input into the deep neural network model together with the text information to generate the second word of the title and summary, and this is repeated until the deep neural network model outputs the end token, yielding the title, summary and other text information corresponding to the highlight clips of the single-person video.
最后，将文本信息和标题一起输入深度神经网络模型，通过深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得到精华片段所对应的标签信息。Finally, the text information and the title are input into the deep neural network model together; the model calculates the weight of each word in the text information and title as well as the relevance of each word to the current video topic, obtaining the tag information corresponding to the highlight clip.
S26：基于音频信号以及文本信息，利用多模态视频处理模型提取多人视频中的精华片段，并采用深度学习神经网络模型输出精华片段对应的标题、摘要以及标签信息，生成多人视频的短视频剪辑结果；S26: Based on the audio signal and text information, use the multi-modal video processing model to extract highlight clips from the multi-person video, use the deep learning neural network model to output the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the multi-person video;
本步骤中,多人视频的处理过程具体为:In this step, the specific processing process of multi-person videos is as follows:
首先，对多人视频的音频信息进行声纹识别处理，得到多人视频中每个发声人的声纹识别匹配结果；其中，声纹识别过程为：利用噪声抑制算法提取音频信息中的有效语音，对提取的有效语音进行声纹特征提取，根据提取的声纹特征进行发声人声音建模，并输出每个发声人的声纹识别匹配结果。First, voiceprint recognition is performed on the audio information of the multi-person video to obtain a voiceprint matching result for each speaker in the multi-person video. The voiceprint recognition process is: use a noise suppression algorithm to extract the effective speech from the audio information, extract voiceprint features from the effective speech, model each speaker's voice based on the extracted voiceprint features, and output the voiceprint matching result for each speaker.
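A minimal sketch of the speaker-matching idea, assuming each speech segment has already been converted into a voiceprint embedding; the 128-dimensional random vectors and the enrolled speaker names are placeholders, not part of the original disclosure.

```python
# Minimal sketch: match a segment's voiceprint vector against enrolled speakers by cosine similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_speaker(segment_vec: np.ndarray, enrolled: dict) -> str:
    """Return the enrolled speaker whose voiceprint is closest to the segment."""
    return max(enrolled, key=lambda name: cosine(segment_vec, enrolled[name]))

enrolled = {"speaker_a": np.random.rand(128), "speaker_b": np.random.rand(128)}  # placeholder profiles
print(match_speaker(np.random.rand(128), enrolled))
```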
其次，根据视频图像和文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并通过设定的第一相似度阈值对所有图像进行筛选，输出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，根据第二相似度阈值对所有图像进行筛选，将所有相似度小于第二相似度阈值的图像丢弃。最后，对筛选后的图像进行拼接，得到多人视频的精华片段以及精华片段在多人视频中的时间戳。其中，全局关键词是通过计算文本信息中每个词语的权重以及每个词语与当前视频主题的相关性得到；第一相似度阈值和第二相似度阈值可根据实际应用场景进行设置。Secondly, based on the video images and the text information, the multi-modal video processing model is used to calculate the similarity between every two adjacent image frames, all frames are filtered by a set first similarity threshold, and frames whose similarity is greater than the first similarity threshold are output; at the same time, the similarity between each frame and the global keywords is calculated, all frames are filtered by the second similarity threshold, and frames whose similarity is less than the second similarity threshold are discarded. Finally, the filtered frames are spliced to obtain the highlight clips of the multi-person video and the timestamps of the highlight clips in the multi-person video. The global keywords are obtained by calculating the weight of each word in the text information and the relevance of each word to the current video topic; the first and second similarity thresholds can be set according to the actual application scenario.
然后，将文本信息输入深度神经网络模型，生成精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成精华片段的标题以及摘要的第二个字，如此反复，直到深度神经网络模型输出结束符，得出多人视频的精华片段所对应的标题、摘要等文本信息。Then, the text information is input into the deep neural network model to generate the first word of the title and summary of the highlight clip; the generated first word is then input into the deep neural network model together with the text information to generate the second word of the title and summary, and this is repeated until the deep neural network model outputs the end token, yielding the title, summary and other text information corresponding to the highlight clips of the multi-person video.
最后，将文本信息和标题一起输入深度神经网络模型，通过深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得出多人视频的精华片段所对应的标签信息。Finally, the text information and the title are input into the deep neural network model together; the model calculates the weight of each word in the text information and title as well as the relevance of each word to the current video topic, obtaining the tag information corresponding to the highlight clips of the multi-person video.
基于上述，本申请实施例的视频内容处理方法采用多模态视频内容处理技术，通过提取待处理视频中的音频信号以及文本信息，基于音频信号以及文本信息，利用多模态视频处理模型以及深度学习神经网络模型获取待处理视频中的精华片段以及精华片段对应的标题、摘要以及标签信息。本申请采用全AI(Artificial Intelligence,人工智能)处理流程，可以一键生成多个精剪短视频，大大提升了剪辑效率，缩短视频制作周期。本申请可支持智能关键词精准生成，保证画质清晰、剪辑节奏流畅、内容紧跟亮点等，具有高拓展性，应用范围广，可赋能社会生活中互联网泛娱乐、在线教育、协同办公等各个场景。Based on the above, the video content processing method of the embodiments of the present application adopts multi-modal video content processing technology: the audio signal and text information are extracted from the video to be processed, and, based on them, a multi-modal video processing model and a deep learning neural network model are used to obtain the highlight clips in the video to be processed together with the corresponding titles, summaries and tag information. The present application adopts a fully AI (Artificial Intelligence) driven processing pipeline that can generate multiple finely edited short videos with one click, greatly improving editing efficiency and shortening the video production cycle. The present application supports accurate intelligent keyword generation, ensures clear picture quality, smooth editing rhythm and content that closely follows the highlights, is highly scalable with a wide application range, and can empower scenarios in social life such as Internet pan-entertainment, online education and collaborative office work.
在一个可选的实施方式中,还可以:将所述的视频内容处理方法的结果上传至区块链中。In an optional implementation, the result of the video content processing method can also be uploaded to the blockchain.
具体地，基于所述的视频内容处理方法的结果得到对应的摘要信息，具体来说，摘要信息由所述的视频内容处理方法的结果进行散列处理得到，比如利用sha256s算法处理得到。将摘要信息上传至区块链可保证其安全性和对用户的公正透明性。用户可以从区块链中下载得该摘要信息，以便查证所述的视频内容处理方法的结果是否被篡改。本示例所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链（Blockchain），本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性（防伪）和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。Specifically, corresponding digest information is obtained based on the result of the video content processing method; that is, the digest information is obtained by hashing the result of the video content processing method, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user can download the digest information from the blockchain to verify whether the result of the video content processing method has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated and linked using cryptographic methods; each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer and the like.
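A minimal sketch of producing the digest with SHA-256 before it is uploaded; the result dictionary is a placeholder and the on-chain upload itself is outside the scope of this example.

```python
# Minimal sketch: hash the clipping result (serialised as JSON) with SHA-256.
import hashlib
import json

def summarize_result(result: dict) -> str:
    """Return the SHA-256 hex digest of the clipping result."""
    payload = json.dumps(result, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

digest = summarize_result({"title": "精华片段示例", "timestamp": [12.5, 86.0]})  # placeholder result
print(digest)
```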
请参阅图3,是本申请实施例视频内容处理系统的结构示意图。本申请实施例视频内容处理系统40包括:Please refer to Figure 3, which is a schematic structural diagram of a video content processing system according to an embodiment of the present application. The video content processing system 40 in this embodiment of the present application includes:
多模态信息提取模块41:用于提取待处理视频中的音频信号以及文本信息;Multi-modal information extraction module 41: used to extract audio signals and text information in the video to be processed;
视频类型判断模块42:用于提取待处理视频中的视频图像,对视频图像进行图像分析,判断待处理视频的视频类型;视频类型包括PPT视频、单人视频以及多人视频;Video type judgment module 42: used to extract video images in the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; video types include PPT videos, single-person videos, and multi-person videos;
视频剪辑模块43：用于基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。Video editing module 43: used to extract, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed by using a deep neural network model, and generate a short-video clipping result of the video to be processed.
请参阅图4,为本申请实施例的终端结构示意图。该终端50包括处理器51、与处理器51耦接的存储器52。Please refer to Figure 4, which is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51 .
存储器52存储有用于实现上述视频内容处理方法的程序指令。The memory 52 stores program instructions for implementing the above video content processing method.
处理器51用于执行存储器52存储的程序指令以执行视频内容处理操作。The processor 51 is configured to execute program instructions stored in the memory 52 to perform video content processing operations.
其中,处理器51还可以称为CPU(Central Processing Unit,中央处理单元)。处理器51可能是一种集成电路芯片,具有信号的处理能力。处理器51还可以是通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现成可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The processor 51 may also be called a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component . A general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
请参阅图5,图5为本申请实施例的存储介质的结构示意图。本申请实施例的存储介质存储有能够实现上述所有方法的程序文件61,其中,该程序文件61可以以软件产品的形式存储在上述存储介质中,所述读存储介质可以是非易失性,也可以是易失性,其包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质,或者是计算机、服务器、手机、平板等终端设备。Please refer to FIG. 5 , which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium in the embodiment of the present application stores program files 61 that can implement all the above methods. The program files 61 can be stored in the above storage medium in the form of software products. The read storage medium can be non-volatile or non-volatile. It may be volatile and include a number of instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the various implementation methods of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Various media that can store program code, such as Memory), magnetic disks or optical disks, or terminal devices such as computers, servers, mobile phones, and tablets.
在本申请所提供的几个实施例中，应该理解到，所揭露的系统，装置和方法，可以通过其它的方式实现。例如，以上所描述的系统实施例仅仅是示意性的，例如，单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口，装置或单元的间接耦合或通信连接，可以是电性，机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into units is only a logical functional division, and other divisions are possible in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。以上仅为本申请的实施方式，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated units may be implemented in the form of hardware or in the form of software functional units. The above are merely embodiments of the present application and do not limit the patent scope of the present application; any equivalent structural or process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

1.一种视频内容处理方法,其中,包括:1. A video content processing method, which includes:
提取待处理视频中的音频信号以及文本信息;Extract audio signals and text information from the video to be processed;
提取所述待处理视频中的视频图像,对所述视频图像进行图像分析,判断所述待处理视频的视频类型;所述视频类型包括PPT视频、单人视频以及多人视频;Extract video images in the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos, and multi-person videos;
基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息，生成所述待处理视频的短视频剪辑结果。Based on the audio signal and text information, use a multi-modal video processing model to extract highlight clips from the different types of videos to be processed, use a deep neural network model to extract the titles, summaries and tag information corresponding to the highlight clips, and generate a short-video clipping result of the video to be processed.
2.根据权利要求1所述的视频内容处理方法,其中,所述提取待处理视频中的音频信号以及文本信息包括:2. The video content processing method according to claim 1, wherein the extracting audio signals and text information in the video to be processed includes:
将所述待处理视频输入开源框架,通过所述开源框架输出待处理视频的音频信号;Input the video to be processed into an open source framework, and output the audio signal of the video to be processed through the open source framework;
对所述音频信号进行语音转文字处理,生成待处理视频的文本信息。Perform speech-to-text processing on the audio signal to generate text information of the video to be processed.
3.根据权利要求2所述的视频内容处理方法,其中,所述对音频信号进行语音转文字处理具体为:3. The video content processing method according to claim 2, wherein the speech-to-text processing of the audio signal is specifically:
对所述音频信号进行语音特征提取,将所述语音特征输入训练好的声学模型,通过所述声学模型输出对应的概率得分;Extract speech features from the audio signal, input the speech features into a trained acoustic model, and output the corresponding probability score through the acoustic model;
基于所述声学模型的输出结果,根据搜索和匹配策略从训练好的语言模型中搜索出与所述音频信号相匹配的文本,输出所述待处理视频的文本信息识别结果。Based on the output result of the acoustic model, the text matching the audio signal is searched from the trained language model according to the search and matching strategy, and the text information recognition result of the video to be processed is output.
4.根据权利要求1所述的视频内容处理方法,其中,所述判断所述待处理视频的视频类型具体为:4. The video content processing method according to claim 1, wherein the determining the video type of the video to be processed is specifically:
将提取的视频图像输入开源框架,通过所述开源框架抽取所述视频图像的帧画面;Enter the extracted video image into an open source framework, and extract frames of the video image through the open source framework;
对每一幅帧画面进行分类,得到所述待处理视频的视频类型。Classify each frame to obtain the video type of the video to be processed.
5.根据权利要求1至4任一项所述的视频内容处理方法，其中，当所述待处理视频为PPT视频时，所述基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：5. The video content processing method according to any one of claims 1 to 4, wherein, when the video to be processed is a PPT video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
提取所述PPT视频中的PPT文字信息;Extract PPT text information from the PPT video;
基于所述文本信息和PPT文字信息，利用多模态视频处理模型计算相邻两页PPT页面的相似度，并筛选出相似度大于第一相似度阈值的PPT页面；同时，计算每一页PPT页面与设定的全局关键词的相似度，将相似度小于设定的第二相似度阈值的PPT页面丢弃；Based on the text information and the PPT text information, use the multi-modal video processing model to calculate the similarity between every two adjacent PPT pages and keep the PPT pages whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each PPT page and the set global keywords, and discard the PPT pages whose similarity is less than the set second similarity threshold;
对筛选后的PPT页面进行拼接,得到所述PPT视频的精华片段以及所述精华片段在PPT视频中的时间戳;Splice the filtered PPT pages to obtain the essence of the PPT video and the timestamp of the essence in the PPT video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,再将生成的第一个字和文本信息一起输入深度神经网络模型,生成精华片段的标题以及摘要的第二个字;重复上述过程,得到所述精华片段对应的标题、摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are then input into the deep neural network model to generate the title of the essence segment and The second word of the summary; repeat the above process to obtain the title and summary information corresponding to the essence fragment;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算所述每个词语与所述PPT视频主题的相关性,得到所述精华片段对应的标签信息。Enter the text information and title together into a deep neural network model, calculate the weight of each word in the text information and title through the deep neural network model, and calculate the correlation between each word and the PPT video theme, Obtain the label information corresponding to the essence fragment.
6.根据权利要求5所述的视频内容处理方法,其中,当所述待处理视频为单人视频时,所述基于音频信号以及文本信息,利用多模态视频处理模型提取不同类型的待处理视频中的精华片段,并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为:6. The video content processing method according to claim 5, wherein when the video to be processed is a single-person video, the multi-modal video processing model is used to extract different types of video to be processed based on audio signals and text information. Essence clips in the video, and a deep neural network model is used to extract the title, summary and label information corresponding to the highlights clips, specifically as follows:
根据所述文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；According to the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述单人视频的精华片段以及所述精华片段在单人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the single-person video and the timestamps of the highlight clips in the single-person video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,将生成的第一个字和文本信息一起输入深度神经网络模型,生成所述精华片段的标题以及摘要的第二个字,重复上述过程,得到所述单人视频的精华片段对应的标题和摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are input into the deep neural network model together to generate the title of the essence segment. and the second word of the summary, repeat the above process to obtain the title and summary information corresponding to the highlights of the single video;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算每个词语与所述单人视频主题的相关性,得到所述精华片段的标签信息。The text information and the title are input into the deep neural network model together, and the weight of each word in the text information and title is calculated through the deep neural network model, and the correlation between each word and the single-person video topic is calculated, and we get Tag information of the essence fragment.
7.根据权利要求6所述的视频内容处理方法，其中，所述当所述待处理视频为多人视频时，所述基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：7. The video content processing method according to claim 6, wherein, when the video to be processed is a multi-person video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
对所述多人视频的音频信息进行声纹识别处理,得到所述多人视频中每个发声人的声纹识别匹配结果;Perform voiceprint recognition processing on the audio information of the multi-person video to obtain the voiceprint recognition matching results of each speaker in the multi-person video;
基于所述视频图像和文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；Based on the video images and the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述多人视频的精华片段以及精华片段在多人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the multi-person video and the timestamps of the highlight clips in the multi-person video;
将所述文本信息输入深度神经网络模型，生成所述精华片段的标题以及摘要的第一个字，再将生成的第一个字和文本信息一起输入深度神经网络模型，生成所述精华片段的标题以及摘要的第二个字，重复上述过程，得到所述多人视频的精华片段对应的标题和摘要信息；Input the text information into the deep neural network model to generate the first word of the title and summary of the highlight clip, then input the generated first word together with the text information into the deep neural network model to generate the second word of the title and summary, and repeat this process to obtain the title and summary information corresponding to the highlight clips of the multi-person video;
将所述文本信息和标题一起输入深度神经网络模型，通过所述深度神经网络模型计算文本信息和标题中每个词语的权重，并计算每个词语与当前视频主题的相关性，得出多人视频的精华片段的标签信息。Input the text information and the title together into the deep neural network model, calculate, through the deep neural network model, the weight of each word in the text information and title and the relevance of each word to the current video topic, and obtain the tag information of the highlight clips of the multi-person video.
8.一种视频内容处理系统,其中,包括:8. A video content processing system, including:
多模态信息提取模块:用于提取待处理视频中的音频信号以及文本信息;Multi-modal information extraction module: used to extract audio signals and text information from the video to be processed;
视频类型判断模块：用于提取所述待处理视频中的视频图像，对视频图像进行图像分析，判断所述待处理视频的视频类型；所述视频类型包括PPT视频、单人视频以及多人视频；Video type judgment module: used to extract video images from the video to be processed, perform image analysis on the video images, and determine the video type of the video to be processed; the video types include PPT videos, single-person videos and multi-person videos;
视频剪辑模块：用于基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取不同类型的待处理视频的精华片段对应的标题、摘要以及标签信息，生成待处理视频的短视频剪辑结果。Video editing module: used to extract, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extract the titles, summaries and tag information corresponding to the highlight clips of the different types of videos to be processed by using a deep neural network model, and generate a short-video clipping result of the video to be processed.
9.一种终端，其中，所述终端包括处理器、与所述处理器耦接的存储器，所述存储器中存储有计算机可读指令，所述计算机可读指令被所述处理器执行时，使得所述处理器执行如下步骤：提取待处理视频中的音频信号以及文本信息；提取所述待处理视频中的视频图像，对所述视频图像进行图像分析，判断所述待处理视频的视频类型；所述视频类型包括PPT视频、单人视频以及多人视频；基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息，生成所述待处理视频的短视频剪辑结果。9. A terminal, wherein the terminal comprises a processor and a memory coupled to the processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to perform the following steps: extracting an audio signal and text information from a video to be processed; extracting video images from the video to be processed, performing image analysis on the video images, and determining the video type of the video to be processed, the video types including a PPT video, a single-person video and a multi-person video; and, based on the audio signal and text information, extracting highlight clips from the different types of videos to be processed by using a multi-modal video processing model, extracting the titles, summaries and tag information corresponding to the highlight clips by using a deep neural network model, and generating a short-video clipping result of the video to be processed.
10.根据权利要求9所述的终端,其中,所述提取待处理视频中的音频信号以及文本信息包括:10. The terminal according to claim 9, wherein the extracting audio signals and text information in the video to be processed includes:
将所述待处理视频输入开源框架,通过所述开源框架输出待处理视频的音频信号;Input the video to be processed into an open source framework, and output the audio signal of the video to be processed through the open source framework;
对所述音频信号进行语音转文字处理,生成待处理视频的文本信息。Perform speech-to-text processing on the audio signal to generate text information of the video to be processed.
11.根据权利要求10所述的终端,其中,所述对音频信号进行语音转文字处理具体为:11. The terminal according to claim 10, wherein the speech-to-text processing of the audio signal is specifically:
对所述音频信号进行语音特征提取,将所述语音特征输入训练好的声学模型,通过所述声学模型输出对应的概率得分;Extract speech features from the audio signal, input the speech features into a trained acoustic model, and output the corresponding probability score through the acoustic model;
基于所述声学模型的输出结果,根据搜索和匹配策略从训练好的语言模型中搜索出与所述音频信号相匹配的文本,输出所述待处理视频的文本信息识别结果。Based on the output result of the acoustic model, the text matching the audio signal is searched from the trained language model according to the search and matching strategy, and the text information recognition result of the video to be processed is output.
12.根据权利要求9至11任一项所述的终端，其中，当所述待处理视频为PPT视频时，所述基于所述音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：12. The terminal according to any one of claims 9 to 11, wherein, when the video to be processed is a PPT video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
提取所述PPT视频中的PPT文字信息;Extract PPT text information from the PPT video;
基于所述文本信息和PPT文字信息，利用多模态视频处理模型计算相邻两页PPT页面的相似度，并筛选出相似度大于第一相似度阈值的PPT页面；同时，计算每一页PPT页面与设定的全局关键词的相似度，将相似度小于设定的第二相似度阈值的PPT页面丢弃；Based on the text information and the PPT text information, use the multi-modal video processing model to calculate the similarity between every two adjacent PPT pages and keep the PPT pages whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each PPT page and the set global keywords, and discard the PPT pages whose similarity is less than the set second similarity threshold;
对筛选后的PPT页面进行拼接,得到所述PPT视频的精华片段以及所述精华片段在PPT视频中的时间戳;Splice the filtered PPT pages to obtain the essence of the PPT video and the timestamp of the essence in the PPT video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,再将生成的第一个字和文本信息一起输入深度神经网络模型,生成精华片段的标题以及摘要的第二个字;重复上述过程,得到所述精华片段对应的标题、摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are then input into the deep neural network model to generate the title of the essence segment and The second word of the summary; repeat the above process to obtain the title and summary information corresponding to the essence fragment;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算所述每个词语与所述PPT视频主题的相关性,得到所述精华片段对应的标签信息。Enter the text information and title together into a deep neural network model, calculate the weight of each word in the text information and title through the deep neural network model, and calculate the correlation between each word and the PPT video theme, Obtain the label information corresponding to the essence fragment.
13.根据权利要求12所述的终端，其中，当所述待处理视频为单人视频时，所述基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：13. The terminal according to claim 12, wherein, when the video to be processed is a single-person video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
根据所述文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，并筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；According to the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述单人视频的精华片段以及所述精华片段在单人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the single-person video and the timestamps of the highlight clips in the single-person video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,将生成的第一个字和文本信息一起输入深度神经网络模型,生成所述精华片段的标题以及摘要的第二个字,重复上述过程,得到所述单人视频的精华片段对应的标题和摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are input into the deep neural network model together to generate the title of the essence segment. and the second word of the summary, repeat the above process to obtain the title and summary information corresponding to the highlights of the single video;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算每个词语与所述单人视频主题的相关性,得到所述精华片段的标签信息。The text information and the title are input into the deep neural network model together, and the weight of each word in the text information and title is calculated through the deep neural network model, and the correlation between each word and the single-person video topic is calculated, and we get Tag information of the essence fragment.
14.根据权利要求13所述的终端，其中，所述当所述待处理视频为多人视频时，所述基于音频信号以及文本信息，利用多模态视频处理模型提取不同类型的待处理视频中的精华片段，并采用深度神经网络模型提取所述精华片段对应的标题、摘要以及标签信息具体为：14. The terminal according to claim 13, wherein, when the video to be processed is a multi-person video, extracting, based on the audio signal and text information, highlight clips from the different types of videos to be processed by using the multi-modal video processing model, and extracting the titles, summaries and tag information corresponding to the highlight clips by using the deep neural network model is specifically as follows:
对所述多人视频的音频信息进行声纹识别处理,得到所述多人视频中每个发声人的声纹识别匹配结果;Perform voiceprint recognition processing on the audio information of the multi-person video to obtain the voiceprint recognition matching results of each speaker in the multi-person video;
基于所述视频图像和文本信息，利用多模态视频处理模型计算相邻两帧图像的相似度，筛选出相似度大于第一相似度阈值的图像；同时，计算每一帧图像与全局关键词的相似度，将相似度小于第二相似度阈值的图像丢弃；Based on the video images and the text information, use the multi-modal video processing model to calculate the similarity between every two adjacent image frames and keep the frames whose similarity is greater than the first similarity threshold; at the same time, calculate the similarity between each frame and the global keywords, and discard the frames whose similarity is less than the second similarity threshold;
对筛选后的图像进行拼接,得到所述多人视频的精华片段以及精华片段在多人视频中的时间戳;Splice the filtered images to obtain the highlight clips of the multi-person video and the timestamps of the highlight clips in the multi-person video;
将所述文本信息输入深度神经网络模型,生成所述精华片段的标题以及摘要的第一个字,再将生成的第一个字和文本信息一起输入深度神经网络模型,生成所述精华片段的标题以及摘要的第二个字,重复上述过程,得到所述多人视频的精华片段对应的标题和摘要信息;The text information is input into the deep neural network model to generate the title of the essence segment and the first word of the summary. The generated first word and the text information are then input into the deep neural network model to generate the title of the essence segment. For the title and the second word of the summary, repeat the above process to obtain the title and summary information corresponding to the highlight clips of the multi-person video;
将所述文本信息和标题一起输入深度神经网络模型,通过所述深度神经网络模型计算文本信息和标题中每个词语的权重,并计算每个词语与当前视频主题的相关性,得到所述多人视频的精华片段的标签信息。The text information and the title are input into the deep neural network model together, and the weight of each word in the text information and title is calculated through the deep neural network model, and the correlation between each word and the current video topic is calculated to obtain the multiple Label information of the highlight clips of people’s videos.
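As an illustration of the voiceprint matching step in claim 14 above, speaker segments could be matched against enrolled voiceprints by cosine similarity of speaker embeddings; the embedding extractor is left abstract because the claim does not name one, and the threshold is an assumption.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_speakers(segment_embeddings, enrolled, threshold=0.7):
    """segment_embeddings: list of (start, end, vector); enrolled: {name: vector}.
    Returns one (start, end, matched-name-or-'unknown', score) tuple per segment."""
    results = []
    for start, end, emb in segment_embeddings:
        name, score = max(((n, cosine(emb, v)) for n, v in enrolled.items()),
                          key=lambda x: x[1])
        results.append((start, end, name if score >= threshold else "unknown", score))
    return results
```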
15. A storage medium storing a program file capable of implementing the following steps: extracting an audio signal and text information from a video to be processed; extracting video images from the video to be processed, and performing image analysis on the video images to determine the video type of the video to be processed, the video types including PPT video, single-person video and multi-person video; and, based on the audio signal and the text information, extracting highlight segments from different types of videos to be processed using a multi-modal video processing model, extracting the title, summary and tag information corresponding to the highlight segments using a deep neural network model, and generating a short-video clipping result of the video to be processed.
16. The storage medium according to claim 15, wherein the extracting of the audio signal and the text information from the video to be processed comprises:
inputting the video to be processed into an open-source framework, and outputting the audio signal of the video to be processed through the open-source framework;
performing speech-to-text processing on the audio signal to generate the text information of the video to be processed.
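For illustration, the audio signal could be separated from the video with an open-source framework as recited in claim 16; the claim does not name a framework, so the ffmpeg command-line tool is used here purely as an example, and the 16 kHz mono PCM output is a common speech-recognition input format rather than a requirement of the claim.

```python
import subprocess

def extract_audio(video_path, wav_path, sample_rate=16000):
    # Assumes ffmpeg is installed on the host; writes mono 16-bit PCM audio.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-acodec", "pcm_s16le", "-ar", str(sample_rate), "-ac", "1", wav_path],
        check=True,
    )
```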
17. The storage medium according to claim 16, wherein the speech-to-text processing of the audio signal specifically comprises:
extracting speech features from the audio signal, inputting the speech features into a trained acoustic model, and outputting corresponding probability scores through the acoustic model;
based on the output of the acoustic model, searching a trained language model, according to a search-and-matching strategy, for the text that matches the audio signal, and outputting the text information recognition result of the video to be processed.
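To make the search-and-matching step of claim 17 concrete, a simple beam search that combines acoustic-model scores with language-model scores is sketched below; the claim does not fix a particular strategy, so both models are passed in as abstract scoring functions.

```python
def decode(acoustic_scores, lm_score, beam_width=3):
    """acoustic_scores: one dict {word: log-prob} per time step from the acoustic
    model; lm_score(prev_word, word) -> log-prob from the language model.
    Returns the highest-scoring word sequence under a toy beam search."""
    beams = [([], 0.0)]
    for step in acoustic_scores:
        candidates = []
        for words, score in beams:
            prev = words[-1] if words else "<s>"
            for word, am in step.items():
                candidates.append((words + [word], score + am + lm_score(prev, word)))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0]
```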
18. The storage medium according to any one of claims 15 to 17, wherein, when the video to be processed is a PPT video, the extracting, based on the audio signal and the text information, of highlight segments from different types of videos to be processed using the multi-modal video processing model, and the extracting of the title, summary and tag information corresponding to the highlight segments using the deep neural network model specifically comprise:
extracting the PPT text information from the PPT video;
calculating, based on the text information and the PPT text information, the similarity between every two adjacent PPT pages using the multi-modal video processing model, and retaining the PPT pages whose similarity is greater than the first similarity threshold; at the same time, calculating the similarity between each PPT page and the set global keywords, and discarding PPT pages whose similarity is less than the set second similarity threshold;
splicing the retained PPT pages to obtain a highlight segment of the PPT video and the timestamp of the highlight segment within the PPT video;
inputting the text information into the deep neural network model to generate the first word of the title and of the summary of the highlight segment, then inputting the generated first word together with the text information into the deep neural network model to generate the second word of the title and of the summary, and repeating this process to obtain the title and summary information corresponding to the highlight segment;
inputting the text information together with the title into the deep neural network model, calculating the weight of each word in the text information and the title by means of the deep neural network model, and calculating the relevance of each word to the topic of the PPT video, so as to obtain the tag information corresponding to the highlight segment.
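The word-by-word title and summary generation loop recited in claims 13, 14 and 18 can be illustrated by the greedy sketch below; the generation model is left abstract (its next_word interface is an assumption, not a real library call), since the claims describe the loop only functionally.

```python
def generate(model, text, max_len=30, eos="</s>"):
    """Greedy, word-by-word generation: the model is repeatedly fed the source
    text plus everything generated so far and returns the next word."""
    out = []
    for _ in range(max_len):
        word = model.next_word(text, out)   # assumed interface of a seq2seq model
        if word == eos:
            break
        out.append(word)
    return " ".join(out)
```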
19. The storage medium according to claim 18, wherein, when the video to be processed is a single-person video, the extracting, based on the audio signal and the text information, of highlight segments from different types of videos to be processed using the multi-modal video processing model, and the extracting of the title, summary and tag information corresponding to the highlight segments using the deep neural network model specifically comprise:
calculating, according to the text information, the similarity between every two adjacent frames using the multi-modal video processing model, and retaining the frames whose similarity is greater than the first similarity threshold; at the same time, calculating the similarity between each frame and the global keywords, and discarding frames whose similarity is less than the second similarity threshold;
splicing the retained frames to obtain a highlight segment of the single-person video and the timestamp of the highlight segment within the single-person video;
inputting the text information into the deep neural network model to generate the first word of the title and of the summary of the highlight segment, then inputting the generated first word together with the text information into the deep neural network model to generate the second word of the title and of the summary, and repeating this process to obtain the title and summary information corresponding to the highlight segment of the single-person video;
inputting the text information together with the title into the deep neural network model, calculating the weight of each word in the text information and the title by means of the deep neural network model, and calculating the relevance of each word to the topic of the single-person video, so as to obtain the tag information of the highlight segment.
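As a toy stand-in for the tag step that closes claims 13, 18 and 19, the sketch below weights each word by frequency (title words counted extra) and scales the weight by a caller-supplied topic-relevance score in [0, 1]; in the claims both quantities come from the deep neural network model, so this is only an illustration of how the two scores combine.

```python
from collections import Counter

def extract_tags(text_words, title_words, topic_relevance, top_k=5):
    """Rank candidate tags by word weight multiplied by topic relevance."""
    weights = Counter(text_words)
    for w in title_words:
        weights[w] += 2            # assumed extra weight for words in the title
    scored = {w: c * topic_relevance(w) for w, c in weights.items()}
    return [w for w, _ in sorted(scored.items(), key=lambda x: x[1], reverse=True)[:top_k]]
```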
20. The storage medium according to claim 19, wherein, when the video to be processed is a multi-person video, the extracting, based on the audio signal and the text information, of highlight segments from different types of videos to be processed using the multi-modal video processing model, and the extracting of the title, summary and tag information corresponding to the highlight segments using the deep neural network model specifically comprise:
performing voiceprint recognition on the audio information of the multi-person video to obtain a voiceprint matching result for each speaker in the multi-person video;
calculating, based on the video images and the text information, the similarity between every two adjacent frames using the multi-modal video processing model, and retaining the frames whose similarity is greater than the first similarity threshold; at the same time, calculating the similarity between each frame and the global keywords, and discarding frames whose similarity is less than the second similarity threshold;
splicing the retained frames to obtain a highlight segment of the multi-person video and the timestamp of the highlight segment within the multi-person video;
inputting the text information into the deep neural network model to generate the first word of the title and of the summary of the highlight segment, then inputting the generated first word together with the text information into the deep neural network model to generate the second word of the title and of the summary, and repeating this process to obtain the title and summary information corresponding to the highlight segment of the multi-person video;
inputting the text information together with the title into the deep neural network model, calculating the weight of each word in the text information and the title by means of the deep neural network model, and calculating the relevance of each word to the topic of the current video, so as to obtain the tag information of the highlight segment of the multi-person video.
 
PCT/CN2022/089559 2022-03-16 2022-04-27 Video content processing method and system, and terminal and storage medium WO2023173539A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210259504.XA CN114598933B (en) 2022-03-16 2022-03-16 Video content processing method, system, terminal and storage medium
CN202210259504.X 2022-03-16

Publications (1)

Publication Number Publication Date
WO2023173539A1 true WO2023173539A1 (en) 2023-09-21

Family

ID=81808756

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089559 WO2023173539A1 (en) 2022-03-16 2022-04-27 Video content processing method and system, and terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114598933B (en)
WO (1) WO2023173539A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment
CN116453023B (en) * 2023-04-23 2024-01-26 上海帜讯信息技术股份有限公司 Video abstraction system, method, electronic equipment and medium for 5G rich media information

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11220689A (en) * 1998-01-31 1999-08-10 Media Link System:Kk Video software processor and medium for storing its program
CN108401192A (en) * 2018-04-25 2018-08-14 腾讯科技(深圳)有限公司 Video stream processing method, device, computer equipment and storage medium
US20200084519A1 (en) * 2018-09-07 2020-03-12 Oath Inc. Systems and Methods for Multimodal Multilabel Tagging of Video
CN112004111A (en) * 2020-09-01 2020-11-27 南京烽火星空通信发展有限公司 News video information extraction method for global deep learning
CN112532897A (en) * 2020-11-25 2021-03-19 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN112464814A (en) * 2020-11-27 2021-03-09 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN112818906A (en) * 2021-02-22 2021-05-18 浙江传媒学院 Intelligent full-media news cataloging method based on multi-mode information fusion understanding

Also Published As

Publication number Publication date
CN114598933B (en) 2022-12-27
CN114598933A (en) 2022-06-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931566

Country of ref document: EP

Kind code of ref document: A1