CN114385859A - Multi-modal retrieval method for video content - Google Patents

Multi-modal retrieval method for video content

Info

Publication number
CN114385859A
CN114385859A
Authority
CN
China
Prior art keywords
video
image
name
audio
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111631648.5A
Other languages
Chinese (zh)
Inventor
张华平
商建云
孙婧婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202111631648.5A priority Critical patent/CN114385859A/en
Publication of CN114385859A publication Critical patent/CN114385859A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/73: Querying video data
    • G06F 16/735: Filtering based on additional data, e.g. user or group profiles
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a multi-modal retrieval method for video content, belonging to the technical field of multimedia analysis and processing. The method extracts the multi-modal features in video data and in retrieval data, converts them into text features, and then retrieves and locates video content by searching those text features. The method makes full use of the multi-modal features in video content, supports retrieval with multi-modal data, lets a user search video content in various ways, and makes retrieval of unlabeled video content more accurate and comprehensive. It greatly facilitates quickly finding content of interest, in various forms, in massive video data; it works on massive unlabeled video, saves the time of manually watching videos for screening, can be used for screening sensitive video content, locking onto a target person, and the like, and realizes efficient use of the data.

Description

Multi-modal retrieval method for video content
Technical Field
The invention relates to a multi-modal retrieval method for video content, and belongs to the technical field of multimedia analysis and processing.
Background
Video, as a storage format for moving images, can be transmitted over the network at extremely high speed. With people's growing dependence on networks, the number of videos generated and consumed every day keeps increasing, which means the cost of finding the content one is interested in among massive videos also keeps increasing. How to search video content more efficiently has therefore become an important research topic.
Many video websites already provide video content search, but most of them rely on text features such as video titles and content keywords. The multi-modal features contained in the video content itself are not fully utilized, so users' needs for local, fine-grained search within video content cannot be met.
Disclosure of Invention
The invention aims to creatively provide a multi-modal retrieval method for video content, addressing the technical problems that the current volume of video is large and its content complex, that existing video retrieval technology does not make full use of the multi-modal features in video content, and that local retrieval within video content is not possible. The method extracts the multi-modal features in video data and in retrieval data, converts them into text features, and then retrieves and locates video content by searching those text features, thereby realizing efficient video content retrieval and meeting users' needs for local retrieval within video content.
The innovation of the method lies in the following: based on the multi-modal features in video content, the multi-modal features in the data are expressed as text features by means of text recognition, speech recognition, image recognition and the like, so that multi-modal retrieval and localization of video content is realized. The data includes video data and multi-modal retrieval data; the retrieval data types include text, speech and image. The multi-modal features include audio features, text features and image features.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
1. The method makes full use of the multi-modal features in video content, supports retrieval with multi-modal data, lets a user search video content in various ways, and makes retrieval of unlabeled video content more accurate and comprehensive.
2. The method greatly facilitates quickly finding content of interest, in various forms, in massive video data; it works on massive unlabeled video, saves the time of manually watching videos for screening, can be used for screening sensitive video content, locking onto a target person, and the like, and realizes efficient use of the data.
The method can be applied in many fields, such as multimedia analysis and processing, information retrieval, and internet data analysis and mining.
Drawings
FIG. 1 is a system architecture of the method of the present invention;
FIG. 2 is a process flow of the method of the present invention;
FIG. 3 is the video data preprocessing flow of step 1 of the method of the present invention and of embodiment 1;
FIG. 4 is the video feature extraction flow of step 2 of the method of the present invention and of embodiment 1;
FIG. 5 is the video feature conversion flow of step 3 of the method of the present invention and of embodiment 1;
FIG. 6 is the retrieval data feature conversion flow of step 4 of the method of the present invention and of embodiment 1;
FIG. 7 is the flow of retrieving the videos and the nodes where the corresponding video content is located, per step 5 of the method of the present invention and embodiment 1.
Detailed Description
The following detailed description of embodiments of the present invention will be made with reference to the accompanying drawings and examples. It should be understood that the specific examples are for purposes of illustration only and are not intended to limit the invention.
The invention is realized by the following technical scheme.
A multi-modal retrieval method for video content comprises the following steps:
step 1: preprocess the video data.
The video to be analyzed and retrieved is preprocessed according to its multi-modal characteristics, as follows:
First, a "video-subtitle table", a "video-audio table" and a "video-image table" are created, and the required fields are set in the tables.
The video-subtitle table comprises a video name field, a subtitle picture name field and a subtitle capture time field. The video name field stores the name of the video to be preprocessed, the subtitle picture name field stores the name of the subtitle picture obtained after preprocessing, and the subtitle capture time field stores the video time at which the subtitle was captured; the unit can be seconds.
The video-audio table comprises a video name field, an audio name field and an audio segment start time field. The video name field stores the name of the video to be preprocessed, the audio name field stores the name of the audio obtained after preprocessing, and the audio segment start time field stores the video time at which the audio segment was cut; the unit can be seconds.
The video-image table comprises a video name field, an image name field and a video segment start time field. The video name field stores the name of the video to be preprocessed, and the image name field stores the name of the image obtained after preprocessing; for convenience, the images may be named using their corresponding frame numbers. The video segment start time field stores the video time at which the video segment was cut; the unit can be seconds. A sketch of these three tables follows.
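As an illustration, the three tables can be sketched as follows. The method does not name a storage engine, so SQLite is an assumption here; the field names follow Tables 1 to 3 of the embodiment below.

```python
# Minimal sketch of the three preprocessing tables, assuming SQLite as the
# store; field names follow Tables 1-3 of the embodiment.
import sqlite3

conn = sqlite3.connect("video_retrieval.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS video_subtitle (  -- Table 1: video-subtitle table
    Video_name    VARCHAR(100),  -- name of the video to be preprocessed
    Subtitle_name VARCHAR(100),  -- captured subtitle picture name
    Time          INT            -- subtitle capture time, in seconds
);
CREATE TABLE IF NOT EXISTS video_audio (     -- Table 2: video-audio table
    Video_name VARCHAR(100),
    Voice_name VARCHAR(100),     -- extracted audio file name
    Time       INT               -- start time of the audio segment, in seconds
);
CREATE TABLE IF NOT EXISTS video_image (     -- Table 3: video-image table
    Video_name VARCHAR(100),
    Frame_name VARCHAR(100),     -- captured image frame name
    Time       INT               -- start time of the video segment, in seconds
);
""")
conn.commit()
```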
Then, to extract the subtitle, audio and image information contained in the videos, each video is processed in a targeted manner. Meanwhile, to enable the weight calculation of the subsequent retrieval, the video must be segmented, where the time span of image capture is consistent with the time span of audio capture. A code sketch of this preprocessing is given after the three parts below.
For the subtitle part of the video, images that may contain subtitles are captured and stored. Specifically, the following method may be employed: first traverse the videos and, once every interval t (for example, every 2s), capture and store an image of the lower 1/4 of the frame; finally, update the video-subtitle table.
For the audio part of the video, the video is converted into N segmented audio clips to facilitate speech recognition. Specifically, the following method may be employed: first traverse the videos and extract the audio of each video; then cut the extracted audio into segments and store them, cutting once every interval t' (for example, every 60s); finally, update the video-audio table.
For the image part of the video, the video is converted into N video segments in one-to-one correspondence with the audio segments, and then every frame in each video segment is captured and stored. Specifically, the following method may be employed: first traverse the videos, segment them and store the segments, with the segment length consistent with that of the audio segments, cutting once every interval t' (for example, every 60s); then capture and store every frame in each video segment, naming the images in capture order; finally, update the video-image table.
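A minimal sketch of this preprocessing for a single video, assuming OpenCV for frame capture and the ffmpeg command-line tool for audio extraction; t = 2s and t' = 60s follow the examples above, and the file naming scheme is an assumption:

```python
# Sketch of the step 1 preprocessing for one video: subtitle snapshots,
# per-frame images, and t'-second audio clips, registered in Tables 1-3.
import subprocess
import cv2

def preprocess(video_path, video_name, conn, t=2, t_prime=60):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_idx = 0
    ok, frame = cap.read()
    while ok:
        sec = frame_idx / fps
        if frame_idx % max(1, int(fps * t)) == 0:
            # Subtitle part: keep only the lower 1/4 of the frame.
            h = frame.shape[0]
            sub_name = f"{video_name}_sub_{int(sec)}.png"
            cv2.imwrite(sub_name, frame[3 * h // 4:, :])
            conn.execute("INSERT INTO video_subtitle VALUES (?, ?, ?)",
                         (video_name, sub_name, int(sec)))
        # Image part: store every frame, named in capture order and keyed
        # to the start time of its t'-second segment.
        seg_start = int(sec) // t_prime * t_prime
        frame_name = f"{video_name}_f{frame_idx:08d}.png"
        cv2.imwrite(frame_name, frame)
        conn.execute("INSERT INTO video_image VALUES (?, ?, ?)",
                     (video_name, frame_name, seg_start))
        frame_idx += 1
        ok, frame = cap.read()
    cap.release()

    # Audio part: extract the audio track and cut it into t'-second clips.
    subprocess.run(["ffmpeg", "-i", video_path, "-vn", "-f", "segment",
                    "-segment_time", str(t_prime),
                    f"{video_name}_%03d.wav"], check=True)
    duration = frame_idx / fps if fps else 0
    for i, start in enumerate(range(0, int(duration), t_prime)):
        conn.execute("INSERT INTO video_audio VALUES (?, ?, ?)",
                     (video_name, f"{video_name}_{i:03d}.wav", start))
    conn.commit()
```

Keeping the image and audio segmentation on the same t' grid is what later lets hits from different modalities be merged per video segment for the weight calculation.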
step 2: extract video features.
Since video content has multi-modal features, the method extracts features from multiple dimensions and in different ways. The specific steps are as follows:
step 2.1: establish a text feature table, an audio feature table and an image feature table.
When the text feature table is established, corresponding fields are set in the table to store the video name, the text features contained in the video segment, and the start node of the video segment. The start node refers to the start time of the video segment to which the text features belong; the unit can be seconds.
When the audio feature table is established, corresponding fields are set in the table to store the video name, the file name (such as a pcm file name), and the start node of the audio segment.
When the image feature table is established, corresponding fields are set in the table to store the video name, the image name and the start node of the video segment. These feature tables are sketched below.
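Continuing the earlier sketch, the three feature tables can be set up in the same assumed SQLite store, with field names following Tables 4 to 6 of the embodiment:

```python
# Sketch of the three feature tables of step 2.1, mirroring Tables 4-6 of
# the embodiment (again assuming the SQLite store from the earlier sketch).
import sqlite3

conn = sqlite3.connect("video_retrieval.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS text_feature (   -- Table 4: text feature table
    Video_name VARCHAR(100),   -- video name to be retrieved
    Content    VARCHAR(1000),  -- text features contained in the video segment
    Time       INT             -- start node of the video segment, in seconds
);
CREATE TABLE IF NOT EXISTS audio_feature (  -- Table 5: audio feature table
    Video_name VARCHAR(100),
    Voice_name VARCHAR(100),   -- pcm file name
    Time       INT             -- start node of the audio segment, in seconds
);
CREATE TABLE IF NOT EXISTS image_feature (  -- Table 6: image feature table
    Video_name   VARCHAR(100),
    Picture_name VARCHAR(100),
    Time         INT           -- start node of the video segment, in seconds
);
""")
conn.commit()
```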
Step 2.2: and extracting and storing text features, audio features and image features in the video.
Wherein the text feature comprises subtitle information for an occurrence in the video.
Specifically, the following method may be adopted to extract text features:
the first step is as follows: and traversing the video-subtitle table obtained in the step 1 to obtain a video name, a subtitle picture name and subtitle interception time.
The second step is that: and finding out a corresponding image according to the image name read in the first step, and converting the image into a gray scale image.
The third step: and performing text recognition on the image obtained in the second step.
The fourth step: integrating the text contents identified in the third step, and performing data cleaning and merging on a plurality of sections of text contents, specifically, firstly selecting a plurality of sections of text contents, for example, selecting the contents of 30 caption pictures every 2s when intercepting one caption, namely selecting the caption contents in 60s every time, keeping the same time length with the time length of intercepting the audio section, then performing data cleaning, namely deleting the character contents appearing for a plurality of times in the selected text contents, only keeping the character contents once, finally merging the cleaned selected text contents, and taking the earliest intercepting time in each selected caption as the intercepting time of the merged corresponding caption.
The fifth step: and storing the video name obtained in the first step, the text content combined in the fourth step and the caption interception time into a text feature table, wherein the caption interception time is correspondingly stored into a video segment starting node field of the text feature table, and the identified text content is correspondingly stored into a text feature field.
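A sketch of this text-feature extraction, assuming pytesseract as the OCR engine (the method does not name one) and the tables sketched earlier; one group covers the subtitle pictures of a 60s window, i.e. 30 pictures at one snapshot every 2s:

```python
# Sketch: OCR each subtitle picture, then clean and merge the recognized
# text per 60 s window, storing the result in the text feature table.
# pytesseract and the Chinese language pack are assumptions.
from collections import defaultdict
import cv2
import pytesseract

def extract_text_features(conn, window=60):
    groups = defaultdict(list)  # (video, window start) -> [(time, text)]
    for video_name, sub_name, time in conn.execute(
            "SELECT Video_name, Subtitle_name, Time FROM video_subtitle"):
        gray = cv2.cvtColor(cv2.imread(sub_name), cv2.COLOR_BGR2GRAY)
        text = pytesseract.image_to_string(gray, lang="chi_sim").strip()
        groups[(video_name, time // window * window)].append((time, text))
    for (video_name, _), items in groups.items():
        items.sort()                        # capture-time order
        seen, merged = set(), []
        for _, text in items:
            if text and text not in seen:   # data cleaning: keep text once
                seen.add(text)
                merged.append(text)
        earliest = items[0][0]              # earliest capture time in group
        conn.execute("INSERT INTO text_feature VALUES (?, ?, ?)",
                     (video_name, " ".join(merged), earliest))
    conn.commit()
```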
The following method can be adopted to extract the audio features:
the first step is as follows: and traversing the video-audio table obtained in the step 1 to obtain the video name, the audio name and the initial node of the audio segment.
The second step is that: according to the audio name read in the first step, finding out the corresponding audio for format conversion (for example, converting the audio into a single-channel pcm file with 16k16bits of depth), so that the audio is suitable for a voice recognition model;
the third step: and storing the video name obtained in the first step, the initial node of the audio segment and the file name obtained in the second step into corresponding fields of an audio feature table.
The following method can be adopted for extracting the image features:
the first step is as follows: and traversing the video-image table obtained in the step 1 to obtain a video name, an image name and a video segment initial node.
The second step is that: and finding all image frames contained in each video segment according to the video name, the video segment starting node and the image name obtained in the first step, comparing adjacent image frames of each video segment, calculating the image similarity, and deleting similar images.
The method for judging whether the image frames are the same video segment image frame comprises the following steps: and comparing whether the video name and the video segment starting node are the same, wherein the same video segment belongs to the same video segment.
The image similarity calculation method comprises the following steps: and calculating a structural similarity index of two adjacent frames of images, and when the index is greater than a set value (such as 0.7), considering that the contents of the two images are similar.
The third step: and storing the video name, the video segment name, the names of the images left after similar images are deleted and the corresponding video segment initial nodes into corresponding fields of the image feature table.
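A sketch of this extraction, using the structural_similarity function from scikit-image to compare adjacent frames of the same video segment, with the example threshold of 0.7:

```python
# Sketch: drop near-duplicate adjacent frames per video segment (SSIM > 0.7),
# then register the surviving frames in the image feature table.
import cv2
from skimage.metrics import structural_similarity

def dedup_frames(conn, threshold=0.7):
    rows = conn.execute(
        "SELECT Video_name, Frame_name, Time FROM video_image "
        "ORDER BY Video_name, Time, Frame_name").fetchall()
    kept, prev_gray, prev_key = [], None, None
    for video_name, frame_name, seg_start in rows:
        gray = cv2.cvtColor(cv2.imread(frame_name), cv2.COLOR_BGR2GRAY)
        key = (video_name, seg_start)        # identifies the video segment
        if key == prev_key and prev_gray is not None:
            score = structural_similarity(prev_gray, gray)
            if score > threshold:            # similar: delete this frame
                continue
        kept.append((video_name, frame_name, seg_start))
        prev_gray, prev_key = gray, key
    conn.executemany("INSERT INTO image_feature VALUES (?, ?, ?)", kept)
    conn.commit()
```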
step 3: convert the audio and image features extracted in step 2 into text features.
Because the expression forms of the multi-modal features differ greatly, leaving them unconverted into a uniform form would hinder storage and be unsuitable for subsequent retrieval.
Specifically, the implementation method is as follows:
when the audio features are processed, the audio features extracted in the step 2.2 are converted into text features and stored, which specifically comprises the following steps:
firstly, traversing the audio characteristic table obtained in the step 2.2 to obtain a video name, a file name and an audio segment initial node;
then, according to the obtained file name, finding a corresponding format file, and carrying out voice recognition on the file to obtain corresponding text content;
finally, storing the obtained video name, the audio segment starting node and the identified text content into a text characteristic table, wherein the audio segment starting node is correspondingly stored into a video segment starting node field of the text characteristic table, and the identified text content is correspondingly stored into a text characteristic field;
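A sketch of this conversion. The method does not name a speech recognition engine, so speech_to_text below is a hypothetical stand-in for any ASR model that accepts the pcm files produced in step 2.2:

```python
# Sketch of the audio-to-text conversion of step 3.
def speech_to_text(pcm_path: str) -> str:
    # Hypothetical stand-in: plug in any ASR engine that accepts
    # 16 kHz / 16-bit mono PCM input.
    raise NotImplementedError

def audio_features_to_text(conn):
    for video_name, pcm_name, start in conn.execute(
            "SELECT Video_name, Voice_name, Time FROM audio_feature"):
        content = speech_to_text(pcm_name)  # recognized speech content
        # The audio segment start node goes into the video segment start
        # node field of the text feature table (Table 4).
        conn.execute("INSERT INTO text_feature VALUES (?, ?, ?)",
                     (video_name, content, start))
    conn.commit()
```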
when the image features are processed, the image features extracted in the step 2.2 are converted into text features and stored, specifically as follows:
firstly, traversing the image feature table obtained in the step 2.2 to obtain a video name, an image name and a video segment starting node;
then, finding out a corresponding image according to the obtained image name, and carrying out image entity recognition and face recognition on the image to obtain entity information and face information contained in the image. The entity information refers to the category of the entity, and the face information refers to the identity information of the person to which the face belongs;
then, dividing the image frames in the same video segment into a group, and carrying out duplication removal and combination on entity information and face information identified in the image frames in the same group;
and finally, storing the obtained video name, the combined entity information and face information and the video segment starting node into a text feature table, wherein the combined entity information and face information are correspondingly stored into a text feature field.
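A sketch of this conversion; detect_entities and identify_faces are hypothetical stand-ins for the image entity recognition and face recognition engines, which the method does not name:

```python
# Sketch of the image-to-text conversion of step 3: recognize, then
# deduplicate and merge per video segment via set union.
from collections import defaultdict

def detect_entities(image_path):  # hypothetical: -> set of entity categories
    raise NotImplementedError

def identify_faces(image_path):   # hypothetical: -> set of person identities
    raise NotImplementedError

def image_features_to_text(conn):
    groups = defaultdict(set)  # (video, segment start) -> merged info
    for video_name, picture_name, start in conn.execute(
            "SELECT Video_name, Picture_name, Time FROM image_feature"):
        # Frames sharing a video name and start node form one group; the
        # set union deduplicates and merges the recognized information.
        groups[(video_name, start)] |= detect_entities(picture_name)
        groups[(video_name, start)] |= identify_faces(picture_name)
    for (video_name, start), info in groups.items():
        conn.execute("INSERT INTO text_feature VALUES (?, ?, ?)",
                     (video_name, " ".join(sorted(info)), start))
    conn.commit()
```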
step 4: perform feature conversion on the multi-modal retrieval data. The multi-modal retrieval data comprises retrieval data of three modalities: text, speech and image.
Specifically, first judge whether the retrieval data is text. If so, no further processing is needed and the method proceeds directly to step 5. Otherwise, further judge whether the retrieval data is speech; if so, perform speech recognition on the input audio data to obtain the corresponding text features, i.e. the speech content. Otherwise, treat the retrieval data as an image by default, and perform image entity recognition and face recognition on the input image data to obtain the corresponding text features, i.e. the entity categories and person identity information contained in the image.
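A sketch of this dispatch, reusing the hypothetical recognizers from the sketches above and assuming the query arrives as a text string or a file path together with a modality tag:

```python
# Sketch of the step 4 modality dispatch for the retrieval data.
def query_to_text(query, modality):
    if modality == "text":
        return query                          # used as-is, no processing
    if modality == "speech":
        return speech_to_text(query)          # recognized speech content
    # default: image -> entity categories plus person identities
    return " ".join(sorted(detect_entities(query) | identify_faces(query)))
```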
step 5: retrieve the videos and the nodes where the corresponding video content is located according to the text features obtained in step 4, and calculate the weight of each retrieval result.
Specifically, step 5 comprises the following steps:
First, establish a result table, setting a video name field, a video segment start node field and a weight field;
Then, match the text features obtained in step 4 against the text feature fields of the text feature table;
When the text features obtained in step 4 are contained in a text feature field of the text feature table, the corresponding content is considered retrieved; record the video name and video segment start node from the text feature table in the result table, with the weight initialized to 0;
Finally, traverse the result table and compare the results pairwise. When two results belong to the same video segment, i.e. have the same video name and video segment start node, delete the latter and add one to the weight of the former, updating the result table until the traversal ends.
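A sketch of step 5. Substring containment implements "the text features are contained in the text feature field", and duplicate (video name, start node) hits collapse into one result whose weight counts the extra occurrences, matching the pairwise merge described above:

```python
# Sketch of retrieval and weight calculation against the text feature table.
def retrieve(conn, query_text):
    conn.execute("CREATE TABLE IF NOT EXISTS result "   # Table 7
                 "(Video_name VARCHAR(100), Time INT, Value INT)")
    hits = [(v, t) for v, content, t in conn.execute(
                "SELECT Video_name, Content, Time FROM text_feature")
            if query_text in content]         # containment match
    merged = {}
    for video_name, start in hits:
        key = (video_name, start)             # same video segment?
        if key in merged:
            merged[key] += 1                  # delete latter, former + 1
        else:
            merged[key] = 0                   # weight initialized to 0
    conn.executemany("INSERT INTO result VALUES (?, ?, ?)",
                     [(v, t, w) for (v, t), w in merged.items()])
    conn.commit()
    return sorted(merged.items(), key=lambda kv: -kv[1])
```

Because the subtitle, speech and image information of one segment occupy separate rows of the text feature table, a segment matched in several modalities accumulates a higher weight than one matched in a single modality.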
Examples
This example describes a specific embodiment of the method of the present invention.
As shown in fig. 1, the system is connected to a face labeling system and a display system. The face labeling system performs semi-automatic face labeling on face images collected from the network and face images input by the user; "semi-automatic" means that the identity of the face images input by the user or collected from the network is already known, so the labeling system labels each face image in a standardized way according to that identity information and stores it in a face image database, thereby assisting face recognition.
Using the method provided by the invention, the videos to be retrieved are analyzed and processed in combination with the face image data in the face image database. A user can choose to retrieve with speech, text or images; the retrieval results are fed into the display system, through which the user can view the result videos. The tables used in this embodiment are defined as follows.
TABLE 1: video-subtitle table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Subtitle_name | Varchar(100) | Captured subtitle picture name
3 | Time | Int | Subtitle capture time (s)
TABLE 2: video-audio table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Voice_name | Varchar(100) | Extracted audio file name
3 | Time | Int | Start time of the audio segment (s)
TABLE 3: video-image table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Frame_name | Varchar(100) | Captured image frame name
3 | Time | Int | Start time of the video segment (s)
TABLE 4: text feature table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Content | Varchar(1000) | Text features contained in the video segment
3 | Time | Int | Start time of the video segment (s)
TABLE 5: audio feature table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Voice_name | Varchar(100) | pcm file name
3 | Time | Int | Start time of the audio segment (s)
TABLE 6: image feature table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Picture_name | Varchar(100) | Image name
3 | Time | Int | Start time of the video segment (s)
TABLE 7: result table
No. | Field name | Type | Description
1 | Video_name | Varchar(100) | Name of the video to be retrieved
2 | Time | Int | Start time of the video segment (s)
3 | Value | Int | Weight of the result
Fig. 2 is a system processing flow of a multimodal retrieval method for video content according to the present invention.
First, the videos to be retrieved are preprocessed. All videos in the video folder are read and preprocessed according to step 1 of the invention, and the information is updated into Tables 1, 2 and 3; the preprocessing flow is shown in fig. 3.
To facilitate retrieval of local video content, the videos must first be segmented during preprocessing. Subtitles are obtained mainly by segmenting the images in the video; considering that subtitles do not update very frequently and usually appear in the lower half of the frame, the lower 1/4 of the frame is captured every 2s, the time nodes are recorded, and the relevant information is updated into Table 1. Audio is obtained mainly by converting the video file into an audio file; the audio is cut every 60s, the time nodes are recorded, and the relevant information is updated into Table 2. Images are obtained mainly by cutting the video every 60s, recording the time nodes, and then extracting the image frames of each video segment frame by frame; finally the relevant information is updated into Table 3.
Next, feature extraction is performed on the subtitle pictures, the audio files and the video image frames, and the information is updated into Tables 4, 5 and 6; the feature extraction flow is shown in fig. 4.
For the subtitle pictures, text features are extracted, mainly by performing text recognition on the captured pictures. The Subtitle_name field of the video-subtitle table (Table 1) is read, and the subtitle pictures are traversed for text recognition. Then, in the order of subtitle capture time, the text recognized from 30 subtitle pictures is selected at a time and cleaned: text appearing multiple times within the selection is deleted, keeping only one occurrence. Finally, the cleaned text is merged; the merged content is updated into the Content field of Table 4, the earliest capture time within the selection is updated into the Time field of Table 4 as the capture time of the merged subtitle, and the corresponding Video_name field of Table 1 is updated into the Video_name field of Table 4.
For the audio files, the ordinary audio files are converted into pcm files on which speech recognition can be performed. The Voice_name field of the video-audio table (Table 2) is read, the audio files are traversed for format conversion, and the converted pcm file names, together with the Video_name and Time fields of Table 2, are updated into Table 5.
For the video image frames, image features are extracted. The Frame_name, Time and Video_name fields of the video-image table (Table 3) are read; all image frames with the same Time and Video_name fields are treated as frames of the same video segment. Adjacent image frames of each video segment are compared by computing the structural similarity index of the images; when the index is greater than 0.7, the later image is deleted. The names of the images remaining after the similar images are deleted, together with the corresponding Time and Video_name fields, are updated into the corresponding fields of Table 6.
The audio features and image features are then converted to text features, and a flow chart for feature conversion is shown in fig. 5.
The Voice_name field of the audio feature table (Table 5) is read, the pcm files are traversed, speech recognition is performed, and the recognition results, together with the Video_name and Time fields of Table 5, are updated into Table 4.
The Picture_name, Time and Video_name fields of the image feature table (Table 6) are read, the images are traversed, and image entity recognition and face recognition are performed on them to obtain the entity information and face information contained in each image. The image frames with the same Time and Video_name fields are then grouped together; the entity information and face information recognized in the frames of the same group are deduplicated and merged, the merged content is updated into the Content field of Table 4, and the corresponding Time and Video_name fields of Table 6 are updated into the Time and Video_name fields of Table 4.
Next, the retrieval data input by the user is processed; the flow is shown in fig. 6. It is judged whether the data format input by the user is text; if so, the method proceeds directly to the final retrieval stage. If not, it is judged whether the input is audio; if so, speech recognition is performed and the converted text information is used as the retrieval target text features of the retrieval stage. Otherwise, the input data is regarded as a picture; image entity recognition and face recognition are performed on it, and the recognition results are used as the retrieval target text features of the retrieval stage.
Finally, the retrieval target text features are matched against the Content field of Table 4. When a Content field contains the retrieval target text features, the Video_name and Time fields of Table 4 are recorded. Results with the same Video_name and Time fields are merged and their occurrences counted; the counts are updated as weights into the Value field of the result table (Table 7), and the Video_name and Time fields corresponding to the merged results are updated into the Video_name and Time fields of Table 7. The flow is shown in fig. 7.
The result set obtained after this processing is the retrieval result, comprising the target video names and the node information of the target content, and can provide accurate and comprehensive data support for subsequent video display. Since the retrieval module is separated from the video processing module, the text features of processed videos can be reused, which improves the efficiency of the video content retrieval system.
While the foregoing is directed to the preferred embodiment of the present invention, it is not intended that the invention be limited to the embodiment and the drawings disclosed herein. Equivalents and modifications may be made without departing from the spirit of the disclosure, which is to be considered as within the scope of the invention.

Claims (8)

1. A multi-modal retrieval method for video content is characterized by comprising the following steps:
step 1: the video to be analyzed and retrieved is preprocessed according to its multi-modal characteristics:
firstly, a video-subtitle table, a video-audio table and a video-image table are established, and the required fields are set in the tables;
the video-subtitle table comprises a video name field, a subtitle picture name field and a subtitle capture time field, wherein the video name field is used for storing the name of the video to be preprocessed, the subtitle picture name field is used for storing the name of the subtitle picture obtained after preprocessing, and the subtitle capture time field is used for storing the video time at which the subtitle was captured;
the video-audio table comprises a video name field, an audio name field and an audio segment start time field, wherein the video name field is used for storing the name of the video to be preprocessed, the audio name field is used for storing the name of the audio obtained after preprocessing, and the audio segment start time field is used for storing the video time at which the audio segment was cut;
the video-image table comprises a video name field, an image name field and a video segment start time field, wherein the video name field is used for storing the name of the video to be preprocessed, and the image name field is used for storing the name of the image obtained after preprocessing; the video segment start time field is used for storing the video time at which the video segment was cut;
then, the videos are processed in a targeted manner to extract the subtitle, audio and image information they contain; meanwhile, in order to calculate weights for subsequent retrieval, the videos are segmented, wherein the time span of image capture is consistent with the time span of audio capture;
for the subtitle part of the video, images that may contain subtitles are captured and stored; for the audio part of the video, the video is converted into N segmented audio clips to facilitate speech recognition; for the image part of the video, the video is converted into N video segments in one-to-one correspondence with the audio segments, and then every frame in each video segment is captured and stored;
step 2: extracting video features, comprising the steps of:
step 2.1: establishing a text feature table, an audio feature table and an image feature table;
when the text feature table is established, corresponding fields are set in the table to store the video name, the text features contained in the video segment and the start node of the video segment; the start node refers to the start time of the video segment to which the text features belong;
when the audio feature table is established, corresponding fields are set in the table to store the video name, the file name and the start node of the audio segment;
when the image feature table is established, corresponding fields are set in the table to store the video name, the image name and the start node of the video segment;
step 2.2: extracting and storing the text features, audio features and image features in the video; wherein the text features include the subtitle information appearing in the video;
step 3: converting the audio and image features extracted in step 2 into text features;
step 4: performing feature conversion on the multi-modal retrieval data;
the multi-modal retrieval data comprises retrieval data of three modalities: text, speech and image; it is first judged whether the retrieval data is text; if so, no further processing is performed and the method proceeds directly to step 5; otherwise it is further judged whether the retrieval data is speech; if so, speech recognition is performed on the input audio data to obtain the corresponding text features, namely the speech content; otherwise the retrieval data is treated as an image by default, and image entity recognition and face recognition are performed on the input image data to obtain the corresponding text features, namely the entity categories and person identity information contained in the image;
step 5: retrieving the videos and the nodes where the corresponding video content is located according to the text features obtained in step 4, and calculating the weight of each retrieval result.
2. The multi-modal retrieval method for video content according to claim 1, wherein:
for the subtitle part of the video, each video is first traversed and, once every interval t, an image of the lower 1/4 of the frame is captured and stored; finally the video-subtitle table is updated;
for the audio part of the video, the videos are first traversed and the audio of each video is extracted; the extracted audio is then cut into N segments and stored, cutting once every interval t'; finally the video-audio table is updated;
for the image part of the video, the videos are first traversed, segmented and stored, with the segment length consistent with that of the audio segments, cutting once every interval t'; then every frame in each video segment is captured and stored, the images being named in capture order; finally the video-image table is updated.
3. The multi-modal retrieval method for video content as claimed in claim 1, wherein in step 2, the following method is adopted for extracting text features:
the first step: traversing the video-subtitle table obtained in step 1 to obtain the video name, the subtitle picture name and the subtitle capture time;
the second step: finding the corresponding image according to the subtitle picture name read in the first step, and converting it into a gray-scale image;
the third step: performing text recognition on the image obtained in the second step;
the fourth step: integrating the text content recognized in the third step, and cleaning and merging the multiple text passages;
the fifth step: storing the video name obtained in the first step, the text content merged in the fourth step and the subtitle capture time into the text feature table, wherein the subtitle capture time is stored in the video segment start node field of the text feature table and the recognized text content in the text feature field.
4. The multi-modal retrieval method for video content as claimed in claim 1, wherein in step 2, the following method is adopted for extracting audio features:
the first step: traversing the video-audio table obtained in step 1 to obtain the video name, the audio name and the start node of the audio segment;
the second step: finding the corresponding audio according to the audio name read in the first step and converting its format so that the audio suits the speech recognition model;
the third step: storing the video name obtained in the first step, the start node of the audio segment and the file name obtained in the second step into the corresponding fields of the audio feature table.
5. The multi-modal retrieval method for video contents according to claim 1, wherein in the step 2, the following method is adopted for extracting image features:
the first step: traversing the video-image table obtained in step 1 to obtain the video name, the image name and the video segment start node;
the second step: finding all image frames contained in each video segment according to the video name, video segment start node and image name obtained in the first step; comparing adjacent image frames of each video segment, calculating the image similarity, and deleting similar images;
the third step: storing the video name, the video segment name, the names of the images remaining after the similar images are deleted and the corresponding video segment start nodes into the corresponding fields of the image feature table.
6. The multi-modal retrieval method for video content according to claim 5, wherein in the second step, whether two image frames belong to the same video segment is judged as follows: the video name and the video segment start node are compared, and frames for which both are the same belong to the same video segment;
the image similarity is calculated as follows: the structural similarity index of two adjacent frames is computed, and when the index is greater than a set value, the contents of the two images are considered similar.
7. The multi-modal retrieval method for video content according to claim 1, wherein in step 3, when the audio features are processed, the audio features extracted in step 2.2 are converted into text features and stored, as follows:
first, the audio feature table obtained in step 2.2 is traversed to obtain the video name, the file name and the start node of the audio segment;
then, the corresponding converted file is found according to the obtained file name, and speech recognition is performed on it to obtain the corresponding text content;
finally, the obtained video name, the start node of the audio segment and the recognized text content are stored in the text feature table, wherein the start node of the audio segment is stored in the video segment start node field of the text feature table and the recognized text content in the text feature field;
when the image features are processed, the image features extracted in step 2.2 are converted into text features and stored, as follows:
first, the image feature table obtained in step 2.2 is traversed to obtain the video name, the image name and the video segment start node;
then, the corresponding image is found according to the obtained image name, and image entity recognition and face recognition are performed on it to obtain the entity information and face information contained in the image; the entity information refers to the category of the entity, and the face information refers to the identity of the person to whom the face belongs;
then, the image frames of the same video segment are grouped together, and the entity information and face information recognized in the frames of the same group are deduplicated and merged;
finally, the obtained video name, the merged entity and face information, and the video segment start node are stored in the text feature table, wherein the merged entity and face information is stored in the text feature field.
8. The multi-modal retrieval method for video content according to claim 1, wherein the step 5 comprises the steps of:
first, a result table is established, setting a video name field, a video segment start node field and a weight field;
then, the text features obtained in step 4 are matched against the text feature fields of the text feature table;
when the text features obtained in step 4 are contained in a text feature field of the text feature table, the corresponding content is considered retrieved; the video name and video segment start node from the text feature table are recorded in the result table, with the weight initialized to 0;
finally, the result table is traversed and the results are compared pairwise; when two results belong to the same video segment, i.e. have the same video name and video segment start node, the latter is deleted, one is added to the weight of the former, and the result table is updated until the traversal ends.
CN202111631648.5A 2021-12-29 2021-12-29 Multi-modal retrieval method for video content Pending CN114385859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111631648.5A CN114385859A (en) 2021-12-29 2021-12-29 Multi-modal retrieval method for video content

Publications (1)

Publication Number Publication Date
CN114385859A true CN114385859A (en) 2022-04-22

Family

ID=81199336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111631648.5A Pending CN114385859A (en) 2021-12-29 2021-12-29 Multi-modal retrieval method for video content

Country Status (1)

Country Link
CN (1) CN114385859A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021855A (en) * 2006-10-11 2007-08-22 鲍东山 Video searching system based on content
CN101021857A (en) * 2006-10-20 2007-08-22 鲍东山 Video searching system based on content analysis
US20100318532A1 (en) * 2009-06-10 2010-12-16 International Business Machines Corporation Unified inverted index for video passage retrieval
CN107544978A (en) * 2016-06-24 2018-01-05 北京新岸线网络技术有限公司 A kind of content based video retrieval system method
CN107688571A (en) * 2016-08-04 2018-02-13 上海德拓信息技术股份有限公司 The video retrieval method of diversification
US20190349641A1 (en) * 2018-05-10 2019-11-14 Naver Corporation Content providing server, content providing terminal and content providing method
KR20210074734A (en) * 2019-12-12 2021-06-22 동의대학교 산학협력단 System and Method for Extracting Keyword and Ranking in Video Subtitle
CN111680173A (en) * 2020-05-31 2020-09-18 西南电子技术研究所(中国电子科技集团公司第十研究所) CMR model for uniformly retrieving cross-media information
CN113010731A (en) * 2021-02-22 2021-06-22 杭州西湖数据智能研究院 Multimodal video retrieval system
CN113326387A (en) * 2021-05-31 2021-08-31 引智科技(深圳)有限公司 Intelligent conference information retrieval method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880496A (en) * 2022-04-28 2022-08-09 国家计算机网络与信息安全管理中心 Multimedia information topic analysis method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108833973B (en) Video feature extraction method and device and computer equipment
CN108769731B (en) Method and device for detecting target video clip in video and electronic equipment
CN103052953B (en) Messaging device, information processing method
CN101616264B (en) Method and system for cataloging news video
CN109299324B (en) Method for searching label type video file
CN103530652A (en) Face clustering based video categorization method and retrieval method as well as systems thereof
US8068678B2 (en) Electronic apparatus and image processing method
CN101673266B (en) Method for searching audio and video contents
GB2463378A (en) Registering faces and identification features from images received via a network
CN101650958A (en) Extraction method and index establishment method of movie video scene clip
CN101373482A (en) Information processing device and information processing method
CN112784078A (en) Video automatic editing method based on semantic recognition
US20150278248A1 (en) Personal Information Management Service System
CN110110147A (en) A kind of method and device of video frequency searching
CN102855317A (en) Multimode indexing method and system based on demonstration video
CN112468754B (en) Method and device for acquiring pen-recorded data based on audio and video recognition technology
CN111353055B (en) Cataloging method and system based on intelligent tag extension metadata
CN114625918A (en) Video recommendation method, device, equipment, storage medium and program product
CN115795096A (en) Video metadata labeling method for movie and television materials
CN114385859A (en) Multi-modal retrieval method for video content
CN110378190B (en) Video content detection system and detection method based on topic identification
CN106372083B (en) A kind of method and system that controversial news clue is found automatically
CN114051154A (en) News video strip splitting method and system
CN110795597A (en) Video keyword determination method, video retrieval method, video keyword determination device, video retrieval device, storage medium and terminal
EP2345978A1 (en) Detection of flash illuminated scenes in video clips and related ranking of video clips

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination