CN116644208A - Video retrieval method, device, electronic equipment and computer readable storage medium - Google Patents

Video retrieval method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN116644208A
CN116644208A
Authority
CN
China
Prior art keywords
video
video segment
similarity
text
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310621588.1A
Other languages
Chinese (zh)
Inventor
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310621588.1A priority Critical patent/CN116644208A/en
Publication of CN116644208A publication Critical patent/CN116644208A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to artificial intelligence technology and discloses a video retrieval method, which comprises the following steps: segmenting each preset video file by shot to obtain a first video segment set; performing semantic segmentation on each first video segment in turn to obtain a second video segment set for the corresponding video file; extracting the video segment features of each second video segment by utilizing a pre-trained CLIP+LSTM model and fusing all the video segment features to obtain the video features of the corresponding video file; and receiving a text to be searched, extracting the text features of the text to be searched by utilizing the pre-trained CLIP+LSTM model, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file whose feature similarity meets a preset similarity condition as the target video file. The invention also provides a video retrieval device, an electronic device, and a storage medium. The method and the device can improve the accuracy of medical video retrieval in the intelligent medical field.

Description

Video retrieval method, device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a video retrieval method, a video retrieval device, an electronic device, and a computer readable storage medium.
Background
With the development of Internet and video technology, more and more intelligent medical platforms publish medical science popularization videos on the Internet, pushing medical knowledge to users in an intuitive form. As the volume of published medical science popularization videos keeps growing, how to accurately and rapidly retrieve the videos a user needs has become an important concern for major medical platforms.
The video retrieval methods commonly used in the industry include the following:
First: matching the query text input by the user against the text title of the video;
Second: extracting tags of the video and matching the query text against those tags;
Third: recognizing the text information corresponding to the video by ASR (Automatic Speech Recognition) or OCR (Optical Character Recognition) techniques, and matching the query text against the recognized video text information.
All of the above methods essentially match text (the query text) against text (video tags, video titles, video text information), i.e. they match data within the same representation space. Under such schemes the image and picture information of the medical video is inevitably lost, so the accuracy of video retrieval still needs to be improved.
Disclosure of Invention
The invention provides a video retrieval method, a video retrieval device, electronic equipment and a computer readable storage medium, and mainly aims to improve the accuracy of medical video retrieval in the intelligent medical field.
In order to achieve the above object, the present invention provides a video retrieval method, including:
performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
Optionally, the semantic segmentation is performed on each first video segment in the first video segment set in turn to obtain a second video segment set of the corresponding video file, which includes:
identifying the text of each first video segment, and dividing the text of each first video segment into clauses;
performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;
calculating adjacent window similarity and skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing clauses corresponding to the vector similarity meeting a preset similarity threshold into one second video segment;
and collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.
Optionally, the performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment includes:
sequentially segmenting each clause, and carrying out word vector conversion on each segmented word;
adding word vectors corresponding to each clause to obtain a word vector matrix of each clause;
And carrying out pooling operation on each word vector matrix to obtain sentence vectors corresponding to each clause.
Optionally, the calculating the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing the clause corresponding to the vector similarity meeting the preset similarity threshold into the second video segment includes:
Step A: take the first sentence vector in the sentence vector set as the starting point;
Step B: calculate the adjacent window similarity between the starting point and the sentence vector adjacent to it, and judge whether the adjacent window similarity is greater than a preset similarity threshold;
Step C: when the adjacent window similarity is greater than the preset similarity threshold, take the starting point and the sentence vector adjacent to it as a temporary video segment;
Step C1: eliminate the sentence vectors in the temporary video segment from the sentence vector set, and judge whether the sentence vector set after elimination is empty;
Step C11: when the sentence vector set after elimination is empty, divide the temporary video segment into a second video segment and jump to step E1;
Step C12: when the sentence vector set after elimination is not empty, take the first sentence vector in the sentence vector set as a new starting point, calculate the adjacent window similarity and the skip window similarity between this starting point and the vectors in the temporary video segment, weight and average the adjacent window similarity and the skip window similarity to obtain the vector similarity, and judge whether the vector similarity is greater than the preset similarity threshold;
Step C121: when the vector similarity is greater than the preset similarity threshold, add the starting point to the temporary video segment and return to step C1;
Step C122: when the vector similarity is not greater than the preset similarity threshold, divide the temporary video segment into a second video segment, eliminate the vectors corresponding to that second video segment from the sentence vector set, and jump to step E;
Step D: when the adjacent window similarity is not greater than the preset similarity threshold, divide the starting point alone into a second video segment and eliminate it from the sentence vector set;
Step E: judge whether the sentence vector set after elimination is empty;
Step E1: when the sentence vector set after elimination is empty, collect the second video segments to obtain the second video segment set;
Step E2: when the sentence vector set after elimination is not empty, return to step A.
Optionally, extracting video segment features of each second video segment in the second video segment set in turn by using a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of a corresponding video file, includes:
extracting video frames of each second video segment in sequence according to the time sequence to obtain a video frame set of each second video segment;
sequentially extracting frame feature vectors of each video frame in the video frame set by utilizing the CLIP part in the pre-trained CLIP+LSTM model;
carrying out convolution operation on all frame feature vectors of each second video segment by utilizing an LSTM part in the pre-trained CLIP+LSTM model to obtain video segment features of the corresponding second video segment;
and carrying out pooling operation on all video segment characteristics corresponding to the preset video file to obtain the video characteristics of the preset video file.
Optionally, extracting the text features of the text to be searched by using the pre-trained CLIP+LSTM model includes:
performing word segmentation on the text to be searched to obtain one or more search terms, and obtaining the word vector of each search term;
splicing the word vectors of each search term by utilizing the CLIP part in the pre-trained CLIP+LSTM model to obtain a text vector matrix;
sequentially selecting one search term as the target word, and calculating the key value of the target word according to the word vector of the target word and the text vector matrix;
selecting a preset number of search terms as feature terms according to the order of the key values from large to small;
and splicing the word vectors of the feature terms to obtain the text features of the text to be searched.
Optionally, the calculating the key value of the target word according to the word vector of the target word and the text vector matrix includes:
calculating the key value of the target word by using the following key value algorithm:
K = |w · Wᵀ| / (|w| × |W|)
wherein K is the key value, W is the text vector matrix, T is the matrix transpose symbol, |·| is the modulo symbol, and w is the word vector of the target word.
In order to solve the above problems, the present invention also provides a video retrieval apparatus, the apparatus comprising:
the system comprises a shot segmentation module, a video segmentation module and a video segmentation module, wherein the shot segmentation module is used for executing segmentation operation on each preset video file through a shot to obtain a first video segment set corresponding to each preset video file;
the semantic segmentation module is used for sequentially carrying out semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set of the corresponding video file;
the video feature extraction module is used for sequentially extracting video segment features of each second video segment in the second video segment set by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of corresponding video files;
and the text and video feature comparison module is used for receiving the text to be searched, extracting the text features of the text to be searched by utilizing the pre-trained CLIP+LSTM model, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to the feature similarity meeting the preset similarity condition as a target video file.
In order to solve the above-described problems, the present invention also provides a computer-readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the video retrieval method described above.
According to the embodiment of the invention, each preset video file is segmented by shot and then by semantics, so that the video file is accurately refined; the video segment features of the refined second video segments are extracted, and the video features of the final video file are obtained from these video segment features, so that the video features embody the image characteristics of the video file. Finally, the video file meeting the preset similarity condition is selected as the target video file by calculating the feature similarity between the text features of the text to be searched and the video features of each video file. Because this retrieval is based on the video features of the video files themselves, the retrieval accuracy of video files is improved.
Drawings
Fig. 1 is a flow chart of a video retrieval method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a detailed implementation of one of the steps in the video retrieval method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another step in the video retrieval method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another step in the video retrieval method according to an embodiment of the present invention;
FIG. 5 is a functional block diagram of a video retrieval device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device for implementing the video retrieval method according to an embodiment of the present invention.
The objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiment of the application provides a video retrieval method. The execution subject of the video retrieval method includes, but is not limited to, at least one of a server, a terminal, and the like that can be configured to execute the method provided by the embodiments of the application. In other words, the video retrieval method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to fig. 1, a flowchart of a video retrieval method according to an embodiment of the invention is shown.
In this embodiment, the video retrieval method includes:
s1, executing segmentation operation on each preset video file through a lens to obtain a first video segment set corresponding to each preset video file;
in the embodiment of the invention, an intelligent medical platform is taken as an example for explanation. The intelligent medical platform is used for providing medical assistance and medical knowledge science popularization for common users by maintaining and releasing a series of medical science popularization videos. The preset video file refers to a medical video maintained by the intelligent medical platform, for example, a common disease prevention knowledge video, a family medical aid common knowledge video, a public health protection video, a medical hot event video and the like.
It will be appreciated that, in general, a video file is composed of one or more groups of shots, each shot expressing an independent meaning. Each preset video file is therefore segmented by shot, so that one preset video file is divided into a plurality of video segments, each containing the image information of one shot. This is convenient to operate and facilitates the subsequent acquisition of the features of each preset video file.
In the embodiment of the invention, each preset video file can be segmented by using a publicly available shot segmentation tool built on OpenCV.
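As a concrete illustration, below is a minimal Python sketch of such shot segmentation using OpenCV histogram differencing; the function name, thresholds, and the HSV-histogram comparison are illustrative assumptions of this sketch rather than the specific tool contemplated by the patent.

import cv2

def split_by_shot(video_path, threshold=0.5, min_len=15):
    """Split a video into first video segments at shot boundaries,
    detected by comparing HSV color histograms of consecutive frames.
    Returns a list of (start_frame, end_frame) index pairs."""
    cap = cv2.VideoCapture(video_path)
    segments, start, prev_hist, idx = [], 0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive histograms suggests a shot cut
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold and idx - start >= min_len:
                segments.append((start, idx - 1))
                start = idx
        prev_hist = hist
        idx += 1
    cap.release()
    if idx > start:
        segments.append((start, idx - 1))
    return segments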
S2, carrying out semantic segmentation on each first video segment in the first video segment set in sequence to obtain a second video segment set of a corresponding video file;
it will be appreciated that if a shot contains a lot of semantic information, e.g. a long shot. Each first video segment can be subdivided, so that the semantics of the second video segment obtained through final cutting are purer, granularity is not overlarge, and the accuracy of extracting video features of a corresponding video file based on the final cut video segment is improved.
In detail, referring to fig. 2, the step S2 includes:
S21, identifying the text of each first video segment, and dividing the text of each first video segment into clauses;
S22, performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;
S23, calculating the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and dividing the clauses whose vector similarity meets a preset similarity threshold into one second video segment;
S24, collecting all the second video segments to obtain the second video segment set corresponding to the preset video file.
In the embodiment of the invention, the ASR technology can be utilized to acquire the video segment text of each first video segment, and after the video segment text is divided into sentences, semantic segmentation is carried out on the first video segment by taking the sentence as a unit.
In another optional embodiment of the present invention, an acoustic model may be used to perform speech recognition on the speech information corresponding to the first video segment to obtain the corresponding video segment text. The acoustic model is pre-constructed and contains a database of a plurality of words together with the standard pronunciation of each word; by collecting the user speech at each moment of the speech information in the first video segment and probability-matching that speech against the standard pronunciations in the database, the video segment text is obtained.
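For instance, a minimal sketch of the subsequent clause-splitting step on the recognized video segment text, splitting on sentence-final punctuation (the punctuation set and the function name are assumptions of this sketch):

import re

def split_clauses(segment_text):
    # Split the ASR text of a first video segment into clauses on
    # Chinese and English sentence-final punctuation.
    parts = re.split(r"(?<=[。！？；!?.;])\s*", segment_text)
    return [p.strip() for p in parts if p.strip()]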
In an alternative embodiment of the present invention, sentence vector conversion may be performed on each of the clauses by the following method:
Sequentially segmenting each clause, and carrying out word vector conversion on each segmented word;
adding word vectors corresponding to each clause to obtain a word vector matrix of each clause;
and carrying out pooling operation on each word vector matrix to obtain sentence vectors corresponding to each clause.
In the embodiment of the invention, the word segmentation processing can be performed on the clause by adopting a preset standard dictionary to obtain a plurality of segmented words, wherein the standard dictionary comprises a plurality of standard segmented words. The clause may also be segmented using a segmentation tool, such as jieba segmentation.
In the embodiment of the invention, word vector conversion can be performed on each segmented word by using a model with a word vector conversion function, such as a word2vec model or an NLP (Natural Language Processing) model.
In an alternative embodiment of the present invention, the word vector matrix may be pooled using a k-max pooling method, where the value of k may be predefined, for example as 35%. Since clauses contain different numbers of words, 35% of the word count of each clause is rounded up, and the top k maximum values are taken in each pooling block: for example, if clause 1 contains only 1 word, 1×35% rounds up to 1; if clause 2 contains 5 words, 5×35% rounds up to 2.
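A minimal sketch of the sentence vector conversion under the assumptions above: jieba-style tokenization, a word2vec lookup table, and k-max pooling read as averaging the top-k values per dimension with k = ceil(35% of the word count); all names here are illustrative.

import math
import numpy as np

def sentence_vector(clause, word_vectors, tokenize, k_ratio=0.35):
    # word_vectors: mapping word -> np.ndarray (e.g. gensim KeyedVectors);
    # tokenize: a word segmenter such as jieba.lcut.
    words = [w for w in tokenize(clause) if w in word_vectors]
    if not words:
        return None
    matrix = np.stack([word_vectors[w] for w in words])   # (n_words, dim)
    k = max(1, math.ceil(k_ratio * len(words)))           # 1 word -> k=1, 5 words -> k=2
    topk = np.sort(matrix, axis=0)[-k:, :]                # top-k values per dimension
    return topk.mean(axis=0)                              # pooled sentence vector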
In detail, referring to fig. 3, the calculating of the adjacent window similarity and the skip window similarity between every two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and the dividing of the clauses whose vector similarity meets the preset similarity threshold into one second video segment, includes:
Step A: take the first sentence vector in the sentence vector set as the starting point;
Step B: calculate the adjacent window similarity between the starting point and the sentence vector adjacent to it, and judge whether the adjacent window similarity is greater than a preset similarity threshold;
Step C: when the adjacent window similarity is greater than the preset similarity threshold, take the starting point and the sentence vector adjacent to it as a temporary video segment;
Step C1: eliminate the sentence vectors in the temporary video segment from the sentence vector set, and judge whether the sentence vector set after elimination is empty;
Step C11: when the sentence vector set after elimination is empty, divide the temporary video segment into a second video segment and jump to step E1;
Step C12: when the sentence vector set after elimination is not empty, take the first sentence vector in the sentence vector set as a new starting point, calculate the adjacent window similarity and the skip window similarity between this starting point and the vectors in the temporary video segment, weight and average the adjacent window similarity and the skip window similarity to obtain the vector similarity, and judge whether the vector similarity is greater than the preset similarity threshold;
Step C121: when the vector similarity is greater than the preset similarity threshold, add the starting point to the temporary video segment and return to step C1;
Step C122: when the vector similarity is not greater than the preset similarity threshold, divide the temporary video segment into a second video segment, eliminate the vectors corresponding to that second video segment from the sentence vector set, and jump to step E;
Step D: when the adjacent window similarity is not greater than the preset similarity threshold, divide the starting point alone into a second video segment and eliminate it from the sentence vector set;
Step E: judge whether the sentence vector set after elimination is empty;
Step E1: when the sentence vector set after elimination is empty, collect the second video segments to obtain the second video segment set;
Step E2: when the sentence vector set after elimination is not empty, return to step A.
In the embodiment of the invention, the adjacent window similarity is the similarity between two neighboring sentence vectors, and the skip window similarity is the similarity between two sentence vectors one position apart. For example, if the sentence vectors of a first video segment are S1, S2, S3, S4, and S5, then there is an adjacent window similarity between S1 and S2, a skip window similarity between S1 and S3, an adjacent window similarity between S2 and S3, a skip window similarity between S2 and S4, and a skip window similarity between S3 and S5.
In the embodiment of the invention, the adjacent window similarity or the skip window similarity between two sentence vectors can be calculated by utilizing a pre-trained MLP (Multilayer Perceptron) model.
In the embodiment of the invention, different weights can be assigned in advance to the adjacent window similarity and the skip window similarity, and the vector similarity is finally obtained by performing a weighted averaging operation on the two.
In the embodiment of the invention, the preset similarity threshold can be set according to the actual service condition.
For example, assume a first video segment includes 4 clauses whose sentence vectors are S1, S2, S3, and S4. Performing semantic segmentation on this first video segment may yield, among others, the following division results:
The first division result includes three second video segments: S1, S2+S3, and S4. Here the adjacent window similarity between S1 and S2 is smaller than the preset similarity threshold, so the clause containing S1 is divided into an independent second video segment; the adjacent window similarity between S2 and S3 is larger than the preset similarity threshold, while the vector similarity obtained from the adjacent window similarity between S3 and S4 and the skip window similarity between S2 and S4 is smaller than the preset similarity threshold, so S2 and S3 are divided into one second video segment and S4 into an independent one.
The second division result includes three second video segments: S1+S2, S3, and S4. Here the adjacent window similarity between S1 and S2 is greater than the preset similarity threshold, while the vector similarity obtained from the adjacent window similarity between S2 and S3 and the skip window similarity between S1 and S3 is less than the preset similarity threshold, so S1 and S2 are divided into one second video segment; the adjacent window similarity between S3 and S4 is smaller than the preset similarity threshold, so the clause containing S3 forms an independent second video segment, and the clause containing S4 forms another.
The third division result includes two second video segments: S1+S2+S3 and S4. Here the adjacent window similarity between S1 and S2 is larger than the preset similarity threshold; the vector similarity obtained from the adjacent window similarity between S2 and S3 and the skip window similarity between S1 and S3 is larger than the preset similarity threshold; and the vector similarity obtained from the adjacent window similarity between S3 and S4 and the skip window similarity between S2 and S4 is smaller than the preset similarity threshold, so S1, S2, and S3 form one second video segment and S4 an independent one.
It should be noted that the above is only an example; other division results are possible for S1, S2, S3, and S4.
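The following simplified Python sketch illustrates the greedy segmentation loop of steps A through E, with cosine similarity standing in for the MLP-scored window similarities; the weights and threshold are illustrative assumptions.

import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_split(vectors, threshold=0.6, w_adj=0.7, w_skip=0.3):
    # vectors: ordered list of sentence vectors of one first video segment.
    # Returns lists of clause indices, one list per second video segment.
    segments, i, n = [], 0, len(vectors)
    while i < n:
        if i + 1 >= n or cos(vectors[i], vectors[i + 1]) <= threshold:
            segments.append([i])                        # step D: starting point alone
            i += 1
            continue
        temp = [i, i + 1]                               # step C: temporary video segment
        j = i + 2
        while j < n:
            adj = cos(vectors[j], vectors[temp[-1]])    # adjacent window similarity
            skip = cos(vectors[j], vectors[temp[-2]])   # skip window similarity
            if w_adj * adj + w_skip * skip > threshold: # weighted average
                temp.append(j)                          # step C121: grow the segment
                j += 1
            else:
                break                                   # step C122: close the segment
        segments.append(temp)
        i = j
    return segments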
S3, extracting video segment characteristics of each second video segment in the second video segment set in sequence by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
In the embodiment of the invention, the pre-trained CLIP+LSTM model comprises a CLIP part (Contrastive Language-Image Pre-training) and an LSTM part (Long Short-Term Memory network).
In detail, the extracting video segment features of each second video segment in the second video segment set sequentially by using the pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of a corresponding video file, includes:
extracting video frames of each second video segment in sequence according to the time sequence to obtain a video frame set of each second video segment;
sequentially extracting frame feature vectors of each video frame in the video frame set by utilizing the CLIP part in the pre-trained CLIP+LSTM model;
carrying out convolution operation on all frame feature vectors of each second video segment by utilizing an LSTM part in the pre-trained CLIP+LSTM model to obtain video segment features of the corresponding second video segment;
And carrying out pooling operation on all video segment characteristics corresponding to the preset video file to obtain the video characteristics of the preset video file.
In an optional embodiment of the present invention, 4 video frames are extracted from each second video segment according to the time sequence, so as to form a video frame set of the second video segment.
According to the embodiment of the invention, the video frames of each second video segment are extracted in turn, taking the second video segment as the unit; the pre-trained CLIP+LSTM model extracts the video segment feature of each second video segment from the frame feature vectors of its video frames; and the video feature of the corresponding video file is finally obtained from the video segment features of all the second video segments, so that the video feature embodies the image and picture characteristics of the video.
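A hedged sketch of this video branch, assuming the openai CLIP package with the ViT-B/32 image encoder (512-dimensional frame features) and a PyTorch LSTM as the aggregation part; the class, its parameters, and the use of the LSTM's final hidden state as the segment feature are assumptions of this sketch.

import torch
import clip  # the openai CLIP package; one possible concrete CLIP implementation

class VideoEncoder(torch.nn.Module):
    def __init__(self, device="cpu", hidden=512):
        super().__init__()
        self.clip_model, self.preprocess = clip.load("ViT-B/32", device=device)
        self.lstm = torch.nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.device = device

    @torch.no_grad()
    def segment_feature(self, frames):
        # frames: list of 4 PIL images sampled in time order from one second video segment
        batch = torch.stack([self.preprocess(f) for f in frames]).to(self.device)
        feats = self.clip_model.encode_image(batch).float()   # (4, 512) frame feature vectors
        _, (h, _) = self.lstm(feats.unsqueeze(0))             # LSTM over the frame sequence
        return h[-1].squeeze(0)                               # video segment feature

    def video_feature(self, segment_frame_lists):
        # Pool all segment features of one preset video file into its video feature
        seg_feats = [self.segment_feature(fr) for fr in segment_frame_lists]
        return torch.stack(seg_feats).mean(dim=0)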
And S4, receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
In the embodiment of the invention, the text characteristics of the text to be searched can be extracted by using the same model, namely the pre-trained CLIP+LSTM model. The text characteristics of the text to be searched and the video characteristics of the video file can be mapped to the same characterization space by the operation, so that the text characteristics and the video characteristics of the text to be searched can be conveniently compared and calculated.
In detail, referring to fig. 4, the extracting of the text features of the text to be searched by using the pre-trained CLIP+LSTM model includes:
S41, performing word segmentation on the text to be searched to obtain one or more search terms, and obtaining the word vector of each search term;
S42, splicing the word vectors of each search term by utilizing the CLIP part in the pre-trained CLIP+LSTM model to obtain a text vector matrix;
S43, sequentially selecting one search term as the target word, and calculating the key value of the target word according to the word vector of the target word and the text vector matrix;
S44, selecting a preset number of search terms as feature terms according to the order of the key values from large to small;
S45, splicing the word vectors of the feature terms to obtain the text features of the text to be searched.
In detail, the text to be searched contains a number of search terms, but not every search term is characteristic of the text. The search terms therefore need to be screened: each search term is selected in turn as the target word, and the key value of the target word is calculated from its word vector and the text vector matrix, so that the feature terms most representative of the text to be searched can be selected according to the key values to obtain the text features of the text to be searched.
Specifically, the calculating the key value of the target word according to the word vector of the target word and the text vector matrix includes:
calculating the key value of the target word by using the following key value algorithm:
K = |w · Wᵀ| / (|w| × |W|)
wherein K is the key value, W is the text vector matrix, T is the matrix transpose symbol, |·| is the modulo symbol, and w is the word vector of the target word.
In the embodiment of the invention, the preset number of search terms are selected from the plurality of search terms as characteristic terms according to the order of the key value of each search term from large to small.
For example, if the preset number is 2 and search term A and search term B have the largest key values among the plurality of search terms, then search term A and search term B are selected as feature terms according to the order of the key values from large to small, and their word vectors are spliced to obtain the text features of the text to be searched.
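A small sketch of this key-value screening, reading the formula above as the norm of the target word vector's projection onto the text vector matrix; this reading, like the function names, is an assumption of the sketch.

import numpy as np

def select_feature_terms(word_vecs, preset_number=2):
    # word_vecs: list of np.ndarray word vectors, one per search term.
    # Returns the indices of the top `preset_number` terms by key value.
    W = np.stack(word_vecs)                       # text vector matrix
    keys = [np.linalg.norm(w @ W.T) / (np.linalg.norm(w) * np.linalg.norm(W))
            for w in word_vecs]                   # key value of each target word
    order = np.argsort(keys)[::-1]                # key values, from large to small
    return order[:preset_number].tolist()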
In the embodiment of the invention, the feature similarity can be obtained by calculating the cosine similarity between the text feature and the video feature of each video file.
Further, the cosine similarity between the text feature and the video segment feature of each second video segment can also be calculated, so that the second video segment closest to the text feature is matched, further improving the accuracy of video retrieval.
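For illustration, a minimal PyTorch sketch of this final ranking step; the top-k selection stands in for the preset similarity condition and is an assumption of the sketch.

import torch

def retrieve(text_feat, video_feats, top_k=1):
    # Rank preset video files by cosine similarity between the text feature
    # and each video feature; the top-ranked files are the target video files.
    sims = torch.nn.functional.cosine_similarity(
        text_feat.unsqueeze(0), torch.stack(video_feats), dim=-1)
    return sims.topk(min(top_k, len(video_feats))).indices.tolist()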
According to the embodiment of the invention, each preset video file is segmented by shot and then by semantics, so that the video file is accurately refined; the video segment features of the refined second video segments are extracted, and the video features of the final video file are obtained from these video segment features, so that the video features embody the image characteristics of the video file. Finally, the video file meeting the preset similarity condition is selected as the target video file by calculating the feature similarity between the text features of the text to be searched and the video features of each video file. Because this retrieval is based on the video features of the video files themselves, the retrieval accuracy of video files is improved.
Fig. 5 is a functional block diagram of a video search device according to an embodiment of the present invention.
The video retrieval apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the video retrieval apparatus 100 includes: a shot segmentation module 101, a semantic segmentation module 102, a video feature extraction module 103, and a text-to-video feature comparison module 104. A module of the invention, which may also be referred to as a unit, is a series of computer program segments stored in the memory of the electronic device that can be executed by the processor of the electronic device to perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the shot segmentation module 101 is configured to perform a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
the semantic segmentation module 102 is configured to perform semantic segmentation on each first video segment in the first video segment set in sequence to obtain a second video segment set of a corresponding video file;
the video feature extraction module 103 is configured to sequentially extract video segment features of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fuse all the video segment features to obtain video features of a corresponding video file;
The text-to-video feature comparison module 104 is configured to receive a text to be retrieved, extract text features of the text to be retrieved by using the pre-trained clip+lstm model, sequentially calculate feature similarities between the text features and video features of each preset video file, and select a video file corresponding to the feature similarities that satisfies a preset similarity condition as a target video file.
In detail, each module of the video searching apparatus 100 in the embodiment of the present invention adopts the same technical means as the video searching method described in fig. 1 to 4, and can produce the same technical effects, which are not repeated here.
Fig. 6 is a schematic structural diagram of an electronic device for implementing a video retrieval method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a video retrieval program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disks, multimedia cards, card-type memories (e.g., SD or DX memory, etc.), magnetic memories, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the video retrieval program, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective parts of the entire electronic device using various interfaces and lines, and executes various functions of the electronic device 1 and processes data by running or executing programs or modules (e.g., video retrieval programs, etc.) stored in the memory 11, and calling data stored in the memory 11.
The bus may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 6 shows only an electronic device with certain components; it will be understood by a person skilled in the art that the structure shown in fig. 6 does not constitute a limitation of the electronic device 1, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 1 may also comprise a network interface, optionally the network interface may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The video retrieval program stored in the memory 11 in the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
Sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
Further, the modules/units integrated in the electronic device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive (U disk), a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application and not for limiting the same, and although the present application has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present application without departing from the spirit and scope of the technical solution of the present application.

Claims (10)

1. A video retrieval method, the method comprising:
performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
Carrying out semantic segmentation on each first video segment in the first video segment set in turn to obtain a second video segment set of a corresponding video file;
sequentially extracting video segment characteristics of each second video segment in the second video segment set by using a pre-trained CLIP+LSTM model, and fusing all the video segment characteristics to obtain video characteristics of a corresponding video file;
receiving a text to be searched, extracting text characteristics of the text to be searched by using the pre-trained CLIP+LSTM model, sequentially calculating characteristic similarity between the text characteristics and video characteristics of each preset video file, and selecting a video file corresponding to the characteristic similarity meeting the preset similarity condition as a target video file.
2. The video retrieval method according to claim 1, wherein said sequentially semantically segmenting each first video segment in the first set of video segments to obtain a second set of video segments of the corresponding video file comprises:
identifying the text of each first video segment, and dividing the text of each first video segment into clauses;
performing sentence vector conversion on each clause of each first video segment to obtain a sentence vector set corresponding to the first video segment;
Calculating adjacent window similarity and skip window similarity between every two sentence vectors in the sentence vector set to obtain corresponding vector similarity, and dividing clauses corresponding to the vector similarity meeting a preset similarity threshold into one second video segment;
and collecting all the second video segments to obtain a second video segment set corresponding to the preset video file.
3. The method for retrieving video according to claim 2, wherein said performing sentence vector conversion on each of said clauses of each of said first video segments to obtain a set of sentence vectors corresponding to the first video segments comprises:
sequentially segmenting each clause, and carrying out word vector conversion on each segmented word;
adding word vectors corresponding to each clause to obtain a word vector matrix of each clause;
and carrying out pooling operation on each word vector matrix to obtain sentence vectors corresponding to each clause.
4. The method of claim 2, wherein the calculating the adjacent window similarity and the skip window similarity between each two sentence vectors in the sentence vector set to obtain the corresponding vector similarity, and dividing the clause corresponding to the vector similarity meeting the preset similarity threshold into the second video segment includes:
Step A: taking the first sentence vector in the sentence vector set as a starting point;
Step B: calculating the adjacent window similarity between the starting point and the sentence vector adjacent to the starting point, and judging whether the adjacent window similarity is greater than a preset similarity threshold;
Step C: when the adjacent window similarity is greater than the preset similarity threshold, taking the starting point and the sentence vector adjacent to the starting point as a temporary video segment;
Step C1: eliminating the sentence vectors in the temporary video segment from the sentence vector set, and judging whether the sentence vector set after elimination is empty;
Step C11: when the sentence vector set after elimination is empty, dividing the temporary video segment into a second video segment, and jumping to Step E1;
Step C12: when the sentence vector set after elimination is not empty, taking the first sentence vector in the sentence vector set as a new starting point, calculating the adjacent window similarity and the skip window similarity between the starting point and the vectors in the temporary video segment, taking a weighted average of the adjacent window similarity and the skip window similarity to obtain the vector similarity, and judging whether the vector similarity is greater than the preset similarity threshold;
Step C121: when the vector similarity is greater than the preset similarity threshold, adding the starting point into the temporary video segment, and returning to Step C1;
Step C122: when the vector similarity is not greater than the preset similarity threshold, dividing the temporary video segment into a second video segment, eliminating the vectors corresponding to the second video segment from the sentence vector set, and jumping to Step E;
Step D: when the adjacent window similarity is not greater than the preset similarity threshold, dividing the starting point into a second video segment, and eliminating the starting point from the sentence vector set;
Step E: judging whether the sentence vector set after elimination is empty;
Step E1: when the sentence vector set after elimination is empty, collecting the second video segments to obtain the second video segment set;
Step E2: when the sentence vector set after elimination is not empty, returning to Step A.
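Read end to end, Steps A through E describe a greedy grouping loop over the sentence vectors. The sketch below follows that reading under explicit assumptions: cosine similarity, the skip window measured against the earliest vector of the temporary segment, and equal weights in the weighted average; the claim fixes none of these.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def semantic_segments(vectors, threshold=0.5):
    vectors, segments = list(vectors), []
    while vectors:                                        # Step A: first remaining vector is the start
        if len(vectors) == 1:                             # no adjacent vector left to compare
            segments.append([vectors.pop(0)])
            break
        if cos_sim(vectors[0], vectors[1]) > threshold:   # Steps B/C: adjacent-window check
            temp = [vectors.pop(0), vectors.pop(0)]       # temporary video segment
            while vectors:                                # Steps C1/C12
                cand = vectors[0]
                adj = cos_sim(cand, temp[-1])             # adjacent-window similarity
                skip = cos_sim(cand, temp[0])             # skip-window similarity (vs. segment start, assumed)
                if (adj + skip) / 2 > threshold:          # equal-weight average (assumed)
                    temp.append(vectors.pop(0))           # Step C121: absorb the candidate
                else:
                    break                                 # Step C122: close the segment
            segments.append(temp)
        else:                                             # Step D: start forms its own segment
            segments.append([vectors.pop(0)])
    return segments                                       # Step E1: collect the second segments

# usage: two aligned vectors group together, the opposed one splits off
vecs = [np.ones(4), np.ones(4), -np.ones(4)]
print([len(s) for s in semantic_segments(vecs)])  # [2, 1]
```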
5. The video retrieval method according to claim 1, wherein the step of sequentially extracting video segment features of each second video segment in the second video segment set by using a pre-trained clip+lstm model, and fusing all the video segment features to obtain video features of a corresponding video file includes:
extracting video frames from each second video segment in time order to obtain a video frame set of each second video segment;
sequentially extracting a frame feature vector of each video frame in the video frame set by utilizing the CLIP part of the pre-trained CLIP+LSTM model;
carrying out a convolution operation on all frame feature vectors of each second video segment by utilizing the LSTM part of the pre-trained CLIP+LSTM model to obtain the video segment features of the corresponding second video segment;
and carrying out pooling operation on all video segment characteristics corresponding to the preset video file to obtain the video characteristics of the preset video file.
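A self-contained sketch of claim 5's feature path, with a frozen linear layer standing in for the CLIP image encoder so the example runs without model weights; real use would call an actual CLIP model's image encoder. The LSTM here simply runs over the per-frame vectors in time order, and mean pooling over segments is an assumption.

```python
import torch
import torch.nn as nn

FRAME_DIM, SEG_DIM = 512, 256

# frozen stand-in for the CLIP image encoder (real use: a CLIP model's image encoder)
frame_encoder = nn.Linear(3 * 224 * 224, FRAME_DIM)

# the LSTM runs over per-frame vectors in time order to give one segment feature
lstm = nn.LSTM(input_size=FRAME_DIM, hidden_size=SEG_DIM, batch_first=True)

def segment_feature(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), sampled from one second video segment."""
    with torch.no_grad():
        per_frame = frame_encoder(frames.flatten(1))   # (num_frames, FRAME_DIM)
        _, (h_n, _) = lstm(per_frame.unsqueeze(0))     # sequence of frame features
    return h_n[-1, 0]                                  # final hidden state = segment feature

def video_feature(segment_features: list) -> torch.Tensor:
    # pooling over all segment features of one video file (mean-pool assumed)
    return torch.stack(segment_features).mean(dim=0)

# usage with random stand-in frames: two segments of eight frames each
segs = [segment_feature(torch.randn(8, 3, 224, 224)) for _ in range(2)]
print(video_feature(segs).shape)  # torch.Size([256])
```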
6. The video retrieval method according to claim 1, wherein extracting text features of the text to be retrieved by using the pre-trained CLIP+LSTM model comprises:
performing word segmentation on the text to be retrieved to obtain one or more search words, and obtaining a word vector of each search word;
splicing the word vectors of the search words by utilizing the CLIP part of the pre-trained CLIP+LSTM model to obtain a text vector matrix;
sequentially selecting one search word as a target word, and calculating a key value of the target word according to the word vector of the target word and the text vector matrix;
selecting a preset number of search words as feature words in descending order of key value;
and splicing the word vectors of the feature words to obtain the text features of the text to be retrieved.
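A sketch of claim 6's text path under stated assumptions: whitespace word segmentation, random stand-in word vectors, and a cosine-against-centroid score standing in for the key value (the actual key-value formula is the subject of claim 7).

```python
import numpy as np

DIM, TOP_K = 64, 2
rng = np.random.default_rng(0)

def text_feature(query: str) -> np.ndarray:
    words = query.split()                                 # stand-in word segmenter
    W = np.stack([rng.normal(size=DIM) for _ in words])   # text vector matrix (stand-in vectors)
    centroid = W.mean(axis=0)
    # key value per word: cosine against the matrix centroid (assumed scoring)
    keys = W @ centroid / (np.linalg.norm(W, axis=1) * np.linalg.norm(centroid) + 1e-8)
    top = np.argsort(keys)[::-1][:TOP_K]                  # largest key values first
    return np.concatenate([W[i] for i in sorted(top)])    # splice the feature words' vectors in order

print(text_feature("red car driving on a coastal road").shape)  # (TOP_K * DIM,)
```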
7. The video retrieval method according to claim 6, wherein the calculating the key value of the target word according to the word vector of the target word and the text vector matrix comprises:
calculating the key value of the target word by using the following key value algorithm:
wherein K is the key value, W is the text vector matrix, T is the matrix transpose symbol, |·| is the modulo symbol, and w is the word vector of the target word.
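The expression itself does not survive in this text. Purely as an assumption consistent with the listed symbols, one plausible form is a normalized projection of the target word's vector onto the text vector matrix:

```latex
% Assumed reconstruction only -- the patent's original formula is not reproduced here.
K = \frac{\left\lVert w\,W^{T} \right\rVert}{\left\lVert w \right\rVert \,\left\lVert W \right\rVert}
```

Under this reading, wW^T collects the dot products of the target word against every word vector in the matrix, so a word aligned with the query as a whole receives a large key value.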
8. A video retrieval apparatus, the apparatus comprising:
a shot segmentation module, used for performing a shot-based segmentation operation on each preset video file to obtain a first video segment set corresponding to each preset video file;
a semantic segmentation module, used for sequentially carrying out semantic segmentation on each first video segment in the first video segment set to obtain a second video segment set of the corresponding video file;
a video feature extraction module, used for sequentially extracting video segment features of each second video segment in the second video segment set by utilizing a pre-trained CLIP+LSTM model, and fusing all the video segment features to obtain video features of the corresponding video file;
and a text and video feature comparison module, used for receiving a text to be retrieved, extracting text features of the text to be retrieved by utilizing the pre-trained CLIP+LSTM model, sequentially calculating the feature similarity between the text features and the video features of each preset video file, and selecting the video file corresponding to a feature similarity meeting a preset similarity condition as a target video file.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the video retrieval method of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the video retrieval method according to any one of claims 1 to 7.
CN202310621588.1A 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium Pending CN116644208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310621588.1A CN116644208A (en) 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116644208A (en)

Family

ID=87643017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310621588.1A Pending CN116644208A (en) 2023-05-30 2023-05-30 Video retrieval method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116644208A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination