CN111651635A - Video retrieval method based on natural language description - Google Patents

Video retrieval method based on natural language description

Info

Publication number
CN111651635A
CN111651635A (application CN202010467416.XA)
Authority
CN
China
Prior art keywords
picture
video
word
description
vector
Prior art date
Legal status
Granted
Application number
CN202010467416.XA
Other languages
Chinese (zh)
Other versions
CN111651635B (en)
Inventor
王春辉 (Wang Chunhui)
胡勇 (Hu Yong)
Current Assignee
Polar Intelligence Technology Co ltd
Original Assignee
Polar Intelligence Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Polar Intelligence Technology Co ltd
Priority to CN202010467416.XA
Publication of CN111651635A
Application granted
Publication of CN111651635B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a video retrieval method based on natural language description. The method comprises the following steps: extracting pictures frame by frame from an input video file, resizing them to a fixed size, naming each picture by its frame number, and saving it; extracting a text description for each picture, and creating a video description file in which each picture is recorded with the video file path, the frame number of the picture, and the corresponding text description as fields; and searching the video description file according to the query sentence input by the user to obtain the matching pictures in the video files. Because the text description of every frame is generated before any query is issued, a user querying a video whose descriptions have already been generated obtains the target video and its temporal position quickly, which improves video retrieval speed.

Description

Video retrieval method based on natural language description
Technical Field
The invention belongs to the technical field of natural language understanding, and particularly relates to a video retrieval method based on natural language description.
Background
Video retrieval and localization is a complex and challenging problem. Localizing a particular moment in a video in response to a text query is related to many visual tasks, including video retrieval, temporal action localization, and video captioning and question answering.
Video retrieval is the task of, given a set of candidate videos and a language query, using a retrieval algorithm to find the videos that match the query. One retrieval model matches visual concepts in a video with semantic graphs generated by parsing sentence descriptions; another addresses the video-text alignment problem by assigning a time interval to each sentence in a temporally ordered set of sentences for a given video. Recently, Hendricks et al. proposed a joint video-language model for retrieving moments in video from text queries. However, these models can only verify segments that contain the corresponding moments, and the returned results contain much background noise. Although one could densely sample video moments at different scales and use these models to retrieve the corresponding moments, this is not only computationally expensive but also makes the matching task more challenging as the search space grows.
With respect to temporal localization, Gaidon et al. address the problem of temporally localizing actions in untrimmed video, focusing on a limited set of actions. The 3DConvNets model proposes an end-to-end segment-based 3D Convolutional Neural Network (CNN) framework that outperforms Recurrent Neural Network (RNN) based approaches by capturing spatio-temporal information simultaneously. There is also a temporal unit regression network model that jointly predicts action proposals and refines their temporal boundaries through temporal coordinate regression. Since these methods are limited to a predefined list of actions, researchers have proposed using natural language queries to localize activities. They exploit all contextual moments around the current input without explicitly considering the semantic information of the input query.
With respect to the video question-answering task, attention mechanisms have achieved impressive results in neural machine translation, video captioning, and video question answering. The visual attention model for video captioning attends to video frames at each time step without explicitly considering the semantic attributes of the predicted words, which is unnecessary and can even be misleading. To address this problem, hierarchical Long Short-Term Memory (LSTM) networks with adjusted temporal attention have been used for video captioning. Later, attention models were extended to attend selectively not only to specific temporal or spatial regions but also to specific input modalities, such as image features, motion features, and audio features. Recently, a multi-modal attention LSTM network has been developed that leverages multi-modal streams and temporal attention to focus selectively on particular elements during sentence generation.
Existing video retrieval and localization methods combine the approaches mentioned above from other tasks to some extent in order to improve model performance. However, these models are end-to-end: they must be run from scratch for every new query or new video, the running time is long, localization is not fast, and users lose interest in using them.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a video retrieval method based on natural language description.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video retrieval method based on natural language description comprises the following steps:
step 1, extracting pictures frame by frame from an input video file, resizing the pictures to a fixed size, naming each picture by its frame number and saving it;
step 2, extracting a text description for each picture, wherein each picture is described by one sentence, and creating a video description file in which each picture is recorded with the video file path, the frame number of the picture, and the corresponding text description as fields;
and step 3, searching the video description file according to the query sentence input by the user to obtain the pictures in the video files whose text descriptions match.
Compared with the prior art, the invention has the following beneficial effects:
the method extracts the pictures from the input video file according to the frames, extracts the text description of each picture, creates the video description file by taking the video file path, the frame number of the picture and the corresponding text description as fields for each picture, searches the video description file according to the query sentence input by the user to obtain the picture in the video file corresponding to the matched text description, and realizes the video retrieval based on the natural language description. Because the description of each frame of image is generated before the query, the user can quickly obtain the video to be queried and the time positioning when querying the video with the generated text description, and the video retrieval speed is improved.
Drawings
Fig. 1 is a flowchart of a video retrieval method based on natural language description according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention provides a video retrieval method based on natural language description, which comprises the following steps:
s101, extracting pictures from an input video file according to frames, setting the pictures into a fixed size, naming the pictures according to the frame number of each picture, and storing the pictures;
s102, extracting text description of each picture, wherein each picture is described by one sentence, and a video description file is created by taking a video file path, a frame number of the picture and the corresponding text description as fields aiming at each picture;
s103, searching the video description file according to the query sentence input by the user to obtain the picture in the video file corresponding to the matched text description.
In this embodiment, step S101 is mainly used to extract video images frame by frame, i.e. to obtain a series of pictures with a video file as input. Frames can be extracted according to the number of video frames using the FFmpeg module for Python, and the extracted pictures can be processed to the same size, for example 720 × 480 (pixels × pixels). Each picture is named and saved by its frame number, where the frame number is the sequence number at which the picture is extracted; for example, the frame number of the first extracted picture is 1.
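A minimal sketch of step S101 is given below; it uses OpenCV's cv2 in place of the FFmpeg module (an equivalent way to read frames), and the file names and paths are examples only:

import os
import cv2

def extract_frames(video_path, out_dir, size=(720, 480)):
    # Extract every frame, resize it to a fixed size, and save it named by its frame number.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_no = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_no += 1                      # frame numbers start at 1, as in the embodiment
        frame = cv2.resize(frame, size)    # fixed size, e.g. 720 x 480 pixels
        cv2.imwrite(os.path.join(out_dir, "%d.jpg" % frame_no), frame)
    cap.release()
    return frame_no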
In this embodiment, step S102 is mainly used to obtain a text description of each picture. The DenseCap model can be used to extract text descriptions from pictures. The text descriptions include a global description and local descriptions; to improve query speed, only the global description generated by the model is kept, i.e. one picture corresponds to one text description. The DenseCap model consists of three parts: a Convolutional Network, a Fully Convolutional Localization Layer (FCL), and an RNN language model. DenseCap can describe local details of a picture in natural language and can be regarded as a combination of object detection and ordinary image captioning: when the generated description is a single word, the model behaves as object detection; when the described object is the whole picture, it performs image caption generation. In this embodiment, the described objects are always whole pictures. After the text descriptions are generated, the video path, the frame number of each picture, and the picture's description text are combined to produce a video description file organized by video frame.
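One possible way to assemble that file is sketched below as a CSV with the three fields named in the embodiment; densecap_global_caption is a hypothetical wrapper around a pre-trained DenseCap model that returns the single global description of a picture, not part of the patent itself:

import csv
from pathlib import Path

def build_description_file(video_path, frame_dir, out_path, densecap_global_caption):
    # One row per extracted frame: video file path, frame number, global text description.
    frames = sorted(Path(frame_dir).glob("*.jpg"), key=lambda p: int(p.stem))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["video_path", "frame_number", "description"])
        for frame in frames:
            caption = densecap_global_caption(str(frame))  # global description only
            writer.writerow([video_path, int(frame.stem), caption])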
In this embodiment, step S103 is mainly used to search the video description file according to the query sentence and obtain the corresponding video file and picture. A search framework can be built with the whoosh library in Python, which performs full-text retrieval. Full-text retrieval comprises two processes, index creation and index search: the index is built first, and then it is searched to obtain the text descriptions that match the query sentence. The pictures corresponding to those text descriptions are the pictures being queried.
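A minimal whoosh sketch of this indexing-and-search step might look as follows; the schema field names and the index directory are illustrative choices, not prescribed by the embodiment:

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, NUMERIC
from whoosh.analysis import StemmingAnalyzer
from whoosh.qparser import QueryParser

schema = Schema(video_path=ID(stored=True),
                frame_number=NUMERIC(stored=True),
                description=TEXT(stored=True, analyzer=StemmingAnalyzer()))

def build_index(rows, index_dir="indexdir"):
    # rows: iterable of (video_path, frame_number, description) from the video description file
    os.makedirs(index_dir, exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for video_path, frame_no, description in rows:
        writer.add_document(video_path=video_path, frame_number=frame_no, description=description)
    writer.commit()
    return ix

def search(ix, user_query, limit=10):
    # Returns (video_path, frame_number) pairs whose descriptions match the query sentence.
    with ix.searcher() as searcher:
        q = QueryParser("description", ix.schema).parse(user_query)
        return [(hit["video_path"], hit["frame_number"]) for hit in searcher.search(q, limit=limit)]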
As an alternative embodiment, the method for extracting the text description of the picture in step S102 includes:
s1021, extracting a feature map of each picture by using a convolution network of a DenseCap model;
s1022, determining a candidate region and extracting a feature vector in the candidate region: firstly, inputting the feature map into a full convolution network, reversely mapping each pixel point in the feature map into an original image by taking each pixel point as an anchor point, then drawing anchor boxes, namely initial frames, with different aspect ratios and different sizes based on the anchor points, and predicting the confidence fraction and the position information of the initial frames by a positioning layer through a regression model; filtering out initial frames with the overlapping area exceeding 70% of the area with extremely high confidence scores by adopting a non-maximum inhibition mode to obtain candidate frames; finally, extracting the area in each candidate frame into a feature vector with a fixed size by adopting a bilinear interpolation method, wherein all the feature vectors form a feature matrix;
s1023, unfolding the feature matrix into a one-dimensional column vector by using a full connection layer;
s1024, inputting the one-dimensional column vector into an RNN network to obtain a code x-1Constructing a word vector sequence x with the length of T +2-1,x0,x1,x2,…,xT,x0For start flag, x1,x2,…,xTCoding the word series described by the picture text; outputting the vector sequence to RNN to train a prediction model; x is to be-1,x0Inputting the prediction model to obtain a word vector y0According to y0And predicting a first word, then taking the first word as an input of a next-layer RNN network, predicting a second word until the output word is an END mark, and obtaining the text description of the picture.
The embodiment provides a technical scheme for extracting the picture text description. The method comprises 4 steps S1021 to S1024.
Step S1021 is mainly used to extract a feature map of the picture using a convolutional network. The feature map contains various features of the picture, such as texture, light intensity and shape, and the value at each position represents the strength of a certain feature. Owing to the nature of convolutional neural networks, the features become more abstract and carry more semantic information as the network gets deeper. The convolutional network of the DenseCap model adopts a VGG-16-based structure comprising 13 convolutional layers with 3 × 3 kernels and 4 max-pooling layers with 2 × 2 pooling kernels. For a picture of size 3 × 720 × 480 (a three-dimensional matrix whose first dimension is the three colour channels red, green and blue), the output of the convolutional network is a feature map of size 512 × 45 × 30. This feature map is the input to the following fully convolutional localization layer (FCL).
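Purely for illustration, the truncated VGG-16 backbone described above can be approximated with torchvision; this is an assumption about an equivalent off-the-shelf backbone, not the patent's own network:

import torch
import torchvision

vgg = torchvision.models.vgg16()     # standard VGG-16: 13 conv layers + 5 max-pool layers
backbone = vgg.features[:30]         # keep the 13 conv layers and the first 4 max-pools
x = torch.randn(1, 3, 480, 720)      # one 720 x 480 RGB picture, in N x C x H x W order
features = backbone(x)
print(features.shape)                # torch.Size([1, 512, 30, 45]), i.e. the 512 x 45 x 30 feature map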
Step S1022 is mainly used to determine the candidate regions and extract the feature vectors inside them. This step is mainly performed by the fully convolutional localization layer, which is the core of the whole model and, similarly to Faster R-CNN, generates bounding boxes for the objects to be recognized in the picture. Its input is the feature map from the convolutional network; its output is a set of fixed-length feature vectors for a number of candidate regions (for example 300), where each candidate region carries three pieces of data: its coordinates, a confidence score, and its region features. A larger confidence score indicates a region closer to a real region. The processing in the fully convolutional localization layer comprises four steps. The first step is anchor convolution: each pixel of the C × W' × H' feature map produced by the convolutional network is taken as an anchor point and mapped back into the original image, anchor boxes of different aspect ratios and sizes are drawn around each anchor point, k anchor boxes per anchor point (for example k = 12), and for each anchor box the localization layer in the FCL predicts a confidence score and position information with a regression model. Concretely, the feature map is passed through a convolutional layer with 3 × 3 kernels and then a convolutional layer with 1 × 1 kernels whose number of kernels is 5k, so the final output of this layer is a 5k × W' × H' array containing the confidence scores and position information for all anchor points. The second step is bounding-box regression, which refines the initial boxes: because the boxes from the previous step may not match the real regions well, linear regression is used, under the supervision of the real regions, to obtain four offsets per box that update the coordinates of the box centre and the width and height of the candidate box. The third step is sampling: the previous two steps produce too many candidate boxes, so to reduce the running cost they are sampled and 300 candidate boxes are selected by non-maximum suppression, which removes any candidate box whose overlap with a higher-scoring box exceeds 70%; this reduces redundant overlapping outputs and localizes the targets more precisely. The fourth step is bilinear interpolation: each candidate region obtained after sampling is a rectangular box with its own size and aspect ratio, so to connect with the subsequent fully connected layers (the recognition network) and the RNN language model, the model extracts each candidate region into a feature vector of fixed size by bilinear interpolation and combines the feature vectors of all candidate regions into a feature matrix.
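The non-maximum suppression used in the sampling step can be sketched as follows; the box format [x1, y1, x2, y2] and the helper name are assumptions, while the 70% overlap threshold and the 300-box budget follow the embodiment:

import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_k=300):
    # Keep up to top_k boxes, discarding any box whose IoU with a higher-scoring kept box exceeds iou_thresh.
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]          # indices sorted by descending confidence score
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        # intersection of the best remaining box with every other remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap the kept box too much
    return keep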
Step S1023 is mainly used to flatten the feature matrix obtained in the previous step into one-dimensional column vectors and then combine the one-dimensional vectors of all positive samples into a matrix. This step is mainly performed by a fully connected neural network: the features of each candidate region are flattened into a one-dimensional column vector and passed through two fully connected layers, each followed by a ReLU activation function and Dropout. Each candidate region thus yields a one-dimensional vector of length D = 4096, and all of these vectors are stacked into a 300 × 4096 matrix that is passed to the RNN language model. In addition, the confidence score and position information of each candidate region can be refined a second time to produce its final confidence score and position. This refinement is essentially the same as the earlier bounding-box regression, except that the regression is performed on the length-D vector.
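A sketch of the two fully connected layers of this recognition network is given below; the 512 × 7 × 7 pooled region size is an assumption, since the embodiment only fixes the output length D = 4096:

import torch.nn as nn

recognition_net = nn.Sequential(
    nn.Flatten(),                    # flatten the fixed-size region feature, assumed 512 x 7 x 7
    nn.Linear(512 * 7 * 7, 4096),    # first fully connected layer
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),           # second fully connected layer, output length D = 4096
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
)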
Step S1024 is mainly used to output the text description of the picture. This step is performed by the RNN language model, which takes the one-dimensional feature vector obtained in the previous step as input and outputs a natural language sequence describing the picture content.
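A greedy decoding loop in the spirit of step S1024 might look as follows; rnn_step, embed and vocab are hypothetical stand-ins for the trained prediction model, its word embedding and its vocabulary, so this is a sketch rather than the patent's implementation:

import torch

def generate_caption(image_code, rnn_step, embed, vocab, start_id, end_id, max_len=20):
    # image_code plays the role of x_{-1}; the START token plays the role of x_0.
    hidden = None
    _, hidden = rnn_step(image_code, hidden)        # feed x_{-1}
    inp = embed(torch.tensor([start_id]))           # feed x_0
    words = []
    for _ in range(max_len):
        logits, hidden = rnn_step(inp, hidden)      # next-word distribution (y_0, then y_1, ...)
        word_id = int(logits.argmax(dim=-1))
        if word_id == end_id:                       # stop at the END flag
            break
        words.append(vocab[word_id])
        inp = embed(torch.tensor([word_id]))        # the predicted word becomes the next input
    return " ".join(words)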
The key points of the DenseCap model are the FCLN structure and making the localization layer differentiable through bilinear interpolation, so that end-to-end training from picture regions to natural language descriptions is supported. Experimental results show that, compared with previous network structures, the network structure of this embodiment improves both the quality of the generated picture descriptions and the generation speed. Given these advantages, this embodiment uses a pre-trained DenseCap model, takes the text description of each picture as its output, and constructs a file with the video path, the frame number of the picture, and the picture's text description as fields. Since DenseCap sits between object recognition and ordinary captioning, the generated picture descriptions contain more information about local regions than an ordinary caption generation model, which improves the accuracy of video retrieval and localization.
As an optional embodiment, the step S103 specifically includes:
s1031, reading the video description file, inputting the text description of the picture in the video description file into a word segmentation component, removing punctuation marks and stop words, and performing word segmentation processing to obtain word elements; inputting the lemmas into a language processing component, converting the lemmas into lower case words and converting the lower case words into root words, wherein the root words are indexes;
s1032, performing lexical analysis on the query sentence input by the user, and identifying words and keywords; carrying out syntactic analysis, and constructing a syntactic tree according to syntactic rules of the query statement; performing language processing to process the query statement; searching the index to obtain the text description of the document, namely the picture, which accords with the syntax tree;
s1033, regarding each obtained document and query sentence as a word sequence, and calculating the weight of each word according to the following formula:
w = TF × log_e(n/d)    (1)
in the formula, w is the weight, TF is the number of times the word occurs in the document, d is the number of documents containing the word, and n is the total number of documents;
and replacing each word in each document and in the query sentence with its weight to obtain a vector representation of each document and of the query sentence, and computing the cosine similarity between each document vector and the query sentence vector; the picture corresponding to the document with the largest cosine similarity is the picture being queried.
The embodiment provides a technical scheme for searching the picture matched with the query statement from the video description file. The method comprises 3 steps S1031 to S1033.
Step S1031 is mainly used to create the index. Creating the index is the process of performing language processing on the text descriptions in the video description file and building an index of lemmas. It is mainly carried out by a word segmentation component and a language processing component. The word segmentation component removes punctuation marks and stop words (words without practical meaning, such as "a" and "an") from the text description and segments it into lemmas. For example, the text "I am driving a car on the road" is segmented into the lemmas "I", "driving", "car", "road". The language processing component then converts the lemmas to lower case and reduces them to root words, and these root words are the created index. The example above yields the index terms "i", "driving", "car", "road".
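whoosh's built-in StemmingAnalyzer performs this lower-casing, stop-word removal and stemming in one pass; a small sketch using the example sentence above:

from whoosh.analysis import StemmingAnalyzer

analyzer = StemmingAnalyzer()   # tokenizes, lower-cases, removes stop words, stems to root words
tokens = [t.text for t in analyzer("I am driving a car on the road")]
print(tokens)                   # stop words such as "a", "on", "the" are dropped; the remaining words are lower-cased and stemmed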
Step S1032 is mainly used to search the index. First, lexical analysis is performed on the query sentence to identify words and keywords; then syntactic analysis builds a syntax tree according to the syntactic rules of the query sentence; language processing further processes the original query sentence. Finally, the index built in the previous step is searched to obtain the documents, i.e. the picture text descriptions, that match the syntax tree.
Step S1033 is mainly used to select, from the documents obtained in the previous step, the document (i.e. picture text description) that best matches the query sentence, and thereby obtain the video file and picture being queried. First, the weight of every word in every document and in the query sentence is computed according to formula (1); then each word is replaced by its weight, giving a weight vector for each document and for the query sentence; finally, the cosine similarity between each document vector and the query sentence vector is computed, and the document with the largest cosine similarity is the text description of the picture being queried. With that text description, the name of the video file containing the picture and the picture's frame number are also obtained.
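A plain-Python sketch of this weighting and matching step follows; it applies formula (1) with the natural logarithm and assumes the documents and the query have already been reduced to word lists by the preceding steps:

import math
from collections import Counter

def tfidf_vectors(documents, query):
    # documents: list of word lists (the matched picture descriptions); query: word list.
    n = len(documents)
    vocab = sorted({w for doc in documents + [query] for w in doc})
    df = {w: sum(1 for doc in documents if w in doc) for w in vocab}

    def vec(words):
        tf = Counter(words)
        # formula (1): w = TF * log_e(n / d); words absent from every document get weight 0
        return [tf[w] * math.log(n / df[w]) if df[w] else 0.0 for w in vocab]

    return [vec(doc) for doc in documents], vec(query)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# The best-matching picture is the document with the largest cosine similarity:
# doc_vecs, q_vec = tfidf_vectors(documents, query)
# best = max(range(len(doc_vecs)), key=lambda i: cosine(doc_vecs[i], q_vec))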
The above description merely illustrates a few embodiments of the present invention and should not be taken as limiting its scope; all equivalent changes, modifications, or equivalent enlargements and reductions made in accordance with the spirit of the present invention shall be considered to fall within the scope of the present invention.

Claims (3)

1. A video retrieval method based on natural language description is characterized by comprising the following steps:
step 1, extracting pictures frame by frame from an input video file, resizing the pictures to a fixed size, naming each picture by its frame number and saving it;
step 2, extracting a text description for each picture, wherein each picture is described by one sentence, and creating, for each picture, a video description file with the video file path, the frame number of the picture and the text description of the picture as fields;
and step 3, searching the video description file according to the query sentence input by the user to obtain the pictures in the video files whose text descriptions match.
2. The video retrieval method based on natural language description according to claim 1, wherein the method for extracting the text description of a picture in step 2 comprises:
step 2.1, extracting a feature map of each picture using the convolutional network of a DenseCap model;
step 2.2, determining candidate regions and extracting feature vectors within them: first, the feature map is input into a fully convolutional network; each pixel of the feature map, taken as an anchor point, is mapped back into the original image, anchor boxes (initial boxes) of different aspect ratios and sizes are drawn around the anchor points, and the localization layer predicts a confidence score and position information for each initial box with a regression model; then non-maximum suppression is applied, discarding any initial box whose overlap with a higher-scoring box exceeds 70%, to obtain the candidate boxes; finally, the region inside each candidate box is extracted into a feature vector of fixed size by bilinear interpolation, and all the feature vectors form a feature matrix;
step 2.3, flattening the feature matrix of each picture into one-dimensional column vectors using fully connected layers;
step 2.4, inputting the one-dimensional column vector into an RNN to obtain a code x_{-1}; constructing a word vector sequence of length T+2, x_{-1}, x_0, x_1, x_2, …, x_T, where x_0 is the start flag and x_1, x_2, …, x_T are the encodings of the words of the picture's text description; inputting the vector sequence into the RNN to train a prediction model; inputting x_{-1} and x_0 into the prediction model to obtain a word vector y_0, predicting the first word from y_0, then feeding the first word into the next RNN step to predict the second word, and so on until the output word is the END flag, which yields the text description of the picture.
3. The video retrieval method based on natural language description according to claim 1, wherein the step 3 specifically comprises:
step 3.1, reading the video description file, inputting the text descriptions of the pictures in the video description file into a word segmentation component, removing punctuation marks and stop words, and performing word segmentation to obtain lemmas; inputting the lemmas into a language processing component, which converts them to lower case and reduces them to root words; the root words constitute the index;
step 3.2, performing lexical analysis on the query sentence input by the user to identify words and keywords; performing syntactic analysis to build a syntax tree according to the syntactic rules of the query sentence; performing language processing on the query sentence; and searching the index to obtain the documents, i.e. the picture text descriptions, that match the syntax tree;
step 3.3, each obtained document and query sentence is regarded as a word sequence, and the weight of each word is calculated according to the following formula:
w = TF × log_e(n/d)    (1)
in the formula, w is the weight, TF is the number of times the word occurs in the document, d is the number of documents containing the word, and n is the total number of documents;
and replacing each word in each document and in the query sentence with its weight to obtain a vector representation of each document and of the query sentence, and computing the cosine similarity between each document vector and the query sentence vector; the picture corresponding to the document with the largest cosine similarity is the picture being queried.
CN202010467416.XA 2020-05-28 2020-05-28 Video retrieval method based on natural language description Active CN111651635B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010467416.XA (CN111651635B) | 2020-05-28 | 2020-05-28 | Video retrieval method based on natural language description

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010467416.XA (CN111651635B) | 2020-05-28 | 2020-05-28 | Video retrieval method based on natural language description

Publications (2)

Publication Number | Publication Date
CN111651635A | 2020-09-11
CN111651635B | 2023-04-28

Family

ID=72346989

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202010467416.XA (granted as CN111651635B) | Video retrieval method based on natural language description | 2020-05-28 | 2020-05-28

Country Status (1)

Country Link
CN (1) CN111651635B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925949A (en) * 2021-02-24 2021-06-08 超参数科技(深圳)有限公司 Video frame data sampling method and device, computer equipment and storage medium
CN113468371A (en) * 2021-07-12 2021-10-01 公安部第三研究所 Method, system, device, processor and computer readable storage medium for realizing natural sentence image retrieval
CN113963304A (en) * 2021-12-20 2022-01-21 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN115495615A (en) * 2022-11-15 2022-12-20 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207966A (en) * 2011-06-01 2011-10-05 华南理工大学 Video content quick retrieving method based on object tag
KR20160055511A (en) * 2014-11-10 2016-05-18 주식회사 케이티 Apparatus, method and system for searching video using rhythm
US9361523B1 (en) * 2010-07-21 2016-06-07 Hrl Laboratories, Llc Video content-based retrieval
CN105843930A (en) * 2016-03-29 2016-08-10 乐视控股(北京)有限公司 Video search method and device
KR20160099289A (en) * 2015-02-12 2016-08-22 대전대학교 산학협력단 Method and system for video search using convergence of global feature and region feature of image
CN106708929A (en) * 2016-11-18 2017-05-24 广州视源电子科技股份有限公司 Video program search method and device
CN107229737A (en) * 2017-06-14 2017-10-03 广东小天才科技有限公司 The method and electronic equipment of a kind of video search
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN108345679A (en) * 2018-02-26 2018-07-31 科大讯飞股份有限公司 A kind of audio and video search method, device, equipment and readable storage medium storing program for executing
CN109145763A (en) * 2018-07-27 2019-01-04 天津大学 Video monitoring pedestrian based on natural language description searches for image text fusion method
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9361523B1 (en) * 2010-07-21 2016-06-07 Hrl Laboratories, Llc Video content-based retrieval
CN102207966A (en) * 2011-06-01 2011-10-05 华南理工大学 Video content quick retrieving method based on object tag
KR20160055511A (en) * 2014-11-10 2016-05-18 주식회사 케이티 Apparatus, method and system for searching video using rhythm
KR20160099289A (en) * 2015-02-12 2016-08-22 대전대학교 산학협력단 Method and system for video search using convergence of global feature and region feature of image
CN105843930A (en) * 2016-03-29 2016-08-10 乐视控股(北京)有限公司 Video search method and device
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN106708929A (en) * 2016-11-18 2017-05-24 广州视源电子科技股份有限公司 Video program search method and device
CN107229737A (en) * 2017-06-14 2017-10-03 广东小天才科技有限公司 The method and electronic equipment of a kind of video search
CN108345679A (en) * 2018-02-26 2018-07-31 科大讯飞股份有限公司 A kind of audio and video search method, device, equipment and readable storage medium storing program for executing
CN110598048A (en) * 2018-05-25 2019-12-20 北京中科寒武纪科技有限公司 Video retrieval method and video retrieval mapping relation generation method and device
CN109145763A (en) * 2018-07-27 2019-01-04 天津大学 Video monitoring pedestrian based on natural language description searches for image text fusion method
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LUCA ROSSETTO et al.: "Multimodal Video Retrieval with the 2017 IMOTION System" *
VRUSHALI A. WANKHEDE: "Content-based image retrieval from videos using CBIR and ABIR algorithm", IEEE *
ZHU Aihong et al.: "Research on Key Technologies of Content-Based Video Retrieval", Journal of Intelligence *
HU Zhijun; XU Yong: "A Survey of Content-Based Video Retrieval" *
HU Zhijun et al.: "A Survey of Content-Based Video Retrieval", Computer Science *
YAN Junfei; WANG Song; LI Jun; WU Gang; YAN Qingquan: "A Video Retrieval Method for Video-on-Demand Systems" *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925949A (en) * 2021-02-24 2021-06-08 超参数科技(深圳)有限公司 Video frame data sampling method and device, computer equipment and storage medium
CN113468371A (en) * 2021-07-12 2021-10-01 公安部第三研究所 Method, system, device, processor and computer readable storage medium for realizing natural sentence image retrieval
CN113963304A (en) * 2021-12-20 2022-01-21 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN113963304B (en) * 2021-12-20 2022-06-28 山东建筑大学 Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN115495615A (en) * 2022-11-15 2022-12-20 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN115495615B (en) * 2022-11-15 2023-02-28 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text

Also Published As

Publication number Publication date
CN111651635B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111581510B (en) Shared content processing method, device, computer equipment and storage medium
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN111651635B (en) Video retrieval method based on natural language description
CN109815364B (en) Method and system for extracting, storing and retrieving mass video features
CN102549603B (en) Relevance-based image selection
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN114342353A (en) Method and system for video segmentation
WO2018162896A1 (en) Multi-modal image search
CN111191075B (en) Cross-modal retrieval method, system and storage medium based on dual coding and association
CN110083729B (en) Image searching method and system
CN113836992B (en) Label identification method, label identification model training method, device and equipment
CN108509521A (en) A kind of image search method automatically generating text index
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Li et al. Adapting clip for phrase localization without further training
CN114418032A (en) Five-modal commodity pre-training method and retrieval system based on self-coordination contrast learning
CN113392265A (en) Multimedia processing method, device and equipment
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN113449066B (en) Method, processor and storage medium for storing cultural relic data by using knowledge graph
EP3096243A1 (en) Methods, systems and apparatus for automatic video query expansion
Saleem et al. Stateful human-centered visual captioning system to aid video surveillance
CN115455249A (en) Double-engine driven multi-modal data retrieval method, equipment and system
Choe et al. Semantic video event search for surveillance video
CN115098646A (en) Multilevel relation analysis and mining method for image-text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant