CN114282049A - Video retrieval method, device, equipment and storage medium - Google Patents

Video retrieval method, device, equipment and storage medium

Info

Publication number
CN114282049A
CN114282049A
Authority
CN
China
Prior art keywords
text
video
similarity
feature representation
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111566652.8A
Other languages
Chinese (zh)
Inventor
贺峰
汪琦
冯知凡
柴春光
朱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111566652.8A
Publication of CN114282049A
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a video retrieval method, apparatus, device, and storage medium, relating to the field of artificial intelligence, and specifically to deep learning, knowledge graphs, and computer vision; it is applicable to video retrieval scenarios. The specific implementation scheme is as follows: determining a first text feature representation of a search text in a video search request; determining a second text feature representation of a candidate video according to subtitle information of the candidate video; determining a visual feature representation of the candidate video according to image information of the candidate video; and selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation. With this technical scheme, the video a user needs can be retrieved efficiently and accurately from a massive collection of videos.

Description

Video retrieval method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to deep learning, knowledge graph, and computer vision technologies, applicable to video retrieval scenarios.
Background
With the development of the internet and artificial intelligence technology, video data has grown rapidly. In the face of an ever-increasing mass of videos, an efficient and accurate video retrieval technology is needed.
Disclosure of Invention
The disclosure provides a video retrieval method, a video retrieval device, video retrieval equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a video retrieval method, including:
determining a first text feature representation of a search text in a video search request;
determining a second text feature representation of a candidate video according to subtitle information of the candidate video;
determining visual feature representation of the candidate video according to the image information of the candidate video;
and selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate videos and the second text feature representation.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a video retrieval method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a video retrieval method according to any one of the embodiments of the present disclosure.
According to the disclosed technology, the video required by the user can be efficiently and accurately retrieved from a massive number of videos.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a video retrieval method provided according to an embodiment of the present disclosure;
fig. 2 is a flow chart of another video retrieval method provided according to an embodiment of the present disclosure;
fig. 3 is a flowchart of another video retrieval method provided according to an embodiment of the present disclosure;
fig. 4 is a flowchart of another video retrieval method provided according to an embodiment of the present disclosure;
fig. 5 is a flowchart of yet another video retrieval method provided in accordance with an embodiment of the present disclosure;
fig. 6 is a schematic diagram of a video retrieval process provided according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a video retrieval apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a video retrieval method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A defining feature of the internet era is the multiplication of data. As McKinsey put it: "Data has penetrated every industry and business function area today and has become an important factor of production. The mining and application of massive data heralds a new wave of productivity growth and a wave of consumer surplus." Video data is no exception; in the face of an ever-growing video library, an efficient and accurate retrieval technique is needed.
Traditional retrieval techniques first convert the video into text information and then implement video retrieval through text search. This approach does not represent the video sufficiently, cannot effectively distinguish between videos, and yields low retrieval accuracy. Based on this, the present disclosure provides a new video retrieval method.
Fig. 1 is a flowchart of a video retrieval method provided according to an embodiment of the present disclosure. This embodiment is applicable to video retrieval scenarios. The method can be executed by a video retrieval apparatus, which can be implemented in software and/or hardware and integrated into an electronic device with a video retrieval function, such as a server.
As shown in fig. 1, the video retrieval method of the present embodiment may include:
s101, determining a first text characteristic representation of a search text in the video search request.
In this embodiment, the video search request refers to a request for video search, and may include a search text and the like. The first text feature representation is a representation of the feature of the search text in space, and may be represented in a vector or matrix form.
Optionally, the search text may include various data types, such as numerical data (e.g., the duration of a video), text data (e.g., the title of a movie or television show, the name of an actor, etc.), and enumerated data (e.g., the type of video). For convenience of processing, this embodiment may unify the different data in the search text into vector form, with each data type vectorized in its own way.
For example, numerical data may be used directly, or after normalization, as a numerical vector. For text data, a word segmenter (e.g., jieba) can split the text into words, a pre-trained word2vec model can vectorize each word into a semantic vector, and the semantic vectors of all words can be averaged to obtain an average vector; alternatively, vectorization can be performed with ERNIE (Enhanced Representation through kNowledge IntEgration). Enumerated data can be converted into one-hot vectors whose length is determined by the number of enumeration values.
Furthermore, the vectors obtained by vectorizing the data of different types in the search text can be concatenated in a set order to obtain the first text feature representation of the search text.
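As a concrete illustration, the following sketch shows this mixed-type vectorization; the helper names, the example fields, and the fixed field order are assumptions for illustration, not details fixed by this disclosure.

```python
import numpy as np

def encode_numeric(value, lo, hi):
    """Min-max normalize a numeric field (e.g. video duration) to [0, 1]."""
    return np.array([(value - lo) / (hi - lo)], dtype=np.float32)

def encode_text(tokens, word_vectors, dim=64):
    """Average pre-trained word vectors (a stand-in for word2vec) over tokens."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

def encode_enum(value, vocabulary):
    """One-hot encode an enumerated field; length equals the number of values."""
    one_hot = np.zeros(len(vocabulary), dtype=np.float32)
    one_hot[vocabulary.index(value)] = 1.0
    return one_hot

# Concatenate the per-field vectors in a set order to form the text feature.
word_vectors = {"comedy": np.random.rand(64).astype(np.float32)}  # stand-in vectors
first_text_feature = np.concatenate([
    encode_numeric(95.0, lo=0.0, hi=240.0),            # duration in minutes
    encode_text(["comedy"], word_vectors),             # title tokens
    encode_enum("movie", ["movie", "tv", "variety"]),  # video type
])
```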
Alternatively, the first text feature representation of the search text in the video search request may be determined with a neural network model (e.g., a Transformer model). Specifically, the search text is input into the neural network model and processed by the model to obtain the first text feature representation of the search text.
S102, determining a second text feature representation of the candidate video according to the subtitle information of the candidate video.
In this embodiment, the candidate videos are all videos in the video library. Optionally, the subtitle information of the candidate video may be acquired from a pre-constructed subtitle library, or the subtitle information may be extracted from the candidate video by using an Optical Character Recognition (OCR) model.
The second text feature representation characterizes the subtitle information of the candidate video in a feature space and may take the form of a vector or a matrix.
Optionally, the subtitle information may include various data types, such as numerical data (e.g., the duration of a video), text data (e.g., the title of a movie or television show, the name of an actor, etc.), and enumerated data (e.g., the type of video). For convenience of processing, this embodiment may unify the different data in the subtitle information into vector form, with each data type vectorized in its own way.
For example, numerical data may be used directly, or after normalization, as a numerical vector. For text data, a word segmenter (e.g., jieba) can split the text into words, a pre-trained word2vec model can vectorize each word into a semantic vector, and the semantic vectors of all words can be averaged to obtain an average vector; alternatively, vectorization can be performed with ERNIE (Enhanced Representation through kNowledge IntEgration). Enumerated data can be converted into one-hot vectors whose length is determined by the number of enumeration values.
Furthermore, the vectors obtained by vectorizing the data of different types in the subtitle information are concatenated in a set order to obtain the second text feature representation of the subtitle information.
Alternatively, the second text feature representation of the subtitle information may be determined with a neural network model (e.g., a Transformer model). Specifically, the subtitle information is input into the neural network model and processed by the model to obtain the second text feature representation of the subtitle information.
S103, determining the visual characteristic representation of the candidate video according to the image information of the candidate video.
In this embodiment, the image information refers to the image information of video frames in the candidate video and may include, but is not limited to, object information, person information, and action-related information. The visual feature representation characterizes the image information of the candidate video in a feature space and may take the form of a vector or a matrix.
Alternatively, the visual feature representation of the candidate video may be determined from the image information of the candidate video based on a neural network model. Specifically, the image information of the candidate video is input into the neural network model, and the visual feature representation of the candidate video is obtained through model processing.
Further, key frames of the candidate video can be determined, and the visual feature representation of the candidate video can then be determined, based on the neural network model, from the image information of those key frames.
S104, selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate video, and the second text feature representation.
In this embodiment, the target video refers to a video closest to the search text, that is, a video that best meets the search intention of the user.
Optionally, for each candidate video, the visual feature representation and the second text feature representation of the candidate video are concatenated to obtain an overall feature representation of the candidate video, and the similarity between the first text feature representation and the overall feature representation is determined. A target video is then selected from the candidate videos according to the similarity between each candidate video and the search text; specifically, the candidate video with the maximum similarity may be taken as the target video.
It should be noted that, before determining the similarity between the first text feature representation and the overall feature representation, their dimensions are aligned: the lower-dimensional representation may be zero-padded so that its dimension matches that of the higher-dimensional representation.
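A minimal sketch of this selection step, under stated assumptions: zero-padding for dimension alignment as described above, and cosine similarity (one of the measures named later in this disclosure) as the similarity:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with zero-padding so the two dimensions agree."""
    dim = max(a.shape[0], b.shape[0])
    a = np.pad(a, (0, dim - a.shape[0]))
    b = np.pad(b, (0, dim - b.shape[0]))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_target(query_feat, candidates):
    """candidates: list of (video_id, visual_feat, subtitle_text_feat)."""
    best_id, best_sim = None, -1.0
    for video_id, visual_feat, text_feat in candidates:
        overall = np.concatenate([visual_feat, text_feat])  # overall feature
        sim = cosine(query_feat, overall)
        if sim > best_sim:
            best_id, best_sim = video_id, sim
    return best_id, best_sim
```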
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; the visual feature representation of the candidate video is determined according to the image information of the candidate video; and finally the target video is selected from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation. Because the video is represented on two levels, its subtitle information and its image information, the feature representation of the video is enriched and the accuracy of the video retrieval result is greatly improved.
Fig. 2 is a flowchart of another video retrieval method according to an embodiment of the present disclosure. On the basis of the above embodiment, it provides an alternative implementation that further refines "determining a first text feature representation of the search text in the video search request" and "determining a second text feature representation of the candidate video according to the subtitle information of the candidate video".
As shown in fig. 2, the method may specifically include:
s201, determining a first text characteristic representation of a search text in the video search request.
Optionally, entity identification may be performed on the search text in the video search request to obtain a first entity in the search text; performing entity chain pointing on a first entity to obtain description information of the first entity; and coding the retrieval text and the description information of the first entity to obtain a first text characteristic representation.
The first entity refers to characteristic fact information contained in the retrieval text, such as a person, an organization, a geographical location and the like. The description information refers to description information related to the entity.
For example, entity identification may be performed on a search text to obtain a first entity in the search text, an entity chain finger model is adopted to perform entity chain finger on the first entity, description information of the first entity is acquired from a knowledge graph, and then the search text and the description information of the first entity are encoded to obtain a first text feature representation.
For example, the search text and the obtained description information may be spliced, and then the spliced information may be encoded to obtain the first text feature representation of the search text. For another example, the search text and the obtained description information may be respectively encoded, and then the obtained encoding results are spliced to obtain the first text feature representation of the search text.
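As an illustration of the two encoding orders, here is a minimal sketch; the entity recognizer, the linker, and the text encoder below are placeholders (assumptions), since this disclosure does not fix concrete models for them:

```python
import numpy as np

def recognize_entities(text):
    """Placeholder NER; a trained entity recognizer would be used in practice."""
    return ["刘德华"] if "刘德华" in text else []

def link_entity(entity):
    """Placeholder entity linking against a knowledge graph."""
    knowledge_graph = {"刘德华": "Hong Kong actor and singer"}
    return knowledge_graph.get(entity, "")

def encode(text, dim=128):
    """Placeholder text encoder standing in for a real model (e.g. a Transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(dim).astype(np.float32)

query = "刘德华主演的电影"
descriptions = " ".join(link_entity(e) for e in recognize_entities(query))

# Option 1: concatenate the texts first, then encode the combined string.
feat_concat_then_encode = encode(query + " " + descriptions)
# Option 2: encode separately, then concatenate the two encodings.
feat_encode_then_concat = np.concatenate([encode(query), encode(descriptions)])
```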
By combining the search text with the description information of the first entity in the search text, the first text feature representation of the search text can be determined with richer features, laying a foundation for accurately determining the target video later.
S202, determining a second text feature representation of the candidate video according to the subtitle information of the candidate video.
Optionally, entity recognition may be performed on the subtitle information of the candidate video to obtain a second entity in the subtitle information; entity linking is performed on the second entity to obtain description information of the second entity; and the subtitle information and the description information of the second entity are encoded to obtain the second text feature representation of the candidate video.
The second entity refers to factual information contained in the subtitle information, such as a person, an organization, or a geographic location.
For example, for each candidate video, Optical Character Recognition (OCR) may be used to recognize the subtitle information in the candidate video; entity recognition is performed on the subtitle information to obtain the second entity; the description information of the second entity is obtained through an entity linking model; and the subtitle information and the obtained description information are encoded to obtain the second text feature representation of the candidate video.
For example, the subtitle information and the obtained description information may be concatenated and the concatenated text encoded to obtain the second text feature representation of the candidate video. As another example, the subtitle information and the obtained description information may be encoded separately and the resulting encodings concatenated to obtain the second text feature representation of the candidate video.
Determining the second text feature representation by combining the subtitle information with the description information of the second entity in the subtitle information enriches the second text features and lays a foundation for accurately determining the target video later.
S203, determining the visual feature representation of the candidate video according to the image information of the candidate video.
S204, selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate video, and the second text feature representation.
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; the visual feature representation of the candidate video is determined according to the image information of the candidate video; and finally the target video is selected from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation. Because the video is represented on two levels, its subtitle information and its image information, the feature representation of the video is enriched and the accuracy of the video retrieval result is greatly improved.
Fig. 3 is a flowchart of still another video retrieval method provided according to an embodiment of the present disclosure. On the basis of the above embodiment, it provides an alternative implementation of "determining the visual feature representation of the candidate video according to the image information of the candidate video".
As shown in fig. 3, the method may specifically include:
s301, determining a first text characteristic representation of a search text in the video search request.
S302, according to the subtitle information of the candidate video, second text characteristic representation of the candidate video is determined.
S303, extracting key frames from the candidate video, and coding the image information of the key frames to obtain the visual feature representation of the candidate video.
Specifically, for each candidate video, key video frames may be extracted from the candidate video, encoded, and feature-aggregated to obtain the visual feature representation of the candidate video.
For example, each key video frame may be encoded to obtain a per-frame feature representation, and the average of the feature representations of all key video frames may be taken as the visual feature representation of the candidate video. Alternatively, all key video frames may be fed into a Transformer to obtain the visual feature representation of the candidate video.
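A minimal sketch of the mean-pooling variant, with a placeholder frame encoder standing in for a real image backbone (an assumption; this disclosure does not name one):

```python
import numpy as np

def encode_frame(frame, dim=256):
    """Placeholder image encoder; a CNN or vision Transformer would be used in practice."""
    rng = np.random.default_rng(int(frame.sum() * 1000) % (2**32))
    return rng.random(dim).astype(np.float32)

def visual_feature(key_frames):
    """Encode each key frame and mean-pool the per-frame features."""
    frame_feats = np.stack([encode_frame(f) for f in key_frames])
    return frame_feats.mean(axis=0)

key_frames = [np.random.rand(224, 224, 3) for _ in range(8)]  # stand-in frames
candidate_visual_feat = visual_feature(key_frames)
```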
S304, selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate video, and the second text feature representation.
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; key frames are extracted from the candidate video and their image information is encoded to obtain the visual feature representation of the candidate video; and the target video is then selected from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation. Extracting the visual feature representation from key video frames, rather than from the entire candidate video, reduces the amount of computation and thereby improves the efficiency of determining the target video.
Fig. 4 is a flowchart of still another video retrieval method provided according to an embodiment of the present disclosure. On the basis of the above embodiment, it provides an alternative implementation that further refines "selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate video, and the second text feature representation".
As shown in fig. 4, the method may specifically include:
s401, determining a first text characteristic representation of a search text in the video search request.
S402, determining second text characteristic representation of the candidate video according to the subtitle information of the candidate video.
And S403, determining the visual feature representation of the candidate video according to the image information of the candidate video.
S404, determining a first similarity between the candidate video and the search text according to the similarity between the first text characteristic representation and the second text characteristic representation and the similarity between the first text characteristic representation and the visual characteristic representation.
In this embodiment, for each candidate video, the similarity between the second text feature representation of the candidate video and the first feature representation of the search text and the similarity between the visual feature representation of the candidate video and the first text feature representation of the search text are calculated, and the result of adding the two similarities is taken as the first similarity between the candidate video and the search text.
Furthermore, two similarity weights can be given, the two similarities and the corresponding weights are multiplied respectively, the multiplied results are added, and the added result is used as the first similarity between the candidate video and the search text.
The similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation may be measured with cosine similarity or with the Jaccard similarity coefficient, which is not specifically limited here.
It should be noted that, before determining these similarities, the dimensions of the two representations being compared are aligned: the lower-dimensional representation may be zero-padded so that its dimension matches that of the higher-dimensional representation.
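A minimal sketch of S404 under these assumptions (cosine similarity with zero-padding; the weights are illustrative, and equal weights reduce to the plain sum described first):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity with zero-padding so the two dimensions agree."""
    dim = max(a.shape[0], b.shape[0])
    a = np.pad(a, (0, dim - a.shape[0]))
    b = np.pad(b, (0, dim - b.shape[0]))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def first_similarity(query_feat, subtitle_feat, visual_feat, w_text=0.5, w_visual=0.5):
    """Weighted sum of the text-text and text-visual similarities."""
    sim_text = cosine(query_feat, subtitle_feat)   # first vs. second text feature
    sim_visual = cosine(query_feat, visual_feat)   # first text vs. visual feature
    return w_text * sim_text + w_visual * sim_visual
```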
S405, selecting a target video from the candidate videos according to the first similarity between the candidate video and the search text.
In this embodiment, the target video is selected from the candidate videos according to the first similarity between each candidate video and the search text; specifically, the candidate video with the maximum first similarity may be taken as the target video.
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; the visual feature representation of the candidate video is determined according to the image information of the candidate video; the first similarity between the candidate video and the search text is then determined from the similarity between the first and second text feature representations and the similarity between the first text feature representation and the visual feature representation; and finally the target video is selected from the candidate videos according to the first similarity between each candidate video and the search text. By combining text-to-text similarity with text-to-visual similarity, the match between the search text and a candidate video is measured from two perspectives, improving the accuracy of the video retrieval result.
Fig. 5 is a flowchart of still another video retrieval method provided according to an embodiment of the present disclosure. On the basis of the above embodiment, it provides an alternative implementation that further refines "determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation".
As shown in fig. 5, the method may specifically include:
s501, determining a first text characteristic representation of a search text in the video search request.
And S502, determining a second text characteristic representation of the candidate video according to the subtitle information of the candidate video.
S503, according to the image information of the candidate video, the visual feature representation of the candidate video is determined.
S504, determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation.
In one implementation, person recognition can be performed on the candidate video to obtain person information in the candidate video; a second similarity between the candidate video and the search text is determined according to the person information in the candidate video and the person information associated with the search text; and the first similarity between the candidate video and the search text is determined according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity.
The person information may include, but is not limited to, names of persons, names of characters in movies and TV dramas, and names of variety shows.
Specifically, person recognition may be performed on the candidate video based on face recognition technology to obtain the person information in the candidate video, and the similarity between the person information in the candidate video and the person information associated with the search text may then be calculated with cosine similarity or the Jaccard similarity coefficient to obtain the second similarity between the candidate video and the search text. Alternatively, the hit rate of the candidate video's person information within the person information associated with the search text can be determined and used as the second similarity between the candidate video and the search text.
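A minimal sketch of the two variants just described, assuming the person information on each side has already been reduced to a set of names by the upstream face recognition and query analysis:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity coefficient over two name sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def hit_rate(video_persons: set, query_persons: set) -> float:
    """Fraction of the video's person names found among the query's person names."""
    if not video_persons:
        return 0.0
    return len(video_persons & query_persons) / len(video_persons)

video_persons = {"刘德华", "吴京"}
query_persons = {"刘德华"}
second_similarity = jaccard(video_persons, query_persons)   # or hit_rate(...)
```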
Furthermore, a first similarity between the candidate video and the search text is determined according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity.
For example, the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity may be added, and the added result may be used as the first similarity between the candidate video and the search text.
Further, the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity may each be assigned a different weight; each term is multiplied by its corresponding weight, the products are summed, and the sum is taken as the first similarity between the candidate video and the search text.
It can be understood that introducing person information further broadens the angles from which text and video are matched, further improving the accuracy of the video retrieval result.
In another implementation, keywords can be extracted from the search text and the number of occurrences of the keywords in the candidate video (the target occurrence count) determined; a third similarity between the candidate video and the search text is determined according to the target occurrence count and the maximum and minimum occurrence counts of words in the candidate video; and the first similarity between the candidate video and the search text is determined according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the third similarity.
For example, keywords may be extracted from the search text with a keyword extraction technique such as a word-graph model or a pre-trained language model (PLM); the target occurrence count of the keywords in the candidate video is determined; and the third similarity between the candidate video and the search text is determined from the target occurrence count and the maximum and minimum occurrence counts of words in the candidate video, for example by the following formula:
sim_count = (cnt - cnt_min) / (cnt_max - cnt_min)
where sim_count represents the third similarity between the candidate video and the search text, cnt represents the target occurrence count, cnt_min represents the minimum occurrence count of a word in the candidate video, and cnt_max represents the maximum occurrence count of a word in the candidate video.
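A minimal sketch of this count-based similarity, assuming the min-max normalized form reconstructed above:

```python
def third_similarity(cnt: int, cnt_min: int, cnt_max: int) -> float:
    """Min-max normalize the keyword's occurrence count against the video's word counts."""
    if cnt_max == cnt_min:  # degenerate case: every word occurs equally often
        return 0.0
    return (cnt - cnt_min) / (cnt_max - cnt_min)

# e.g. the keyword occurs 12 times; word counts in the video span [1, 40]
sim_count = third_similarity(12, 1, 40)  # ≈ 0.28
```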
Further, the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the third similarity may be added, and the sum taken as the first similarity between the candidate video and the search text.
Alternatively, these three terms may each be assigned a different weight; each term is multiplied by its corresponding weight, the products are summed, and the sum is taken as the first similarity between the candidate video and the search text.
It can be understood that introducing the keyword occurrence count further broadens the angles from which text and video are matched, further improving the accuracy of the video retrieval result.
In yet another embodiment, a first similarity between the candidate video and the search text is determined based on a similarity between the first text feature representation and the second text feature representation, a similarity between the first text feature representation and the visual feature representation, the second similarity, and the third similarity.
S505, selecting a target video from the candidate videos according to the first similarity between the candidate video and the search text.
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; the visual feature representation of the candidate video is determined according to the image information of the candidate video; the first similarity between the candidate video and the search text is then determined from the similarity between the first and second text feature representations and the similarity between the first text feature representation and the visual feature representation; and finally the target video is selected from the candidate videos according to the first similarity between each candidate video and the search text. By combining text-to-text similarity with text-to-visual similarity, the match between the search text and a candidate video is measured from two perspectives, improving the accuracy of the video retrieval result.
Fig. 6 is a schematic diagram of a video retrieval process provided according to an embodiment of the present disclosure. The embodiment provides a preferred example on the basis of the above embodiment, and relates to a model training process.
As shown in fig. 6, the specific process is as follows:
in this embodiment, the first text feature representation, the second text feature representation, and the visual feature representation in the above embodiments may be obtained by a target feature extraction model. Illustratively, the target feature extraction model may be trained as follows.
First, a large number of video-text data pairs, each containing a video and its title, can be collected from the internet and used as training sample data. The training sample data is then preprocessed and fed into an initial model for training, yielding the target feature extraction model. The initial model comprises a first feature extraction submodel, a second feature extraction submodel, and a third feature extraction submodel: the first extracts features of the title in a video-text data pair, the second extracts features of the subtitle information, and the third extracts features of the image information. For example, each of the three submodels may be a Transformer model.
The specific training process may be as follows. Entity recognition is performed on the title in a video-text data pair to obtain a first entity of the title; entity linking is performed on the first entity to obtain its description information; and the title together with the description information of the first entity is input into the first feature extraction submodel to obtain a first text feature representation.
Meanwhile, the video part of the data pair is preprocessed: subtitle information is extracted from the video with OCR, and image information is obtained by frame extraction with ffmpeg. Entity recognition is then performed on the subtitle information to obtain a second entity; entity linking is performed on the second entity to obtain its description information; and the subtitle information together with the description information of the second entity is input into the second feature extraction submodel to obtain a second text feature representation. At the same time, the image information is input into the third feature extraction submodel to obtain a visual feature representation.
The similarity between the first text feature representation and the second text feature representation is then used as a first loss for the first and second feature extraction submodels, and the similarity between the first text feature representation and the visual feature representation as a second loss for the first and third feature extraction submodels; the sum of the first loss and the second loss is taken as the total loss of the initial model. The initial model is trained according to this total loss to obtain the target feature extraction model.
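A minimal sketch of this dual-loss objective, written with PyTorch as an illustrative assumption (this disclosure does not name a framework) and interpreting each similarity-based loss as one minus the cosine similarity, so that minimizing the loss pulls the title representation toward the subtitle and visual representations:

```python
import torch
import torch.nn.functional as F

def total_loss(title_feat, subtitle_feat, visual_feat):
    """Sum of the text-text loss and the text-visual loss."""
    loss_text = 1.0 - F.cosine_similarity(title_feat, subtitle_feat, dim=-1).mean()
    loss_visual = 1.0 - F.cosine_similarity(title_feat, visual_feat, dim=-1).mean()
    return loss_text + loss_visual

# Stand-ins for the outputs of the three feature extraction submodels.
title_feat = torch.randn(16, 128, requires_grad=True)     # first submodel (titles)
subtitle_feat = torch.randn(16, 128, requires_grad=True)  # second submodel (subtitles)
visual_feat = torch.randn(16, 128, requires_grad=True)    # third submodel (frames)
loss = total_loss(title_feat, subtitle_feat, visual_feat)
loss.backward()  # gradients would flow into the submodels during training
```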
After the target feature extraction model is obtained, the search text is preprocessed to obtain the description information of the first entity, and the candidate video is preprocessed to obtain the description information of the second entity in its subtitle information as well as its image information. The search text, the description information of the first entity, the subtitle information, the description information of the second entity, and the image information are input into the target feature extraction model to obtain the first text feature representation, the second text feature representation, and the visual feature representation, and the target video is selected from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation.
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; the visual feature representation of the candidate video is determined according to the image information of the candidate video; and finally the target video is selected from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation. Because the video is represented on two levels, its subtitle information and its image information, the feature representation of the video is enriched and the accuracy of the video retrieval result is greatly improved.
Fig. 7 is a schematic structural diagram of a video retrieval apparatus according to an embodiment of the present disclosure. This embodiment is applicable to video retrieval scenarios. The apparatus can be implemented in software and/or hardware, can implement the video retrieval method of any embodiment of the disclosure, and can be integrated into an electronic device with a video retrieval function, such as a server.
As shown in fig. 7, the video search apparatus 700 includes:
a first text feature determination module 701, configured to determine a first text feature representation of a search text in a video search request;
a second text feature determining module 702, configured to determine a second text feature representation of the candidate video according to the subtitle information of the candidate video;
a visual feature determination module 703, configured to determine a visual feature representation of the candidate video according to the image information of the candidate video;
and a target video selection module 704, configured to select a target video from the candidate videos according to the first text feature representation, the visual feature representations of the candidate videos, and the second text feature representation.
According to this technical scheme, the first text feature representation of the search text in the video search request is determined; the second text feature representation of the candidate video is determined according to the subtitle information of the candidate video; the visual feature representation of the candidate video is determined according to the image information of the candidate video; and finally the target video is selected from the candidate videos according to the first text feature representation, the visual feature representation, and the second text feature representation. Because the video is represented on two levels, its subtitle information and its image information, the feature representation of the video is enriched and the accuracy of the video retrieval result is greatly improved.
Further, the first text feature determining module 701 includes:
a first entity determining unit, configured to perform entity recognition on the search text in the video search request to obtain a first entity in the search text;
a first description information determining unit, configured to perform entity linking on the first entity to obtain the description information of the first entity;
and a first text feature determining unit, configured to encode the search text and the description information of the first entity to obtain the first text feature representation.
Further, the second text feature determination module 702 includes:
a second entity determining unit, configured to perform entity recognition on the subtitle information of the candidate video to obtain a second entity in the subtitle information;
a second description information determining unit, configured to perform entity linking on the second entity to obtain the description information of the second entity;
and a second text feature determining unit, configured to encode the subtitle information and the description information of the second entity to obtain the second text feature representation of the candidate video.
Further, the visual feature determination module 703 is specifically configured to:
extract key frames from the candidate video and encode the image information of the key frames to obtain the visual feature representation of the candidate video.
Further, the target video selecting module 704 includes:
a first similarity determining unit, configured to determine a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation;
and a target video selecting unit, configured to select the target video from the candidate videos according to the first similarity between the candidate video and the search text.
Further, the first similarity determining unit is specifically configured to:
performing person recognition on the candidate video to obtain person information in the candidate video;
determining a second similarity between the candidate video and the search text according to the person information in the candidate video and the person information associated with the search text;
and determining the first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity.
Further, the first similarity determining unit is further specifically configured to:
extracting keywords from the search text and determining the target occurrence count of the keywords in the candidate video;
determining a third similarity between the candidate video and the search text according to the target occurrence count and the maximum and minimum occurrence counts of words in the candidate video;
and determining the first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the third similarity.
In the technical solution of the present disclosure, the collection, storage, and use of the video data, subtitle data, text data, and so on involved all comply with the relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. Various programs and data required for the operation of the electronic device 800 can also be stored in the RAM 803. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the methods and processes described above, such as the video retrieval method. For example, in some embodiments, the video retrieval method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the video retrieval method described above can be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the video retrieval method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology, and the like.
Cloud computing (cloud computing) refers to a technical system in which an elastically scalable pool of shared physical or virtual resources is accessed over a network, where the resources may include servers, operating systems, networks, software, applications, storage devices, and the like, and may be deployed and managed on demand in a self-service manner. Cloud computing can provide efficient and powerful data processing capabilities for technical applications and model training in artificial intelligence, blockchain, and other fields.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A video retrieval method, comprising:
determining a first text feature representation of a search text in a video retrieval request;
determining a second text feature representation of a candidate video according to subtitle information of the candidate video;
determining a visual feature representation of the candidate video according to image information of the candidate video;
and selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate videos and the second text feature representation.
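By way of non-limiting illustration, the following Python sketch shows one way the retrieval flow of claim 1 could be wired together. The encoder callables, the dictionary layout of a candidate video, and the use of cosine similarity with an unweighted sum are all assumptions made for illustration; the claim does not prescribe them.

    import numpy as np

    def cosine(a, b):
        # similarity between two feature vectors
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(search_text, candidates, encode_text, encode_subtitles, encode_keyframes):
        q = encode_text(search_text)                  # first text feature representation
        best_score, best_video = float("-inf"), None
        for video in candidates:
            t = encode_subtitles(video["subtitles"])  # second text feature representation
            v = encode_keyframes(video["frames"])     # visual feature representation
            score = cosine(q, t) + cosine(q, v)       # one possible combination rule
            if score > best_score:
                best_score, best_video = score, video
        return best_video                             # target video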
2. The method of claim 1, wherein the determining a first text feature representation of the search text in the video retrieval request comprises:
performing entity recognition on the search text in the video retrieval request to obtain a first entity in the search text;
performing entity linking on the first entity to obtain description information of the first entity;
and encoding the search text and the description information of the first entity to obtain the first text feature representation.
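A minimal sketch of the recognize-link-encode flow of claim 2, assuming hypothetical recognize_entities, link_entity, and encode callables (for example, a named-entity recognizer, a knowledge-graph lookup that returns description text, and a text encoder). The same flow applies to the subtitle information of claim 3.

    def first_text_feature(search_text, recognize_entities, link_entity, encode):
        # entity recognition: find entity mentions in the search text
        entities = recognize_entities(search_text)
        # entity linking: resolve each mention to its knowledge-base entry
        # and collect the entry's description information
        descriptions = [link_entity(entity) for entity in entities]
        # jointly encode the search text and the entity descriptions
        return encode(" ".join([search_text] + descriptions))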
3. The method of claim 1, wherein the determining a second text feature representation of the candidate video according to the subtitle information of the candidate video comprises:
performing entity recognition on the subtitle information of the candidate video to obtain a second entity in the subtitle information;
performing entity linking on the second entity to obtain description information of the second entity;
and encoding the subtitle information and the description information of the second entity to obtain the second text feature representation of the candidate video.
4. The method of claim 1, wherein the determining a visual feature representation of the candidate video according to the image information of the candidate video comprises:
extracting key frames from the candidate video, and encoding image information of the key frames to obtain the visual feature representation of the candidate video.
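One possible realization of claim 4, assuming OpenCV for frame access and a hypothetical image_encoder. Uniform sampling stands in for whatever key-frame extraction strategy an implementation actually uses, and mean pooling is only one way to aggregate the frame embeddings.

    import cv2
    import numpy as np

    def visual_feature(video_path, image_encoder, every_n=30):
        # sample one frame every `every_n` frames as a naive key-frame heuristic
        cap = cv2.VideoCapture(video_path)
        embeddings, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % every_n == 0:
                embeddings.append(image_encoder(frame))
            index += 1
        cap.release()
        # pool the per-frame embeddings into one visual feature representation
        return np.mean(embeddings, axis=0)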
5. The method of claim 1, wherein the selecting a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate videos, and the second text feature representation comprises:
determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation;
and selecting a target video from the candidate videos according to the first similarity between the candidate videos and the search text.
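Reusing the cosine helper from the sketch after claim 1, the fusion and selection of claim 5 might look as follows; the equal weights are an illustrative assumption, since the claim does not fix a combination rule.

    def first_similarity(q, t, v, w_tt=0.5, w_tv=0.5):
        # weighted fusion of the text-text and the text-visual similarity
        return w_tt * cosine(q, t) + w_tv * cosine(q, v)

    def select_target(candidates, q):
        # rank candidate videos by first similarity and return the best one;
        # each candidate is assumed to carry its precomputed features
        return max(candidates,
                   key=lambda c: first_similarity(q, c["text_feat"], c["vis_feat"]))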
6. The method of claim 5, wherein the determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation comprises:
performing character recognition on the candidate video to obtain character information in the candidate video;
determining a second similarity between the candidate video and the search text according to the character information in the candidate video and the character information associated with the search text;
determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity.
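A sketch of the second similarity of claim 6, with `ocr` standing in for any character recognition engine that returns the text visible in a frame; simple word overlap is an assumed similarity measure, not the disclosed one.

    def second_similarity(frames, query_words, ocr):
        # character recognition over the candidate video's frames
        recognized = set()
        for frame in frames:
            recognized.update(ocr(frame).split())
        # overlap between the recognized words and the words associated with the search text
        query_words = set(query_words)
        return len(recognized & query_words) / max(len(query_words), 1)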
7. The method of claim 5, wherein the determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation and the similarity between the first text feature representation and the visual feature representation comprises:
extracting keywords from the search text, and determining a target occurrence count of the keywords in the candidate video;
determining a third similarity between the candidate video and the search text according to the target occurrence count and the maximum and minimum occurrence counts of words in the candidate video;
determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the third similarity.
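Claim 7 names three quantities: the keyword's occurrence count in the candidate video and the maximum and minimum occurrence counts of words in that video. A min-max normalization is one natural way to combine them, though the claim itself does not fix the formula.

    def third_similarity(keyword_count, word_counts):
        # normalize the keyword's count against the most and least frequent
        # words in the candidate video (min-max style); an assumed formula
        n_max, n_min = max(word_counts), min(word_counts)
        if n_max == n_min:
            return 0.0
        return (keyword_count - n_min) / (n_max - n_min)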
8. A video retrieval apparatus comprising:
a first text feature determining module, configured to determine a first text feature representation of a search text in a video retrieval request;
a second text feature determining module, configured to determine a second text feature representation of a candidate video according to subtitle information of the candidate video;
a visual feature determining module, configured to determine a visual feature representation of the candidate video according to image information of the candidate video;
and a target video selection module, configured to select a target video from the candidate videos according to the first text feature representation, the visual feature representation of the candidate videos, and the second text feature representation.
9. The apparatus of claim 8, wherein the first text feature determination module comprises:
a first entity determining unit, configured to perform entity recognition on the search text in the video retrieval request to obtain a first entity in the search text;
a first description information determining unit, configured to perform entity linking on the first entity to obtain description information of the first entity;
and a first text feature determining unit, configured to encode the search text and the description information of the first entity to obtain the first text feature representation.
10. The apparatus of claim 8, wherein the second text feature determination module comprises:
a second entity determining unit, configured to perform entity recognition on the subtitle information of the candidate video to obtain a second entity in the subtitle information;
a second description information determining unit, configured to perform entity linking on the second entity to obtain description information of the second entity;
and a second text feature determining unit, configured to encode the subtitle information and the description information of the second entity to obtain the second text feature representation of the candidate video.
11. The apparatus of claim 8, wherein the visual feature determining module is specifically configured to:
extract key frames from the candidate video, and encode image information of the key frames to obtain the visual feature representation of the candidate video.
12. The apparatus of claim 8, wherein the target video selection module comprises:
a first similarity determining unit, configured to determine a first similarity between the candidate video and the search text according to a similarity between the first text feature representation and the second text feature representation and a similarity between the first text feature representation and the visual feature representation;
and a target video selecting unit, configured to select a target video from the candidate videos according to the first similarity between the candidate videos and the search text.
13. The apparatus according to claim 12, wherein the first similarity determining unit is specifically configured to:
performing character recognition on the candidate video to obtain character information in the candidate video;
determining a second similarity between the candidate video and the search text according to the character information in the candidate video and the character information associated with the search text;
determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the second similarity.
14. The apparatus according to claim 12, wherein the first similarity determining unit is further specifically configured to:
extracting keywords from the search text, and determining a target occurrence count of the keywords in the candidate video;
determining a third similarity between the candidate video and the search text according to the target occurrence count and the maximum and minimum occurrence counts of words in the candidate video;
determining a first similarity between the candidate video and the search text according to the similarity between the first text feature representation and the second text feature representation, the similarity between the first text feature representation and the visual feature representation, and the third similarity.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video retrieval method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the video retrieval method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements a video retrieval method according to any one of claims 1-7.
CN202111566652.8A 2021-12-20 2021-12-20 Video retrieval method, device, equipment and storage medium Pending CN114282049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111566652.8A CN114282049A (en) 2021-12-20 2021-12-20 Video retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111566652.8A CN114282049A (en) 2021-12-20 2021-12-20 Video retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114282049A true CN114282049A (en) 2022-04-05

Family

ID=80873310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111566652.8A Pending CN114282049A (en) 2021-12-20 2021-12-20 Video retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114282049A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880517A (en) * 2022-05-27 2022-08-09 支付宝(杭州)信息技术有限公司 Method and device for video retrieval

Similar Documents

Publication Publication Date Title
US20210312139A1 (en) Method and apparatus of generating semantic feature, method and apparatus of training model, electronic device, and storage medium
CN113033622B (en) Training method, device, equipment and storage medium for cross-modal retrieval model
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN114942984B (en) Pre-training and image-text retrieval method and device for visual scene text fusion model
CN114612759B (en) Video processing method, video query method, model training method and model training device
CN114861889B (en) Deep learning model training method, target object detection method and device
CN113407610B (en) Information extraction method, information extraction device, electronic equipment and readable storage medium
EP4134921A1 (en) Method for training video label recommendation model, and method for determining video label
CN113360700A (en) Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN112528641A (en) Method and device for establishing information extraction model, electronic equipment and readable storage medium
CN114861059A (en) Resource recommendation method and device, electronic equipment and storage medium
CN114861758A (en) Multi-modal data processing method and device, electronic equipment and readable storage medium
CN114037059A (en) Pre-training model, model generation method, data processing method and data processing device
CN111988668B (en) Video recommendation method and device, computer equipment and storage medium
CN114282049A (en) Video retrieval method, device, equipment and storage medium
CN113408280A (en) Negative example construction method, device, equipment and storage medium
CN112906368A (en) Industry text increment method, related device and computer program product
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN112559713B (en) Text relevance judging method and device, model, electronic equipment and readable medium
CN115292506A (en) Knowledge graph ontology construction method and device applied to office field
CN113239215A (en) Multimedia resource classification method and device, electronic equipment and storage medium
CN113010782A (en) Demand amount acquisition method and device, electronic equipment and computer readable medium
CN114328855A (en) Document query method and device, electronic equipment and readable storage medium
CN116524516A (en) Text structured information determining method, device, equipment and storage medium
CN115952403A (en) Method and device for evaluating performance of object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination