CN116361511A - Video retrieval method, device and equipment of composite semantics and storage medium - Google Patents

Video retrieval method, device and equipment of composite semantics and storage medium

Info

Publication number
CN116361511A
Authority
CN
China
Prior art keywords
semantic
vector
vectors
fusion
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310325087.9A
Other languages
Chinese (zh)
Inventor
梁超
何姜杉
李星翰
宋维晞
张玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310325087.9A priority Critical patent/CN116361511A/en
Publication of CN116361511A publication Critical patent/CN116361511A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a composite-semantic video retrieval method, device, equipment and storage medium, wherein the method comprises the following steps: extracting visual feature vectors of a video and text feature vectors at the corresponding moments, the features covering characters, behaviors and scenes; fusing the visual feature vectors and the text feature vectors to obtain fusion features; inputting the fusion features into multi-layer perceptrons to obtain character, behavior and scene semantic state vectors; and performing semantic fusion on the character, behavior and scene semantic state vectors to obtain a vector reflecting the semantic scores, which is taken as the multi-modal composite semantic video retrieval result. Introducing the multi-modal fusion method effectively improves the utilization rate of information and thereby improves the retrieval accuracy.

Description

Video retrieval method, device and equipment of composite semantics and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for video retrieval with composite semantics.
Background
With the rapid development of multimedia technology and the popularization of the internet, the scale of video data has grown rapidly. Faced with such varied and massive video data, manually browsing and searching video content is inefficient, so efficiently retrieving videos that meet expectations by machine, i.e. video retrieval, has become an urgent problem.
In recent years, deep learning techniques have been widely applied in the field of video retrieval, and retrieval performance for single-modality, single-semantic queries, such as characters, behaviors and scenes, has improved greatly. As user demands keep rising, query content tends to become more detailed, and a search needs to locate a specific composite semantic composed of single semantics such as characters, places and actions, so composite-semantic video retrieval is gradually becoming mainstream. Current video retrieval technology has difficulty directly meeting this requirement, mainly for the following two reasons: (1) A video contains multi-modal information such as images, text and audio, but current video retrieval mainly ranks by similarity of visual features and largely ignores the audio and text information contained in the video; for composite-semantic video retrieval, however, character lines, background speech and the like carry important semantic information, and vision-only single-modality retrieval can hardly mine the video content to be retrieved fully. (2) Current research on composite-semantic video retrieval mainly retrieves single-semantic instances with different techniques and then performs score fusion (such as weighting, filtering and product fusion) on the independent retrieval results. The problem with this strategy is that the single-semantic retrieval branches are mutually independent and the composite result is obtained only by fusing the scores of the different branches, which ignores the relevance and mutual influence among the semantics of the composite query, making the quality of composite-semantic retrieval hard to guarantee.
Therefore, how to improve the precision of the composite semantic search is a technical problem to be solved at present.
Disclosure of Invention
The invention mainly aims to provide a video retrieval method, device and equipment of composite semantics and a storage medium, which effectively improve the utilization rate of information and further improve the retrieval accuracy by introducing a multi-mode fusion method.
In a first aspect, the present application provides a video retrieval method of compound semantics, the method comprising the steps of:
extracting visual feature vectors of videos and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes;
fusing the visual feature vector and the text feature vector to obtain a fusion feature;
inputting the fusion features into a multi-layer perceptron to obtain character, behavior and scene semantic state vectors;
and carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and taking the vector as a multi-mode composite semantic video retrieval result.
With reference to the first aspect, as an optional implementation manner, tensor product operation is performed on the character, behavior and scene semantic state vectors two by two to obtain three combined semantic state vectors;
quantum observation is carried out on the three combined semantic state vectors to obtain probability vectors;
and carrying out maximum pooling layer processing on the three probability vectors to obtain vectors reflecting the semantic scores, and taking the vectors as multi-mode composite semantic video retrieval results.
With reference to the first aspect, as an optional implementation manner, according to the formula:
|ψ_mn⟩ = |ψ_m⟩ ⊗ |ψ_n⟩
P_mn(|φ_k⟩) = |⟨φ_k | ψ_mn⟩|²
W(|φ_k⟩) = max{ P_mn(|φ_k⟩) }
a final semantic score is calculated, wherein |ψ_m⟩ and |ψ_n⟩ respectively represent the m-th class and the n-th class semantic state vectors, |ψ_mn⟩ represents the combination of the m-th class and the n-th class semantic state vectors, ⊗ represents the tensor product operation, P_mn(|φ_k⟩) represents the probability that the combined state vector |ψ_mn⟩ collapses to the basis vector |φ_k⟩, and W(|φ_k⟩) represents the semantic score of the corresponding basis state.
With reference to the first aspect, as an optional implementation manner, according to the obtained visual state vector and the text state vector, a point-by-point multiplication of the vectors is utilized to obtain interference items corresponding to the visual state vector and the text state vector;
and taking the visual state vector, the text state vector and the corresponding interference item as modal characteristics, and inputting the modal characteristics into a preset multi-modal fusion network to perform modal fusion to obtain fusion characteristics.
With reference to the first aspect, as an optional implementation manner, according to the formula
f_fusion = α·f_vision + β·f_text + γ·(f_vision ⊙ f_text)
the video feature after modality fusion is computed, wherein f_vision is the visual feature, f_text is the text feature, f_fusion is the video feature after modality fusion, ⊙ denotes the point-wise multiplication of vectors, and α, β and γ are hyper-parameters.
With reference to the first aspect, as an optional implementation manner, the preprocessing is performed on the video, where the preprocessing includes: scene segmentation, shot segmentation and key frame extraction;
and extracting the character feature vector, the behavior feature vector and the scene feature vector of the vision in the processed video by using the C3D model, and extracting the character feature vector, the behavior feature vector and the scene feature vector of the text at the corresponding moment by using the BERT model, wherein the text comprises a script and a line.
With reference to the first aspect, as an optional implementation manner, the visual feature vector and the text feature vector at the corresponding moment are mapped to the corresponding d-dimensional Hilbert semantic space by using a convolutional neural network and are converted into a visual state vector and a text state vector in a common space.
In a second aspect, the present application provides a video retrieval apparatus of compound semantics, the apparatus comprising:
the extraction unit is used for extracting visual feature vectors of the video and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes;
the fusion unit is used for fusing the visual feature vector and the text feature vector to obtain fusion features;
the processing unit is used for inputting the fusion characteristics into the multi-layer perceptron to obtain character, behavior and scene semantic state vectors;
the computing unit is used for carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and the vector is used as a multi-mode composite semantic video retrieval result.
In a third aspect, the present application further provides an electronic device, including: a processor; a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of the first aspects.
In a fourth aspect, the present application also provides a computer readable storage medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method of any one of the first aspects.
The application provides a video retrieval method, device and equipment of composite semantics and a storage medium, wherein the method comprises the following steps: extracting visual feature vectors of videos and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes; fusing the visual feature vector and the text feature vector to obtain a fusion feature; inputting the fusion features into a multi-layer perceptron to obtain character, behavior and scene semantic state vectors; and carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and taking the vector as a multi-mode composite semantic video retrieval result. The information utilization rate is effectively improved by introducing the multi-mode fusion method, and the retrieval accuracy is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a video retrieval method of composite semantics provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a video retrieval device with composite semantics provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a composite semantic fusion process provided in an embodiment of the present application;
fig. 4 is a schematic diagram of an electronic device provided in an embodiment of the present application;
fig. 5 is a schematic diagram of a computer readable program medium provided in an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The embodiment of the application provides a video retrieval method, device and equipment of composite semantics and a storage medium, which effectively improve the utilization rate of information and improve the retrieval accuracy by introducing a multi-mode fusion method.
In order to achieve the technical effects, the general idea of the application is as follows:
a video retrieval method of compound semantics, the method comprising the steps of:
s101: and extracting visual feature vectors of the video and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes.
S102: and fusing the visual feature vector and the text feature vector to obtain a fusion feature.
S103: and inputting the fusion characteristics into a multi-layer perceptron to obtain character, behavior and scene semantic state vectors.
S104: semantic fusion is carried out on the character, behavior and scene semantic state vectors to obtain vectors reflecting semantic scores, and the vectors are used as multi-mode composite semantic video retrieval results
Embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart of a video retrieval method of composite semantics provided by the present invention, and as shown in fig. 1, the method includes the steps of:
and step S101, extracting visual feature vectors of the video and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes.
Specifically, preprocessing of scene segmentation, shot segmentation and key frame extraction is performed on the acquired video, character feature vectors, behavior feature vectors and scene feature vectors of vision in the processed video are extracted by using a C3D model, character feature vectors, behavior feature vectors and scene feature vectors of texts at corresponding moments are extracted by using a BERT model, and the texts comprise scripts and lines.
For convenience of understanding and illustration: first, the video is preprocessed to complete scene segmentation, shot segmentation and key frame extraction; on the obtained key frames, the visual character, behavior and scene feature vectors are extracted with a C3D model, and the character, behavior and scene feature vectors of the text (including scripts, lines and the like) at the corresponding moments are extracted with a BERT model.
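As an illustrative sketch only (not the patent's exact extraction pipeline), the following Python snippet shows how a per-segment text feature might be obtained with a pretrained BERT model, and how a clip-level visual feature might be read out of an assumed pretrained C3D backbone; the checkpoint name, the tensor shapes and the `c3d_model` module are assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

def extract_text_feature(text, device="cpu"):
    """Encode a script/line snippet with BERT; the [CLS] vector serves as the text feature."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")       # checkpoint is an assumption
    bert = BertModel.from_pretrained("bert-base-chinese").to(device).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]        # (1, 768) [CLS] embedding

def extract_visual_feature(clip_frames, c3d_model):
    """clip_frames: (1, 3, T, H, W) tensor built around a key frame; c3d_model is an
    assumed pretrained 3-D convolutional backbone returning a clip-level descriptor."""
    with torch.no_grad():
        return c3d_model(clip_frames)             # e.g. a (1, 4096) fully-connected-layer feature
```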
In one embodiment, after extracting the visual feature vector of the video and the text feature vector at the corresponding time, the visual feature vector and the text feature vector at the corresponding time are mapped to the corresponding d-dimensional hilbert semantic space by using a convolutional neural network, and are converted into a visual state vector and a text state vector in a common space.
It can be understood that the obtained visual and text feature vectors are simultaneously mapped into the corresponding Hilbert semantic space by using a CNN, so as to obtain three groups (character, behavior and scene) of state vectors in a common space. In this process, ReLU is used as the activation function to ensure that all feature values are non-negative, the visual and text features sharing the same dimension d, i.e. the state vectors are obtained as
|ψ_m⟩ = ReLU(CNN(X))
Finally, each state vector is normalized to unit length:
|ψ_m⟩ ← |ψ_m⟩ / ‖ |ψ_m⟩ ‖
where CNN represents the mapping network, X represents its input (formed from the visual vectors, text vectors and interference terms), and m can take the values 1, 2 and 3, corresponding to the character, behavior and scene semantics respectively. CNN(X) represents the output obtained after passing through the network, i.e. the semantic state vector, and ReLU is an activation function commonly used in machine learning.
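A minimal sketch of this mapping step, assuming a simple 1-D convolutional projection into a d-dimensional space (the layer sizes, the value of d and the input dimensions are illustrative, not taken from the patent): ReLU keeps all amplitudes non-negative and L2 normalization turns the output into a unit-length state vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateVectorEncoder(nn.Module):
    def __init__(self, in_dim, d=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=3, padding=1),  # treat the feature as a 1-channel sequence
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(d),                    # squeeze the sequence to d positions
            nn.Flatten(),                               # (batch, 8*d)
            nn.Linear(8 * d, d),                        # project into the d-dimensional semantic space
        )

    def forward(self, x):                    # x: (batch, in_dim) visual or text feature
        z = self.proj(x.unsqueeze(1))        # (batch, d)
        z = F.relu(z)                        # non-negative amplitudes
        return F.normalize(z, p=2, dim=-1)   # unit L2 norm -> state vector |psi>

# usage (dimensions are assumptions: 4096-d C3D features, 768-d BERT features)
vision_encoder = StateVectorEncoder(in_dim=4096)
text_encoder = StateVectorEncoder(in_dim=768)
psi_vision = vision_encoder(torch.randn(2, 4096))
psi_text = text_encoder(torch.randn(2, 768))
```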
It should be noted that a vector is a representation in a multidimensional space; a tensor product operation can be performed between two vectors to obtain a new vector that contains all the information of the two original vectors. Other operations exist as well, each representing a different interaction between the two pieces of information. By modeling a semantic space, the content of a video is projected into that space and its information is encoded with vectors.
Feature vector: the term can be understood through its two parts, "feature" and "vector": "feature" means that the vector carries specific semantic information; for example, a vector carrying character information is referred to as a character feature vector, and a vector carrying behavior information is referred to as a behavior feature vector.
State vector: this can be understood as the mathematical tool describing the state of a quantum system in quantum computing. Only after the familiar mathematical representation is converted into the state-vector representation used in quantum computation can subsequent operations be performed according to the rules of quantum computation.
And S102, fusing the visual feature vector and the text feature vector to obtain fusion features.
Specifically, a multi-mode fusion network based on a quantum interference model is utilized to fuse visual features and text features, and it can be understood that according to the obtained visual state vector and text state vector, the interference items corresponding to the visual state vector and the text state vector are obtained by utilizing point-by-point multiplication of the vectors; and taking the visual state vector, the text state vector and the corresponding interference item as modal characteristics, and inputting the modal characteristics into a preset multi-modal fusion network to perform modal fusion to obtain fusion characteristics.
It should be noted that the multi-mode (Multimedia modalities) refers to a plurality of different media elements, such as images, audio, video, text, etc., which can carry a certain size of information. Primarily referred to herein as visual and text modalities.
It should be noted that the visual state vector and the text state vector input to the network are each a modal feature vector. Although the interference term does not belong to any single modality, it is still fed in as a modal feature vector; there is evidence that network inputs containing processed high-level information (here, the interference term) can effectively improve model performance.
For ease of understanding and illustration: three groups (character, behavior and scene) of state vectors in the common space are obtained, and for each group the interference term corresponding to the visual and text state vectors is computed. Each group of visual state vector, text state vector and corresponding interference term is regarded as a set of modal features and input into the multi-modal fusion network to obtain the fusion feature. Note that the three groups are treated identically (using the point-wise multiplication of vectors), one group after another.
In one embodiment, the obtained visual and text features are each entered as one modality, and the video feature after modality fusion is computed by the formula
f_fusion = α·f_vision + β·f_text + γ·(f_vision ⊙ f_text)
wherein f_vision is the visual feature, f_text is the text feature, f_fusion is the video feature after modality fusion, ⊙ denotes the point-wise multiplication of vectors, and α, β and γ are hyper-parameters.
It should be further noted that the interference term fuses information from the two different modalities (visual and text): it is obtained by performing a specified operation on the state vectors of the two modalities and reflects features common to both. As an example, consider a visual representation of a video and a text representation containing the speech of a character in that video; the speech may correspond, in the visual representation, to the opening and closing of the character's mouth, the character's body language, and so on. Integrating the text representation and the visual representation through a certain operation creates a new vector representation that carries text information and visual information at the same time; this vector is called the interference term. It can be understood as, for instance, the mouth-shape information of a character speaking in the video, which not only captures the visual shape but also reflects the character's speech information to a certain extent.
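The following hedged sketch illustrates this interference-style fusion: the interference term is the point-wise product of the visual and text state vectors, and the three weighted terms are mixed by a small fusion network. The weighting scheme, the mixer layers and the hyper-parameter values are assumptions, not the patent's exact architecture.

```python
import torch
import torch.nn as nn

class InterferenceFusion(nn.Module):
    def __init__(self, d=256, alpha=1.0, beta=1.0, gamma=0.5):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        # vision, text and interference inputs are concatenated and mixed into one fused feature
        self.mixer = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, psi_vision, psi_text):       # both: (batch, d) state vectors
        interference = psi_vision * psi_text        # point-wise product term
        weighted = torch.cat(
            [self.alpha * psi_vision, self.beta * psi_text, self.gamma * interference],
            dim=-1,
        )
        return self.mixer(weighted)                 # fused video feature f_fusion
```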
And step 103, inputting the fusion characteristics into a multi-layer perceptron to obtain character, behavior and scene semantic state vectors.
Specifically, the obtained fusion features are input into a multi-layer perceptron to obtain vectors containing video semantic information. The multi-layer perceptron consists of a hidden layer and an output layer, and different semantics correspond to different multi-layer perceptrons, and the mathematical expression is as follows:
h_t^m = ReLU(W_h^m · x_t + b_h^m)
o_t^m = W_o^m · h_t^m + b_o^m
p_t^m = softmax(o_t^m)
L_CE^m = − Σ_k y_{t,k}^m · log p_{t,k}^m
m ∈ {character, action, scene}
t = 1, 2, 3 ...
The label of semantic m at time t is obtained through the model as:
ŷ_t^m = argmax_k p_{t,k}^m
In the above formulas, x_t is the fusion feature vector at time t, h_t^m is the hidden-layer vector of semantic m at time t, o_t^m is the output-layer vector of semantic m at time t, p_t^m is the normalized probability vector of semantic m at time t, W_h^m and b_h^m are respectively the hidden-layer weights and biases of semantic m, W_o^m and b_o^m are respectively the output-layer weights and biases of semantic m, L_CE^m is the cross-entropy loss used to train the multi-layer perceptron, and y_t^m is the one-hot encoding of the sample semantics.
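A minimal sketch of one such per-semantic multi-layer perceptron (the hidden size and class count are illustrative assumptions): one hidden layer, one output layer, softmax probabilities at inference, and a cross-entropy loss over the logits during training.

```python
import torch
import torch.nn as nn

class SemanticMLP(nn.Module):
    """One independent MLP per semantic type m in {character, action, scene}."""
    def __init__(self, d=256, hidden=128, num_classes=50):
        super().__init__()
        self.hidden_layer = nn.Linear(d, hidden)            # W_h^m, b_h^m
        self.output_layer = nn.Linear(hidden, num_classes)  # W_o^m, b_o^m

    def forward(self, x_t):                          # x_t: (batch, d) fused feature at time t
        h_t = torch.relu(self.hidden_layer(x_t))     # hidden-layer vector h_t^m
        return self.output_layer(h_t)                # output-layer logits o_t^m

mlps = {m: SemanticMLP() for m in ("character", "action", "scene")}
criterion = nn.CrossEntropyLoss()  # cross-entropy applied to the logits o_t^m during training
# at inference: p_t = torch.softmax(mlps["action"](x_t), dim=-1); label = p_t.argmax(dim=-1)
```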
It should be noted that, regarding the semantic state vector: there are three kinds of semantics, namely characters, behaviors and scenes, which are information formed by high-level abstraction over the basic modal information, while state vectors are their mathematical representation in quantum computation.
And step S104, carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and taking the vector as a multi-mode composite semantic video retrieval result.
Specifically, tensor product operation is performed on character, behavior and scene semantic state vectors two by two to obtain three combined semantic state vectors, multiple quantum observation is performed on the three combined semantic state vectors to obtain probability vectors, and maximum pooling layer processing is performed on the three probability vectors to obtain vectors reflecting semantic scores to serve as multi-mode composite semantic video retrieval results.
It can be understood that the three obtained semantic state vectors are combined two by two with the tensor product operation to obtain three new combined semantic (character-behavior, character-scene, behavior-scene) state vectors; the three combined semantic state vectors are then subjected to quantum observation multiple times to obtain three probability vectors, the probability vectors are arranged as a matrix, and the vector that finally reflects the semantic scores is obtained through a row-wise max pooling layer.
To facilitate understanding and illustration, semantic fusion is performed on the obtained state vectors of the different semantics (the character, behavior and scene semantic state vectors). First, the three semantic state vectors are combined two by two with the tensor product operation defined in quantum computation, giving three new combined semantic state vectors. Multiple quantum observation operations are then carried out on the three combined semantic state vectors: basis vectors |φ_k⟩ are encoded according to the finally possible semantic combinations, the probability of each combined semantic state vector collapsing to each basis vector is obtained, and the probabilities are arranged as a k × 3 matrix; the matrix is then passed through a row-wise max pooling layer to obtain the vector that finally reflects the semantic scores.
According to the formula:
|ψ_mn⟩ = |ψ_m⟩ ⊗ |ψ_n⟩
P_mn(|φ_k⟩) = |⟨φ_k | ψ_mn⟩|²
W(|φ_k⟩) = max{ P_mn(|φ_k⟩) }
the final semantic score is calculated, wherein |ψ_m⟩ and |ψ_n⟩ respectively represent the m-th class and the n-th class semantic state vectors (m ≠ n), |ψ_mn⟩ represents the combination of the m-th class and the n-th class semantic state vectors, ⊗ represents the tensor product operation, P_mn(|φ_k⟩) represents the probability that the combined state vector |ψ_mn⟩ collapses to the basis vector |φ_k⟩, and W(|φ_k⟩) represents the semantic score of the corresponding basis state.
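A hedged sketch of this composite-semantic fusion step: pairwise tensor products of the three semantic state vectors, Born-rule collapse probabilities against a set of encoded basis vectors, and a row-wise max pooling that yields the final score vector. The basis matrix `phi` (k basis vectors) is an assumed input.

```python
import torch

def composite_semantic_scores(psi, phi):
    """psi: dict of three (d,) unit state vectors keyed by semantic type;
    phi: (k, d*d) matrix whose rows are encoded unit basis vectors |phi_k>."""
    pairs = [("character", "action"), ("character", "scene"), ("action", "scene")]
    probs = []
    for m, n in pairs:
        psi_mn = torch.kron(psi[m], psi[n])      # tensor product -> (d*d,) combined state
        psi_mn = psi_mn / psi_mn.norm()          # unit norm (already 1 if inputs are unit vectors)
        p = (phi @ psi_mn).abs() ** 2            # P_mn(|phi_k>) = |<phi_k|psi_mn>|^2
        probs.append(p)
    prob_matrix = torch.stack(probs, dim=1)      # (k, 3) matrix of collapse probabilities
    return prob_matrix.max(dim=1).values         # W(|phi_k>): row-wise max pooling
```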
It should be noted that the score vector represents the semantic information of each video. When searching for a video, the user inputs query sentences that contain semantic information; the user query is projected into the constructed semantic space, and an initial ranking result is obtained by calculating the "distance" between the user query vector and each video score vector. The main contribution here is that the video representation vector contains more information and is interpretable, which improves the final ranking result. In other words, video content is represented by modeling a richer semantic space, the user's query is projected into that space, and the video the user needs is accurately retrieved by calculating the distance between the query and the individual video vectors.
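As a hedged illustration of the retrieval step itself (the patent only states that the query is projected into the same semantic space; the encoding path and the similarity measure below are assumptions), ranking can be a simple nearest-neighbour search between the projected query vector and the per-video score vectors:

```python
import torch
import torch.nn.functional as F

def rank_videos(query_vec, video_score_vectors, top_k=10):
    """query_vec: (k,) query projected into the semantic score space;
    video_score_vectors: (num_videos, k) stacked W(|phi_k>) score vectors."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), video_score_vectors, dim=-1)
    k = min(top_k, video_score_vectors.size(0))
    return torch.topk(sims, k=k).indices   # indices of the best-matching videos
```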
It will be appreciated that, compared with the prior art, the advantages of the present invention include: a multi-modal fusion method is introduced for the composite-semantic video retrieval problem; and, based on the theory of quantum computation, both the multi-modal fusion and the composite semantic fusion are improved, which effectively raises the utilization rate of information and further improves the retrieval accuracy.
Referring to fig. 2, fig. 2 is a schematic diagram of a video retrieval device with composite semantics provided by the present invention, and as shown in fig. 2, the device includes:
extraction unit 201: the method is used for extracting visual feature vectors of videos and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes.
Fusion unit 202: used for fusing the visual feature vectors and the text feature vectors to obtain fusion features.
Processing unit 203: used for inputting the fusion features into the multi-layer perceptron to obtain the character, behavior and scene semantic state vectors.
Calculation unit 204: used for carrying out semantic fusion on the character, behavior and scene semantic state vectors to obtain a vector reflecting the semantic scores, which is used as the multi-modal composite semantic video retrieval result.
Further, in one possible implementation manner, the calculating unit 204 is further configured to perform tensor product operation on the person, the behavior and the scene semantic state vectors two by two to obtain three combined semantic state vectors;
quantum observation is carried out on the three combined semantic state vectors to obtain probability vectors;
and carrying out maximum pooling layer processing on the three probability vectors to obtain vectors reflecting the semantic scores, and taking the vectors as multi-mode composite semantic video retrieval results.
Further, in a possible implementation manner, the calculating unit 204 is further configured to:
according to the formulas
|ψ_mn⟩ = |ψ_m⟩ ⊗ |ψ_n⟩,  P_mn(|φ_k⟩) = |⟨φ_k | ψ_mn⟩|²,  W(|φ_k⟩) = max{ P_mn(|φ_k⟩) },
calculate the final semantic score, wherein |ψ_m⟩ and |ψ_n⟩ respectively represent the m-th class and the n-th class semantic state vectors, |ψ_mn⟩ represents the combination of the m-th class and the n-th class semantic state vectors, ⊗ represents the tensor product operation, P_mn(|φ_k⟩) represents the probability that the combined state vector |ψ_mn⟩ collapses to the basis vector |φ_k⟩, and W(|φ_k⟩) represents the semantic score of the corresponding basis state.
Further, in one possible implementation manner, the fusion unit 202 is further configured to obtain, according to the obtained visual state vector and the text state vector, an interference term corresponding to the visual state vector and the text state vector by using point-to-point multiplication of the vectors;
and taking the visual state vector, the text state vector and the corresponding interference item as modal characteristics, and inputting the modal characteristics into a preset multi-modal fusion network to perform modal fusion to obtain fusion characteristics.
Further, in a possible implementation manner, the calculating unit 204 is further configured to compute the video feature after modality fusion according to the formula
f_fusion = α·f_vision + β·f_text + γ·(f_vision ⊙ f_text)
wherein f_vision is the visual feature, f_text is the text feature, f_fusion is the video feature after modality fusion, ⊙ denotes the point-wise multiplication of vectors, and α, β and γ are hyper-parameters.
Further, in a possible implementation manner, the extracting unit 201 is further configured to perform preprocessing on the video, where the preprocessing includes: scene segmentation, shot segmentation and key frame extraction;
and extracting the character feature vector, the behavior feature vector and the scene feature vector of the vision in the processed video by using the C3D model, and extracting the character feature vector, the behavior feature vector and the scene feature vector of the text at the corresponding moment by using the BERT model, wherein the text comprises a script and a line.
Further, in a possible implementation manner, the extracting unit 201 further includes a converting unit, configured to map the visual feature vector and the text feature vector at the corresponding moment to the corresponding d-dimensional Hilbert semantic space by using a convolutional neural network, and to convert them into a visual state vector and a text state vector in a common space.
Referring to fig. 3, fig. 3 is a schematic diagram of a composite semantic fusion process provided by the present invention, as shown in fig. 3:
firstly, preprocessing a video to finish scene segmentation, shot segmentation and key frame extraction, extracting visual characters, behaviors and scene feature vectors by using a C3D model on the obtained key frame, and extracting characters, behaviors and scene feature vectors of texts (including scripts, lines and the like) at corresponding moments by using a BERT model.
The obtained visual and text feature vectors are first mapped into the corresponding Hilbert semantic space by using a CNN, giving three groups (character, behavior and scene) of state vectors in a common space. For each group, the interference term corresponding to the visual and text state vectors is obtained through point-wise multiplication of the vectors; the visual state vector, the text state vector and the corresponding interference term are each regarded as one modal feature and input into the multi-modal fusion network to obtain the fusion feature. It should be noted that the three groups are processed identically, one group after another. The obtained fusion features are then input into the multi-layer perceptrons to obtain three vectors (character, behavior and scene) containing video semantic information, namely the character semantic state vector, the behavior semantic state vector and the scene semantic state vector. Tensor product operations are performed two by two on the character, behavior and scene semantic state vectors to obtain three new combined semantic (character-behavior, character-scene and behavior-scene) state vectors. Multiple quantum observation operations are performed on the three combined semantic state vectors: basis vectors |φ_k⟩ are encoded according to the finally possible semantic combinations, the probability of each combined semantic state vector collapsing to each basis vector is obtained and arranged as a k × 3 matrix, and the matrix is then passed through a row-wise max pooling layer to obtain the vector that finally reflects the semantic scores.
It can be understood that preprocessing is performed on the video, visual features and text features of corresponding segments are extracted, a final score vector is obtained by using a multi-mode fusion model based on quantum interference and a composite semantic fusion model based on a quantum multi-body system, and the final score vector is applied to retrieval, so that retrieval accuracy is improved.
An electronic device 400 according to such an embodiment of the invention is described below with reference to fig. 4. The electronic device 400 shown in fig. 4 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 4, the electronic device 400 is embodied in the form of a general purpose computing device. The components of electronic device 400 may include, but are not limited to: the at least one processing unit 410, the at least one memory unit 420, and a bus 430 connecting the various system components, including the memory unit 420 and the processing unit 410.
Wherein the storage unit stores program code that is executable by the processing unit 410 such that the processing unit 410 performs steps according to various exemplary embodiments of the present invention described in the above-described "example methods" section of the present specification.
The storage unit 420 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 421 and/or cache memory 422, and may further include Read Only Memory (ROM) 423.
The storage unit 420 may also include a program/utility 424 having a set (at least one) of program modules 425, such program modules 425 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 430 may be a local bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or using any of a variety of bus architectures.
The electronic device 400 may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 400, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 400 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 450. Also, electronic device 400 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 460. As shown, the network adapter 460 communicates with other modules of the electronic device 400 over the bus 430. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 400, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
Referring to fig. 5, a program product 500 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
In summary, the method, device, equipment and storage medium for video retrieval of composite semantics provided by the application, wherein the method comprises the following steps: extracting visual feature vectors of videos and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes; fusing the visual feature vector and the text feature vector to obtain a fusion feature; inputting the fusion features into a multi-layer perceptron to obtain character, behavior and scene semantic state vectors; and carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and taking the vector as a multi-mode composite semantic video retrieval result. The information utilization rate is effectively improved by introducing the multi-mode fusion method, and the retrieval accuracy is improved.
The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method for video retrieval of compound semantics, comprising:
extracting visual feature vectors of videos and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes;
fusing the visual feature vector and the text feature vector to obtain a fusion feature;
inputting the fusion features into a multi-layer perceptron to obtain character, behavior and scene semantic state vectors;
and carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and taking the vector as a multi-mode composite semantic video retrieval result.
2. The method of claim 1, wherein semantically fusing the character, behavior, and scene semantic state vectors to obtain a vector reflecting a semantic score as a multi-modal composite semantic video retrieval result, comprising:
tensor product operation is carried out on the character, behavior and scene semantic state vectors two by two, so that three combined semantic state vectors are obtained;
quantum observation is carried out on the three combined semantic state vectors to obtain probability vectors;
and carrying out maximum pooling layer processing on the three probability vectors to obtain vectors reflecting the semantic scores, and taking the vectors as multi-mode composite semantic video retrieval results.
3. The method according to claim 2, wherein:
according to the formula:
|ψ_mn⟩ = |ψ_m⟩ ⊗ |ψ_n⟩
P_mn(|φ_k⟩) = |⟨φ_k | ψ_mn⟩|²
W(|φ_k⟩) = max{ P_mn(|φ_k⟩) }
a final semantic score is calculated, wherein |ψ_m⟩ and |ψ_n⟩ respectively represent the m-th class and the n-th class semantic state vectors, |ψ_mn⟩ represents the combination of the m-th class and the n-th class semantic state vectors, ⊗ represents the tensor product operation, P_mn(|φ_k⟩) represents the probability that the combined state vector |ψ_mn⟩ collapses to the basis vector |φ_k⟩, and W(|φ_k⟩) represents the semantic score of the corresponding basis state.
4. The method of claim 1, wherein fusing the visual feature vector and the text feature vector to obtain a fused feature comprises:
according to the obtained visual state vector and text state vector, utilizing point-by-point multiplication of the vectors to obtain interference items corresponding to the visual state vector and the text state vector;
and taking the visual state vector, the text state vector and the corresponding interference item as modal characteristics, and inputting the modal characteristics into a preset multi-modal fusion network to perform modal fusion to obtain fusion characteristics.
5. The method according to claim 4, wherein:
according to the formula
f_fusion = α·f_vision + β·f_text + γ·(f_vision ⊙ f_text)
the video feature after modality fusion is computed, wherein f_vision is the visual feature, f_text is the text feature, f_fusion is the video feature after modality fusion, ⊙ denotes the point-wise multiplication of vectors, and α, β and γ are hyper-parameters.
6. The method according to claim 1, wherein extracting the visual feature vector of the video and the text feature vector at the corresponding time, comprises:
preprocessing video, the preprocessing comprising: scene segmentation, shot segmentation and key frame extraction;
and extracting the character feature vector, the behavior feature vector and the scene feature vector of the vision in the processed video by using the C3D model, and extracting the character feature vector, the behavior feature vector and the scene feature vector of the text at the corresponding moment by using the BERT model, wherein the text comprises a script and a line.
7. The method according to claim 1, wherein the extracting the visual feature vector of the video and the text feature vector at the corresponding time comprises:
and mapping the visual feature vector and the text feature vector at the corresponding moment to the corresponding d-dimensional Hilbert semantic space by using a convolutional neural network, and converting the visual feature vector and the text feature vector into a visual state vector and a text state vector in a common space.
8. A video retrieval device of compound semantics, comprising:
the extraction unit is used for extracting visual feature vectors of the video and text feature vectors at corresponding moments, wherein the features comprise characters, behaviors and scenes;
the fusion unit is used for fusing the visual feature vector and the text feature vector to obtain fusion features;
the processing unit is used for inputting the fusion characteristics into the multi-layer perceptron to obtain character, behavior and scene semantic state vectors;
the computing unit is used for carrying out semantic fusion on the character, the behavior and the scene semantic state vector to obtain a vector reflecting the semantic score, and the vector is used as a multi-mode composite semantic video retrieval result.
9. An electronic device, the electronic device comprising:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1 to 7.
10. A computer readable storage medium, characterized in that it stores computer program instructions, which when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
CN202310325087.9A 2023-03-29 2023-03-29 Video retrieval method, device and equipment of composite semantics and storage medium Pending CN116361511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310325087.9A CN116361511A (en) 2023-03-29 2023-03-29 Video retrieval method, device and equipment of composite semantics and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310325087.9A CN116361511A (en) 2023-03-29 2023-03-29 Video retrieval method, device and equipment of composite semantics and storage medium

Publications (1)

Publication Number Publication Date
CN116361511A true CN116361511A (en) 2023-06-30

Family

ID=86918825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310325087.9A Pending CN116361511A (en) 2023-03-29 2023-03-29 Video retrieval method, device and equipment of composite semantics and storage medium

Country Status (1)

Country Link
CN (1) CN116361511A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination